 # Advanced Regression Techniques with Examples

In this section, we will consider sine regression, one-predictor logistic regression, and one-predictor Poisson regression. First, consider data that has an oscillating component.

Nonlinear Regression

Example 6.4. Model Shipping by Month.

Management is asking for a model that explains the behavior of tons of material shipped over time so that predictions might be made concerning future allocation of resources. Table 6.4 shows logistical supply train information collected over 20 months.

TABLE 6.4: Total Shipping Weight vs. Month

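| Month | Shipped (tons) | Month | Shipped (tons) |
|---|---|---|---|
| 1 | 20 | 11 | 19 |
| 2 | 15 | 12 | 25 |
| 3 | 10 | 13 | 32 |
| 4 | 18 | 14 | 26 |
| 5 | 28 | 15 | 21 |
| 6 | 18 | 16 | 29 |
| 7 | 13 | 17 | 35 |
| 8 | 21 | 18 | 28 |
| 9 | 28 | 19 | 22 |
| 10 | 22 | 20 | 32 |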

First, we find the correlation coefficient. According to our rules of thumb, 0.67 is a moderate to strong value for linear correlation. So should the model be linear? Plot the data, looking for trends and patterns. Figure 6.4a shows the data as a scatterplot, while Figure 6.4b “connects the dots.”

FIGURE 6.4: Shipping Data Graphs
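The correlation coefficient quoted above can be checked directly from Table 6.4. The following is a minimal sketch in plain Python rather than Maple; the function name `pearson_r` is our own choice for illustration.

```python
import math

# Table 6.4: tons shipped in months 1 through 20
months = list(range(1, 21))
tons = [20, 15, 10, 18, 28, 18, 13, 21, 28, 22,
        19, 25, 32, 26, 21, 29, 35, 28, 22, 32]

def pearson_r(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(months, tons)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")   # r = 0.67, r^2 = 0.45
```

For a one-predictor linear model, the square of this coefficient equals the coefficient of determination of the least-squares line.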

Although linear regression can be used here, it will not capture the seasonal trends. There appears to be an oscillating pattern with a linear upward trend. For purposes of comparison, find a linear model. As we expected, the $R^2$ value of 0.45 does not indicate a strong fit of the data.

Since we need to represent oscillations with a slight linear upward trend, we’ll try a sine model with a linear component, $y = a_0 + a_1 x + a_2 \sin(a_3 x + a_4)$. As noted before, good estimates of the parameters $a_i$ are necessary for obtaining a good fit. Check Maple’s fit with default parameter values! Use the linear fit from above for estimating $a_0$ and $a_1$; use your knowledge of trigonometry to estimate the other parameters. Use NonlinearRegression from the PSMv2 package. The coefficients’ p-values look very good, save that of the phase shift $a_4$.

Plot the model with the data. This model captures the oscillations and upward trend nicely. The sum of squared errors is only SSE = 21.8, quite a bit smaller than that of the linear model. Clearly the model based on sine+linear regression does a much better job in predicting the trends than a simple linear regression.
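One way to obtain starting values for the nonlinear fit is sketched below in plain Python. It assumes the sine-plus-linear model is parameterized as $y = a_0 + a_1 x + a_2 \sin(a_3 x + a_4)$, with $a_4$ the phase shift mentioned in the text. The rules used for the amplitude, period, and phase guesses are our own illustrative choices, not the PSMv2 procedure.

```python
import math

# Table 6.4: tons shipped in months 1 through 20
months = list(range(1, 21))
tons = [20, 15, 10, 18, 28, 18, 13, 21, 28, 22,
        19, 25, 32, 26, 21, 29, 35, 28, 22, 32]

# Least-squares line: starting values for a0 (intercept) and a1 (slope)
n = len(months)
x_bar = sum(months) / n
y_bar = sum(tons) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, tons))
sxx = sum((x - x_bar) ** 2 for x in months)
a1 = sxy / sxx
a0 = y_bar - a1 * x_bar

# Detrend the data and record the linear model's sum of squared errors
residuals = [y - (a0 + a1 * x) for x, y in zip(months, tons)]
sse_linear = sum(e * e for e in residuals)

# Amplitude guess: half the peak-to-trough spread of the detrended data
a2 = (max(residuals) - min(residuals)) / 2

# Period guess: average spacing between interior local maxima of the raw data
peaks = [months[i] for i in range(1, n - 1)
         if tons[i - 1] < tons[i] > tons[i + 1]]
period = (peaks[-1] - peaks[0]) / (len(peaks) - 1)
a3 = 2 * math.pi / period

a4 = 0.0  # phase shift: no obvious estimate, so start at zero (assumption)

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, a2 = {a2:.2f}, a3 = {a3:.3f}")
print(f"peaks at months {peaks}, linear SSE = {sse_linear:.1f}")
```

The peaks fall at months 5, 9, 13, and 17, which suggests a four-month cycle. The linear model's SSE of about 464 is the baseline against which the nonlinear fit's SSE is compared.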

Example 6.5. Modeling Casualties in Afghanistan.

In a January 2010 news report, General Barry McCaffrey, USA, Retired, stated that the situation in Afghanistan would be getting much worse. General McCaffrey claimed casualties would double over the next year. The problem is to analyze the data to determine whether it supports his assertion.

The data that Gen. McCaffrey used for his analysis was the available 2001-2009 figures shown in Table 6.5. The table also shows casualties for 2010 and part of 2011 that were not available at the time.

TABLE 6.5: Casualties in Afghanistan by Month

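| Month | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 12 | 10 | 25 | 6 | 7 | 21 | 19 | 83 | 199 | 308 |
| 2 | | 13 | 9 | 17 | 5 | 17 | 39 | 18 | 52 | 247 | 245 |
| 3 | | 53 | 14 | 12 | 16 | 7 | 26 | 53 | 78 | 346 | 345 |
| 4 | | 8 | 13 | 11 | 29 | 13 | 61 | 37 | 60 | 307 | 411 |
| 5 | | 2 | 8 | 31 | 34 | 39 | 87 | 117 | 156 | 443 | |
| 6 | | 3 | 4 | 34 | 60 | 68 | 100 | 167 | 213 | 583 | |
| 7 | | 6 | 10 | 25 | 38 | 59 | 100 | 151 | 394 | 667 | |
| 8 | | 2 | 13 | 22 | 72 | 56 | 103 | 167 | 493 | 631 | |
| 9 | | 5 | 19 | 34 | 47 | 70 | 88 | 122 | 390 | 674 | |
| 10 | 5 | 8 | 5 | 38 | 27 | 68 | 131 | 90 | 348 | 631 | |
| 11 | 10 | 4 | 27 | 18 | 12 | 51 | 75 | 37 | 214 | 605 | |
| 12 | 28 | 6 | 12 | 9 | 20 | 23 | 46 | 50 | 168 | 359 | |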

First, do a quick “reasonability model.” Sum the numbers across the years that Gen. McCaffrey had data for (Table 6.6) and graph a scatterplot.

TABLE 6.6: Casualties in Afghanistan by Year

| Year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|---|---|---|
| Casualties | 122 | 144 | 276 | 366 | 478 | 877 | 1028 | 2649 |

The scatterplot’s shape suggests that we use a parabola as our “reasonability model.” The model’s prediction for 2010, while much smaller than the actual 2010 value, is not a doubling of the 2009 total. However, the model does suggest that further analysis is required.
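The parabola fit can be sketched with ordinary least squares on the yearly totals of Table 6.6. The code below is plain Python, not the Maple session of the text; coding 2002 as $x = 1$ and the helper name `solve` are our own assumptions.

```python
# Table 6.6: yearly casualty totals, 2002 through 2009
years = list(range(2002, 2010))
totals = [122, 144, 276, 366, 478, 877, 1028, 2649]
xs = [yr - 2001 for yr in years]   # x = 1 is 2002, x = 8 is 2009

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

# Normal equations for the parabola y = c0 + c1 x + c2 x^2
A = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
b = [sum(y * x ** i for x, y in zip(xs, totals)) for i in range(3)]
c0, c1, c2 = solve(A, b)

pred_2010 = c0 + c1 * 9 + c2 * 81   # x = 9 corresponds to 2010
print(f"predicted 2010 total: {pred_2010:.0f}")
print(f"ratio to 2009 total:  {pred_2010 / totals[-1]:.2f}")
```

Under these assumptions the parabola predicts roughly 3,180 casualties for 2010, about 1.2 times the 2009 total, which agrees with the statement that the reasonability model does not show a doubling.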

We will focus on the four years before 2010, that is, 2006 to 2009, and ask: based on those casualty figures, do we expect the casualties in Afghanistan to double over the next year, 2010? In the same fashion as before, plot both a scatterplot and a line plot of the data available to Gen. McCaffrey over that period. See Figure 6.5. The line plot may better show trends in the data, such as an upward tendency or oscillations that are not apparent in the scatterplot. However, a line plot can be very difficult to read or interpret when a large number of data points are connected. Good graphing is always a balancing act. After modeling the data from 2006 to 2009, we can use the 2010 values to test our model for goodness of prediction.

There are two trends apparent from the graphs. First, the data oscillates seasonally. This time, however, the oscillations grow in magnitude. We will try to capture that with an $x \cdot \sin(\cdot)$ term. Second, the data appears to have an overall upward trend. We will attempt to capture that feature with a linear component. The nonlinear model we choose is therefore a growing-amplitude sine plus a linear trend, fit using the techniques described in the previous example. We estimate the starting parameter values from the scatterplot.

FIGURE 6.5: Afghanistan Casualties Graphs

Use our NonlinearRegression program. The p-values for all the parameters, except the constant term, are quite good. Plot a graph to see the model capturing the oscillations and linear growth fairly well. Does the model also show the increase in amplitude? Considering the residuals will be our next diagnostic. Now graph the residuals, looking for patterns and warning signs. The residual plot shows no clear pattern, suggesting the model appears to be adequate, although we note that the model did not “keep up” with the change in amplitude of the oscillations.

What does the model predict for 2010 in relation to 2009? This model does not show a doubling effect from year four. Thus, the model does not support General McCaffrey’s hypothesis.

Consider the ratios of casualties for each month of 2009 to 2008 and then 2010 to 2009. How would this information affect your conclusions?
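The ratios asked for above can be tabulated directly from the 2008, 2009, and 2010 columns of Table 6.5. This is a short plain-Python sketch; the variable names are our own, and the interpretation is left to the reader.

```python
# Monthly casualties from Table 6.5 (January through December)
c2008 = [19, 18, 53, 37, 117, 167, 151, 167, 122, 90, 37, 50]
c2009 = [83, 52, 78, 60, 156, 213, 394, 493, 390, 348, 214, 168]
c2010 = [199, 247, 346, 307, 443, 583, 667, 631, 674, 631, 605, 359]

# Month-by-month ratios of one year to the previous year
ratio_09_08 = [later / earlier for earlier, later in zip(c2008, c2009)]
ratio_10_09 = [later / earlier for earlier, later in zip(c2009, c2010)]

for m, (r1, r2) in enumerate(zip(ratio_09_08, ratio_10_09), start=1):
    print(f"month {m:2d}: 2009/2008 = {r1:5.2f}   2010/2009 = {r2:5.2f}")

# Ratios of the annual totals
annual_09_08 = sum(c2009) / sum(c2008)
annual_10_09 = sum(c2010) / sum(c2009)
print(f"annual: 2009/2008 = {annual_09_08:.2f}, 2010/2009 = {annual_10_09:.2f}")
```

The column sums for 2008 and 2009 reproduce the totals in Table 6.6 (1028 and 2649), which is a useful check that the monthly data was entered correctly.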

Logistic Regression and Poisson Regression

Often the dependent variable has special characteristics. Here we examine two notable cases: (a) logistic regression, also known as a logit model, where the dependent variable is binary, and (b) Poisson regression where the dependent variable measures integer counts that follow a Poisson distribution.

One-Predictor Logistic Regression

We begin with three one-predictor logistic regression model examples in which the dependent variable is binary, i.e., {0,1}. The logistic regression model that we will use is based on the logistic function, which approximates a unit step function and gives logistic regression its name. The most general form handles dependent variables with a finite number of states.

Example 6.6. Damages versus Flight Time.

After a number of hours of flight time, equipment is either damaged or not. Let the dependent variable $y$ be a binary variable recording whether the equipment is damaged, and let $t$ be the flight time in hours.

Over a reporting period, the data of Table 6.7 has been collected.

TABLE 6.7: Damage vs. Flight Time

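| t | 4 | 2 | 4 | 3 | 9 | 6 | 2 | 11 | 6 | 7 | 3 | 2 | 5 | 3 | 3 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |

| t | 10 | 5 | 13 | 7 | 3 | 4 | 2 | 3 | 2 | 5 | 6 | 6 | 3 | 4 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |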

Calculate a logistic regression for damage. Now, the fit. The analyst must decide over what intervals of $t$ we call the $y$ probability a 1 or a 0 using the logistic S-curve from the fit.
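As a cross-check on a fit of this kind, the sketch below estimates a logistic curve for Table 6.7 in plain Python. Two assumptions are ours, not the text's: the model is written as $P(y = 1) = 1/(1 + e^{-(a + bt)})$, and the parameters are found by maximum likelihood with Newton's method rather than by a Maple nonlinear least-squares fit, so the numbers need not match a Maple result exactly.

```python
import math

# Table 6.7: flight time t (hours) and the binary outcome y as coded in the table
t = [4, 2, 4, 3, 9, 6, 2, 11, 6, 7, 3, 2, 5, 3, 3, 8,
     10, 5, 13, 7, 3, 4, 2, 3, 2, 5, 6, 6, 3, 4, 10]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
     0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0]

def logistic(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(x, y, iterations=25):
    """Maximum-likelihood fit of P(y=1) = logistic(a + b x) by Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [logistic(a + b * xi) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian: a symmetric 2 x 2 matrix with weights p(1 - p)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve (negative Hessian) * delta = gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a, b = fit_logistic(t, y)
threshold = -a / b   # flight time at which the fitted probability equals 0.5
print(f"a = {a:.3f}, b = {b:.3f}, P = 0.5 at t = {threshold:.2f} hours")
```

The crossover time `threshold` is one natural cut point for the decision described above: on one side of it the fitted probability exceeds 0.5 and the outcome would be called a 1, on the other side a 0. The analyst may prefer a different cutoff.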

We switch from times to time differentials in the next example.

Example 6.7. Damages vs. Time Differentials.

Replace the times in the previous example with time differentials given in Table 6.8.

TABLE 6.8: Damage vs. Time Differentials (TD)

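| TD | 19.2 | 24.1 | -7.1 | 3.9 | 4.5 | 10.6 | -3 | 16.2 |
|---|---|---|---|---|---|---|---|---|
| y | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |

| TD | 72.8 | 28.7 | 11.5 | 56.3 | -0.5 | -1.3 | 12.9 | 34.1 |
|---|---|---|---|---|---|---|---|---|
| y | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

| TD | 6.6 | -2.5 | 24.2 | 2.3 | 36.9 | -11.7 | 2.1 | 10.4 |
|---|---|---|---|---|---|---|---|---|
| y | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |

| TD | 9.1 | 2 | 12.6 | 18 | 1.5 | 27.3 | -8.4 |
|---|---|---|---|---|---|---|---|
| y | 0 | 0 | 0 | 1 | 0 | 1 | 0 |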

Repeat the procedure of the previous example. Once again, the analyst must decide over what intervals of the time differential we call the $y$ probability a 1 or a 0 using the logistic S-curve from the fit.

Dehumanization is not a new phenomenon in human conflict. Societies have dehumanized their adversaries since the beginnings of civilization in order to allow them to seize, coerce, maim, or ultimately to kill while avoiding the pain of conscience for committing these extreme, violent actions. By taking away the human traits of these opponents, adversaries are made to be objects deserving of wrath and meriting the violence as justice. Dehumanization still occurs today in both developed and underdeveloped societies. The next example analyzes the impact that dehumanization has in its various forms on the outcome of a state’s ability to win a conflict.

Example 6.8. Conflict and Dehumanization.

To examine dehumanization as a quantitative statistic, we combine a data set of 25 conflicts from Erik Melander, Magnus Oberg, and Jonathan Hall’s “Uppsala Peace and Conflict,” (Table 1, pg. 25) with Joakim Kreutz’s “How and When Armed Conflicts End: Introducing the UCDP Conflict Termination Dataset” to have a designated binary “win-lose” assessment for each conflict. We will use civilian casualties as a proxy indicator of the degree of dehumanization during the conflict. The conflicts in Table 6.9 run the gamut from high- to low-intensity in the spectrum, and include both inter- and intra-state hostilities. Therefore, the data is a reasonably general representation.

TABLE 6.9: Top 25 Worst Conflicts Estimated by War-Related Deaths

| Year | Side A | Side B | Side A: Win = 1, Lose = 0 | Civilian (1,000s) | Military (1,000s) | Percentage Civilian Deaths |
|---|---|---|---|---|---|---|
| 1946-48 | India | CPI | 1 | 800 | 0 | 100.0 |
| 1949-62 | Colombia | Mil. Junta | 1 | 200 | 100 | 66.67 |
| 1950-51 | China | Taiwan | 1 | 1,000 | * | 100.0 |
| 1950-53 | Korea | South Korea | 0 | 1,000 | 1,889 | 34.60 |
| 1954-62 | Algeria/France | FLN | 0 | 82 | 18 | 82.00 |
| 1956-59 | China | Tibet | 1 | 60 | 40 | 60.00 |
| 1956-65 | Rwanda/Tutsi | Hutu | 0 | 102 | 3 | 97.14 |
| 1961-70 | Iraq | KDP | 1 | 100 | 5 | 95.24 |
| 1963-72 | Sudan | Anya Nya | 1 | 250 | 250 | 50.00 |
| 1965-66 | Indonesia | OPM | 1 | 500 | * | 100.0 |
| 1965-75 | N. Vietnam | S. Vietnam | 1 | 1,000 | 1,058 | 48.59 |
| 1966-87 | Guatemala | FAR | 1 | 100 | 38 | 72.46 |
| 1967-70 | Nigeria | Rep. Biafra | 1 | 1,000 | 1,000 | 50.00 |
| 1967-70 | Egypt | Israel | 0 | 50 | 25 | 66.67 |
| 1971-71 | Bangladesh | JSS/SB | 1 | 1,000 | 500 | 66.67 |
| 1971-78 | Uganda | Military Fact. | 1 | 300 | 0 | 100.0 |
| 1972-72 | Burundi | Military Fact. | 1 | 80 | 20 | 80.00 |
| 1974-87 | Ethiopia | OLF | 1 | 500 | 46 | 91.58 |
| 1975-90 | Lebanon | LNM | 1 | 76 | 25 | 75.25 |
| 1975-78 | Cambodia | Khmer Rouge | 0 | 1,500 | 500 | 75.00 |
| 1975-87 | Angola | FNLA | 1 | 200 | 13 | 93.90 |
| 1978-87 | Afghanistan | USSR | 1 | 50 | 50 | 50.00 |
| 1979-87 | El Salvador | FMLN | 1 | 50 | 15 | 76.92 |
| 1981-87 | Uganda | Kikosi Maalum | 1 | 100 | 2 | 98.04 |
| 1981-87 | Mozambique | Renamo | 1 | 350 | 51 | 87.28 |

\* denotes missing values.

By including the ratio of civilian casualties to total casualties in Table 6.9, we are able to determine what percentage of casualties in each conflict is civilian. This ratio provides a quantifiable variable to analyze.

Binary logistic regression analysis is the first method to choose to analyze the effect of dehumanization (shown by proxy through higher percentages of civilian casualties) on the outcome of a conflict as a win (1) or a loss (0). This type of regression model will allow us to infer whether or not the independent variable, civilian casualties percentage, has a statistically significant impact on the conflict’s outcome, win or lose. Using the data from Table 6.9, we assign the civilian casualty percentages to be the independent variable and Side A’s win/loss outcome of the conflict to be the binary dependent variable, then develop a binary logistic regression model.

Use Maple to derive the logistic regression statistics from the model as follows. We derive estimates of the parameters from the data. (See, e.g., Bauldry [B1997] for simple methods.) Take $a = -1.9$ and $b = 0.05$ initially. This result does not pass the common sense test. Ask Maple for more information by increasing infolevel. Maple’s NonlinearFit could not optimize the regression. Let’s try our NonlinearRegression. This logistic model result appears much better at first look. However, the coefficients’ p-values tell us to have no confidence in the model. Graph the model with the data!

Analysis Interpretation: The conclusion from our analysis is that the civilian casualty percentages are not significantly correlated with whether the conflict leads to a win or a loss for Side A. Therefore, from this initial study, we can loosely conclude that dehumanization does not have a significant effect on a state’s ability to win or lose a conflict. Further investigation will be necessary.

One-Predictor Poisson Regression

According to Devore [D2012], the simple linear regression model is defined by:

There exist parameters $\beta_0$, $\beta_1$, and $\sigma^2$ such that for any fixed input value of $x$, the dependent variable is a random variable related to $x$ through the model equation $Y = \beta_0 + \beta_1 x + \varepsilon$. The quantity $\varepsilon$ in the model equation is the “error,” a random variable assumed to be normally distributed with mean 0 and variance $\sigma^2$.

We expand this definition to the case in which the response variable $y$ is assumed to have a normal distribution with mean $\mu_y$ and variance $\sigma^2$. We found that the mean could be modeled as a function of our multiple predictor variables, $x_1, x_2, \ldots, x_k$, using the linear function $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$.

The key assumptions for least squares are

• the relationship between dependent and independent variables is linear,
• errors are independent and normally distributed, and
• the errors are homoscedastic.

If any assumption is not satisfied, the model’s adequacy is questioned. In first courses, patterns seen or not seen in residual plots are used to gain information about a model’s adequacy. (See [AA1979], [D2012]).

Normality Assumption Lost

In logistic and Poisson regression, the response variable’s probability lies between 0 and 1. According to Neter [NKNW1996], under this constraint both the normality and the constant variance assumptions listed above are lost. Without these assumptions, the F and t tests cannot be used for analyzing the regression model. When this happens, transform the model and the data with a logistic transformation of the probability $p$, called logit $p$, to map the interval $[0,1]$ to $(-\infty, +\infty)$, eliminating the 0-1 constraint:

$$\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$

The $\beta$s can now be interpreted as increasing or decreasing the “log odds” of an event, and $\exp(\beta)$ (the “odds multiplier”) can be used as the odds ratio for a unit increase or decrease in the associated explanatory variable.
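The odds-multiplier interpretation can be verified numerically. The sketch below uses hypothetical coefficients (`beta0 = -1.0`, `beta1 = 0.5`, chosen only for illustration) and checks that a one-unit increase in the predictor multiplies the odds by $\exp(\beta_1)$.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: maps any real z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only
beta0, beta1 = -1.0, 0.5

x = 2.0
p_x = inv_logit(beta0 + beta1 * x)          # probability at x
p_x1 = inv_logit(beta0 + beta1 * (x + 1))   # probability at x + 1

odds_x = p_x / (1.0 - p_x)
odds_x1 = p_x1 / (1.0 - p_x1)
odds_ratio = odds_x1 / odds_x

print(f"odds ratio for a unit increase: {odds_ratio:.4f}")
print(f"exp(beta1):                     {math.exp(beta1):.4f}")
```

The two printed values agree because the log odds are linear in $x$, so the ratio does not depend on the starting value of $x$.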

When the response variable is in the form of a count, we face yet a different constraint. Counts are all nonnegative integers corresponding to rare events. Thus, a Poisson distribution (rather than a normal distribution) is more appropriate since the Poisson has a mean greater than 0, and the counts are all nonnegative integers. Recall that the Poisson distribution gives the probability of $y$ events occurring in time period $t$ as

$$P(y) = \frac{e^{-\mu t} (\mu t)^y}{y!}, \quad y = 0, 1, 2, \ldots,$$

where $\mu$ is the mean rate of events per unit time. Then the logarithm of the response variable’s mean is linked to a linear function of explanatory variables. Thus

$$\ln(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$

In other words, a Poisson regression model expresses the “log outcome rate” as a linear function of the predictors, sometimes called “exposure variables.”

Assumptions in Poisson Regression

There are several key assumptions in Poisson regression that are different from those in the simple linear regression model. These assumptions include that the logarithm of the dependent variable changes linearly with equal incremental increases in the exposure variable; i.e., the relationship between the logarithm of the dependent variable and the independent variables is linear. For example, if we measure risk in exposure per unit time with one group as counts per month, while another is counts per year, we can convert all exposures to strictly counts. We find that changes in the rate from combined effects of different exposures are multiplicative; i.e., changes in the log of the rate from combined effects of different exposures are additive. We find that for each level of the covariates, the number of cases has variance equal to the mean, making it follow a Poisson distribution. Further, we assume the observations are independent.

Here, too, we use diagnostic methods to identify violations of the assumptions. To determine whether variances are too large or too small, plot residuals versus the mean at different levels of the predictor variables. Recall that in simple linear regression, one diagnostic of the model used plots of residuals against fits (fitted values). We will look for patterns in the residual or deviation plots as our main diagnostic tool for Poisson regression.

Poisson Regression Model

The basic model for Poisson regression is

$$Y_i = E[Y_i] + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$

The $i$th case mean response is denoted by $\mu_i$, where $\mu_i$ can be one of many defined functions (Neter [NKNW1996]). We will only use the form

$$\mu_i = \mu(X_i, B) = \exp(X_i^T B).$$

We assume that the $Y_i$ are independent Poisson random variables with expected value $\mu_i$.

In order to apply regression techniques, we will use the likelihood function $L$ (see [AA1979, D2012]) given by

$$L = \prod_{i=1}^{n} \frac{\mu_i^{Y_i} e^{-\mu_i}}{Y_i!}.$$

Maximizing this function is intrinsically quite difficult. Instead, maximize the logarithm of the likelihood function shown below.

$$\ln(L) = \sum_{i=1}^{n} Y_i \ln(\mu_i) - \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \ln(Y_i!) \qquad (6.4)$$

Numerical techniques are used to maximize $\ln(L)$ to obtain the best estimates for the coefficients of the model. Often, “good” starting points are required to obtain convergence to the maximum ([Fox2012]).

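The deviations or residuals will be used to analyze the model. In Poisson regression, the deviance is given by

$$D = 2 \sum_{i=1}^{n} \left[ Y_i \ln\!\left(\frac{Y_i}{\hat{\mu}_i}\right) - (Y_i - \hat{\mu}_i) \right] \qquad (6.5)$$

where $\hat{\mu}_i$ is the fitted model; whenever $Y_i = 0$, we set $Y_i \cdot \ln(Y_i/\hat{\mu}_i) = 0$.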
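A direct transcription of the Poisson deviance, including the convention for zero counts, is sketched below in plain Python. As an illustration it evaluates the deviance of the constant model $y = \exp(b_0)$ on the Caesarean counts of Table 6.11, using the standard fact that the maximum-likelihood constant is the sample mean, so $b_0 = \ln(\bar{y})$. The function name `poisson_deviance` is our own.

```python
import math

def poisson_deviance(y, mu):
    """Deviance 2 * sum[y ln(y / mu) - (y - mu)], with y ln(y / mu) = 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

# Caesarean counts ("Special") from Table 6.11
caesareans = [26, 24, 21, 21, 21, 20, 19, 18, 18, 17,
              17, 16, 16, 16, 16, 15, 14, 14, 13, 13]

# Constant model: every fitted value equals the sample mean
y_bar = sum(caesareans) / len(caesareans)
b0 = math.log(y_bar)
d_null = poisson_deviance(caesareans, [y_bar] * len(caesareans))

print(f"b0 = ln(mean) = {b0:.4f}, deviance of constant model = {d_null:.2f}")
```

This value plays the role of the total deviance against which a model with predictors is compared.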

Diagnostic testing of the coefficients is carried out in the same fashion as for logistic regression. To estimate the variance-covariance matrix, use the Hessian matrix $H(X)$, the matrix of second partial derivatives of the log-likelihood function $\ln(L)$ of (6.4). Then the approximated variance-covariance matrix is $VC(X, B) = -H(X)^{-1}$ evaluated at $B$, the final estimates of the coefficients. The main diagonal elements of $VC$ are estimates for the variance; the estimated standard deviations $se(b_i)$ are the square roots of the main diagonal elements. Then perform hypothesis tests on the coefficients using t-tests. Two examples using the Hessian follow.

Example 6.9. Hessian-based Modeling.

Consider the model $y_i = \exp(b_0 + b_1 x_i)$ for $i = 1, 2, \ldots, n$.

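Put this model into (6.4) to obtain

$$\ln(L) = \sum_{i=1}^{n} Y_i (b_0 + b_1 x_i) - \sum_{i=1}^{n} \exp(b_0 + b_1 x_i) - \sum_{i=1}^{n} \ln(Y_i!).$$

The Hessian $H = [h_{jk}]$ comes from

$$h_{jk} = \frac{\partial^2 \ln(L)}{\partial b_j \, \partial b_k},$$

which gives the estimate of the variance-covariance matrix $VC = -H^{-1}$, evaluated at the final coefficient estimates. For the two-parameter model ($b_0$ and $b_1$), the Hessian is

$$H = -\begin{bmatrix} \sum \exp(b_0 + b_1 x_i) & \sum x_i \exp(b_0 + b_1 x_i) \\ \sum x_i \exp(b_0 + b_1 x_i) & \sum x_i^2 \exp(b_0 + b_1 x_i) \end{bmatrix}.$$

Change the model slightly, adding a second independent variable with a third parameter. The model becomes $y_i = \exp(b_0 + b_1 x_{1i} + b_2 x_{2i})$ for $i = 1, 2, \ldots, n$.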

Compute the new Hessian and carefully note the similarities. The pattern in the matrix is easily extended to obtain the Hessian for a model with n independent variables.

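Let $y_i = \exp(b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_n x_{ni})$. The general Poisson model Hessian is

$$H = -\begin{bmatrix} \sum y_i & \sum x_{1i} y_i & \sum x_{2i} y_i & \cdots & \sum x_{ni} y_i \\ \sum x_{1i} y_i & \sum x_{1i}^2 y_i & \sum x_{1i} x_{2i} y_i & \cdots & \sum x_{1i} x_{ni} y_i \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sum x_{ni} y_i & \sum x_{ni} x_{1i} y_i & \sum x_{ni} x_{2i} y_i & \cdots & \sum x_{ni}^2 y_i \end{bmatrix},$$

where each sum runs over the data index $i$. Replace the formulas with numerical values from the data. The resulting symmetric square matrix should be non-singular. Compute the inverse of the negative of the Hessian matrix to find the variance-covariance matrix $VC$. The main diagonal entries of $VC$ are the (approximate) variances of the estimated coefficients $b_i$. The square roots of the entries on the main diagonal are the estimates of $se(b_i)$, the standard error for $b_i$, to be used in the hypothesis testing with $t^* = b_i/se(b_i)$.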
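The Hessian-based recipe can be sketched end to end for a one-predictor model $y_i = \exp(b_0 + b_1 x_i)$. The plain-Python code below maximizes the log-likelihood by Newton's method, then inverts the negative Hessian to get the variance-covariance matrix, standard errors, and $t^*$ values. It uses the births and Caesarean counts of Table 6.11. Rescaling births to thousands and starting from the constant model are our own choices for numerical conditioning, and this is not the Maple procedure of the text.

```python
import math

# Table 6.11 data; births are rescaled to thousands (an assumption made here)
births = [3246, 2750, 2507, 2371, 1904, 1501, 1272, 1080, 1027, 970,
          739, 679, 502, 236, 357, 309, 192, 138, 100, 95]
caesareans = [26, 24, 21, 21, 21, 20, 19, 18, 18, 17,
              17, 16, 16, 16, 16, 15, 14, 14, 13, 13]
x = [b / 1000.0 for b in births]

def hessian_sums(b0, b1, x):
    """Fitted values and the three distinct entries of -H for exp(b0 + b1 x)."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    s0 = sum(mu)
    s1 = sum(m * xi for m, xi in zip(mu, x))
    s2 = sum(m * xi * xi for m, xi in zip(mu, x))
    return mu, s0, s1, s2

def fit_poisson(x, y, iterations=25):
    """Maximize ln(L) by Newton's method; return (b0, b1) and VC = (-H)^(-1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the constant model
    for _ in range(iterations):
        mu, s0, s1, s2 = hessian_sums(b0, b1, x)
        g0 = sum(yi - m for yi, m in zip(y, mu))                # d ln(L) / d b0
        g1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))  # d ln(L) / d b1
        det = s0 * s2 - s1 * s1
        # Newton step: solve (-H) * delta = gradient for the 2 x 2 system
        b0 += (s2 * g0 - s1 * g1) / det
        b1 += (s0 * g1 - s1 * g0) / det
    _, s0, s1, s2 = hessian_sums(b0, b1, x)
    det = s0 * s2 - s1 * s1
    vc = [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]
    return (b0, b1), vc

(b0, b1), vc = fit_poisson(x, caesareans)
se = [math.sqrt(vc[0][0]), math.sqrt(vc[1][1])]
t_star = [b0 / se[0], b1 / se[1]]
print(f"b0 = {b0:.4f} (se {se[0]:.4f}), b1 = {b1:.4f} (se {se[1]:.4f})")
print(f"t* = {t_star[0]:.2f}, {t_star[1]:.2f}")
```

Because births are measured in thousands here, `b1` is the change in the log of the expected count per additional thousand births. Dividing it by 1000 gives the per-birth coefficient.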

We now have all the information we need to build the tables for a Poisson regression that are similar to a regression program’s output.

Estimating the Regression Coefficients: Summary

The number of predictor variables plus one (for the constant term) gives the number of coefficients in the model $y_i = \exp(b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_n x_{ni})$.

Estimates of the $b_i$ are the final values from the numerical search method (if it converged) used to maximize the log-likelihood function $\ln(L)$ of (6.4). The values of $se(b_i)$, the standard error estimate for $b_i$, are the square roots of the main diagonal of the variance-covariance matrix $VC = -H(X)^{-1}$. The test statistic is $t^* = b_i/se(b_i)$, and the p-value is the probability $P(T > |t^*|)$. In the summary table of Poisson regression analysis below, let $m$ be the number of variables in the model, and let $k$ be the number of data elements of $y$, the dependent variable. A summary appears in Table 6.10.

TABLE 6.10: Poisson Regression Variables Summary

| | Degrees of Freedom (df) | Deviance | Mean Deviance (MDev) | Ratio |
|---|---|---|---|---|
| Regression | | | | |
| Residual | | $D_{res}$ = result from the full model with $m$ predictors | | |
| Total | | $D_t$ = result from reduced model $y = e^{b_0}$ | | |

Note that a prerequisite for using Poisson regression is that the dependent variable $Y$ must be discrete counts with large numbers being a rare event.

We have chosen two data sets that have published solutions to be our basic examples. First, an outline of the procedure:

Step 0. Enter the data for X and Y.

Step 1. For Y:

• (a) generate a histogram, and
• (b) perform a chi-squared goodness-of-fit test for a Poisson distribution.

If Y follows a Poisson distribution, then continue. If Y is “count data,” use Poisson regression regardless of the chi-squared test.

Step 2. Compute the value of $b_0$ in the constant model $y = \exp(b_0)$ that minimizes (6.5); i.e., minimize two times the deviations.

Step 3. Compute the values of $b_0$ and $b_1$ in the model $y = \exp(b_0 + b_1 x)$ that minimize the deviation (6.5).

Step 4. Interpret the results and the odds ratio.

We’ll step through an example following the outline above.

Example 6.10. Hospital Surgeries.

A group of hospitals has collected data on the numbers of Caesarean surgeries vs. the total number of births (see Table 6.11).

TABLE 6.11: Total Births vs. Caesarean Surgeries

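| Total | 3246 | 2750 | 2507 | 2371 | 1904 | 1501 | 1272 | 1080 | 1027 | 970 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 26 | 24 | 21 | 21 | 21 | 20 | 19 | 18 | 18 | 17 |

| Total | 739 | 679 | 502 | 236 | 357 | 309 | 192 | 138 | 100 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 17 | 16 | 16 | 16 | 16 | 15 | 14 | 14 | 13 | 13 |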

Use the hospitals’ data set to perform a Poisson regression following the steps listed above.

Step 0. Enter the data.

Step 1. Plot a histogram, and then perform a chi-squared goodness-of-fit test on yhc, if appropriate.

(Note: Maple’s Histogram function is in the Statistics package. There are a large number of options for binning the data; we will use frequencyscale = absolute to have the heights of the bars equal to the frequency of entries in the associated bin. Collect the bin counts with TallyInto.)

Now for the chi-squared test. First, generate the predicted values from an estimated Poisson distribution. We are ready to use Maple’s chi-squared test, ChiSquareGoodnessOfFitTest, with a significance level of 0.05. Use the summarize = embed option, as it produces the most readable output. The command is terminated with a colon: “embedding the output” makes it unnecessary to return a result. The chi-squared test indicates that a Poisson distribution is reasonable.

Step 2. Find the best constant model $y = \exp(b_0)$.

Let’s use Maple’s LinearFit on the function $Y = \ln(y) = b_0$.

Step 3. Find the best exponential model $y = \exp(b_0 + b_1 x)$.

Let’s use Maple’s ExponentialFit to find the model.

Step 4. Conclude by calculating the odds-ratio.

Use the odds-multiplier $\exp(\beta_1)$ as the approximate odds-ratio, often called risk-ratio for Poisson regression. OR represents the potential increase resulting from one unit increase in $x$. (How does this concept relate to “opportunity cost” in linear programming and “marginal revenue” in economics?)

Return to the Philippines example relating literacy and violence described in the opening of this chapter.

Example 6.11. Violence in the Philippines.

The number of significant acts of violence, SigActs in Table 6.12, are integer counts.

TABLE 6.12: Literacy Rate (Lit) vs. Significant Acts of Violence (SigActs), Philippines, 2008.

| Province | Lit | SigActs | Province | Lit | SigActs |
|---|---|---|---|---|---|
| Basilan | 71.6 | 29 | Dinagat Islands | 85.7 | 0 |
| Lanao del Sur | 71.6 | 30 | Surigao del Norte | 85.7 | 10 |
| Maguindanao | 71.6 | 122 | Surigao del Sur | 85.7 | 31 |
| Sulu | 71.6 | 26 | Bukidnon | 85.9 | 14 |
| Tawi-Tawi | 71.6 | 1 | Camiguin | 85.9 | 0 |
| Biliran | 72.9 | 0 | Lanao del Norte | 85.9 | 57 |
| Eastern Samar | 72.9 | 11 | Misamis Occidental | 85.9 | 8 |
| Leyte | 72.9 | 2 | Misamis Oriental | 85.9 | 7 |
| Northern Samar | 72.9 | 23 | Batanes | 86.1 | 0 |
| Southern Leyte | 72.9 | 0 | Cagayan | 86.1 | 15 |
| Western Samar | 72.9 | 64 | Isabela | 86.1 | 4 |
| North Cotabato | 78.3 | 125 | Nueva Vizcaya | 86.1 | 3 |
| Sarangani | 78.3 | 23 | Quirino | 86.1 | 0 |
| South Cotabato | 78.3 | 5 | Bohol | 86.6 | 2 |
| Sultan Kudarat | 78.3 | 18 | Cebu | 86.6 | 0 |
| Zamboanga del Norte | 79.6 | 8 | Negros Oriental | 86.6 | 27 |
| Zamboanga del Sur | 79.6 | 10 | Siquijor | 86.6 | 0 |
| Zamboanga Sibugay | 79.6 | 3 | Abra | 89.2 | 11 |
| Albay | 79.9 | 35 | Apayao | 89.2 | 0 |
| Camarines Norte | 79.9 | 12 | Benguet | 89.2 | 0 |
| Camarines Sur | 79.9 | 44 | Ifugao | 89.2 | 0 |
| Catanduanes | 79.9 | 9 | Kalinga | 89.2 | 11 |
| Masbate | 79.9 | 42 | Mountain Province | 89.2 | 0 |
| Sorsogon | 79.9 | 52 | Ilocos Norte | 91.3 | 0 |
| Compostela Valley | 81.7 | 126 | Ilocos Sur | 91.3 | 2 |
| Davao del Norte | 81.7 | 35 | La Union | 91.3 | 0 |
| Davao del Sur | 81.7 | 64 | Pangasinan | 91.3 | 0 |
| Davao Oriental | 81.7 | 40 | Aurora | 92.1 | 10 |
| Aklan | 82.6 | 0 | Bataan | 92.1 | 1 |
| Antique | 82.6 | 1 | Bulacan | 92.1 | 6 |
| Capiz | 82.6 | 8 | Nueva Ecija | 92.1 | 4 |
| Guimaras | 82.6 | 0 | Pampanga | 92.1 | 3 |
| Iloilo | 82.6 | 8 | Tarlac | 92.1 | 4 |
| Negros Occidental | 82.6 | 26 | Zambales | 92.1 | 6 |
| Marinduque | 83.9 | 0 | Batangas | 93.5 | 5 |
| Occidental Mindoro | 83.9 | 5 | Cavite | 93.5 | 0 |
| Oriental Mindoro | 83.9 | 7 | Laguna | 93.5 | 4 |
| Palawan | 83.9 | 2 | Quezon | 93.5 | 28 |
| Romblon | 83.9 | 0 | Rizal | 93.5 | 3 |
| Agusan del Norte | 85.7 | 13 | Metropolitan Manila | 94 | 1 |
| Agusan del Sur | 85.7 | 33 | | | |

The literacy data has been defined as L, the SigActs as V. Examine the histogram in Figure 6.6 to see that the data appears to follow a Poisson distribution. A goodness-of-fit test (left as an exercise) confirms the data follows a Poisson distribution.

FIGURE 6.6: Histogram of SigActs Data

Use Maple to fit the data. First, remove the three outlier data points with values well over 100, as there are other much more significant generators of violence beyond literacy levels in those regions. We cannot use Maple’s ExponentialFit, as it attempts a log-transformation of SigActs which fails due to 0 values.  Plot the fit. We accept that the fit looks pretty good.

The odds multiplier, $e^{b_1}$, for our fit is $e^{-0.055437} \approx 0.946$, which means that for every 1 unit increase in literacy we expect violence to go down by approximately 5.4%. This value suggests improving literacy will help ameliorate the violence.
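The arithmetic behind the odds multiplier is a one-line computation. The sketch below takes the slope to be $b_1 = -0.055437$, the value consistent with the quoted multiplier of 0.946, and converts it to a percentage change per unit of literacy.

```python
import math

b1 = -0.055437   # fitted slope, taken as the value consistent with 0.946

odds_multiplier = math.exp(b1)
percent_change = (odds_multiplier - 1.0) * 100.0

print(f"odds multiplier = {odds_multiplier:.3f}")                  # 0.946
print(f"change per unit of literacy = {percent_change:.1f}%")      # -5.4%
```

A multiplier below 1 means the expected count shrinks as the predictor grows. Over several units the effect compounds multiplicatively, so a 10-unit increase in literacy corresponds to a factor of $e^{10 b_1}$ rather than a 54% drop.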

Poisson Regression with Multiple Predictor Variables in Maple

Often, there are many variables that influence the outcome under study. We’ll add a second predictor to the Hospital Births problem.

Example 6.12. Hospital Births Redux.

Revisit Example 6.10 with an additional predictor: the type of hospital, rural (0) or urban (1). The new data appears in Table 6.13.

TABLE 6.13: Total Births vs. Caesarean Surgeries and Hospital Type

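| Total | 3246 | 2750 | 2507 | 2371 | 1904 | 1501 | 1272 | 1080 | 1027 | 970 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 26 | 24 | 21 | 21 | 21 | 20 | 19 | 18 | 18 | 17 |
| Type | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

| Total | 739 | 679 | 502 | 236 | 357 | 309 | 192 | 138 | 100 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 17 | 16 | 16 | 16 | 16 | 15 | 14 | 14 | 13 | 13 |
| Type | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |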

The data has been entered as B: Total, C: Special, and T: Type. After loading the Statistics package, define the model. Collect the data and use NonlinearFit to fit the model. Finishing the statistical analysis of the model is left as an exercise.

Exercises

1. Adjust the nonlinear model for Afghanistan casualties, Example 6.5, to increase the amplitude of the sine term more quickly. How does the conclusion change, if at all?

2. Investigate the action of parameters in the logistic function by executing the Maple statements below using the Explore command to make an interactive graph.

3. For the data in Table 6.14 (a) plot the data and (b) state the type of regression that should be used to model the data.

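| Number | Hours | Tread (cm) |
|---|---|---|
| 1 | 2 | 5.4 |
| 2 | 5 | 5.0 |
| 3 | 7 | 4.5 |
| 4 | 10 | 3.7 |
| 5 | 14 | 3.5 |
| 6 | 19 | 2.5 |
| 7 | 26 | 2.0 |
| 8 | 31 | 1.6 |
| 9 | 34 | 1.8 |
| 10 | 38 | 1.3 |
| 11 | 45 | 0.8 |
| 12 | 52 | 1.1 |
| 13 | 53 | 0.8 |
| 14 | 60 | 0.4 |
| 15 | 65 | 0.6 |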

4. Assume the suspected nonlinear model for the data of Table 6.15 is $z = a x^b y^c$. If we use a log-log transformation, we obtain $\ln z = \ln a + b \ln x + c \ln y$. Use regression techniques to estimate the parameters $a$, $b$, and $c$, and statistically analyze the resulting coefficients.

TABLE 6.15: Nonlinear Data

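| x | y | z |
|---|---|---|
| 101 | 15 | 0.788 |
| 73 | 3 | 304.149 |
| 122 | 5 | 98.245 |
| 56 | 20 | 0.051 |
| 107 | 20 | 0.270 |
| 77 | 5 | 30.485 |
| 140 | 15 | 1.653 |
| 66 | 16 | 0.192 |
| 109 | 5 | 159.918 |
| 103 | 14 | 1.109 |
| 93 | 3 | 699.447 |
| 98 | 4 | 281.184 |
| 76 | 14 | 0.476 |
| 83 | 5 | 54.468 |
| 113 | 12 | 2.810 |
| 167 | 6 | 144.923 |
| 82 | 5 | 79.733 |
| 85 | 6 | 21.821 |
| 103 | 20 | 0.223 |
| 86 | 11 | 1.899 |
| 67 | 8 | 5.180 |
| 104 | 13 | 1.334 |
| 114 | 5 | 110.378 |
| 118 | 21 | 0.274 |
| 94 | 5 | 81.304 |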
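5. Using the basic linear model $y = \beta_0 + \beta_1 x$, fit the following data sets. Provide the model, the analysis of variance information, the value of $R^2$, and a residual plot.

(a)

| x | 100 | 125 | 125 | 150 | 150 | 200 | 200 |
|---|---|---|---|---|---|---|---|
| y | 150 | 140 | 180 | 210 | 190 | 320 | 280 |

| x | 250 | 250 | 300 | 300 | 350 | 400 | 400 |
|---|---|---|---|---|---|---|---|
| y | 400 | 430 | 440 | 390 | 600 | 610 | 670 |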

(b) The following data represents change in growth where x is body weight and у is normalized metabolic rate for 13 animals.

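| x | 110 | 115 | 120 | 230 | 235 | 240 | 360 |
|---|---|---|---|---|---|---|---|
| y | 198 | 173 | 174 | 149 | 124 | 115 | 130 |

| x | 362 | 363 | 500 | 505 | 510 | 515 |
|---|---|---|---|---|---|---|
| y | 102 | 95 | 122 | 112 | 98 | 96 |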

6. Use an appropriate multivariable model for the following ten observations of graduate school admissions. Each observation records the applicant’s high school GPA, GRE score, whether the applicant attended a highly selective college, and whether the student was admitted. 1 indicates “Yes” and 0 indicates “No.”

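| GPA | GRE | Selective | Admitted |
|---|---|---|---|
| 3.61 | 380 | 0 | 1 |
| 3.67 | 660 | 1 | 0 |
| 4.00 | 800 | 1 | 0 |
| 3.19 | 640 | 0 | 0 |
| 2.93 | 520 | 0 | 1 |
| 3.00 | 760 | 0 | 0 |
| 2.98 | 560 | 0 | 0 |
| 3.08 | 400 | 0 | 1 |
| 3.39 | 540 | 0 | 0 |
| 3.92 | 700 | 1 | 1 |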

7. The data set for lung cancer in relation to cigarette smoking in Table 6.16 is from Frome, Biometrics 39, 1983, pg. 665-674. The number of person years in parentheses is broken down by age and daily cigarette consumption. Find and analyze an appropriate multivariate model.

TABLE 6.16: Lung Cancer Rates for Smokers and Nonsmokers

Entries are the number of cases, with person years in parentheses, by number smoked per day.

| Age | Nonsmokers | 1-9 | 10-14 | 15-19 | 20-24 | 25-34 | > 35 |
|---|---|---|---|---|---|---|---|
| 15-20 | 1 (10366) | 0 (3121) | 0 (3577) | 0 (4319) | 0 (5683) | 0 (3042) | 0 (670) |
| 20-25 | 0 (8162) | 0 (2397) | 1 (3286) | 0 (4214) | 1 (6385) | 1 (4050) | 0 (1166) |
| 25-30 | 0 (5969) | 0 (2288) | 1 (2546) | 0 (3185) | 1 (5483) | 4 (4290) | 0 (1482) |
| 30-35 | 0 (4496) | 0 (2015) | 2 (2219) | 4 (2560) | 6 (4687) | 9 (4268) | 4 (1580) |
| 35-40 | 0 (3152) | 1 (1648) | 0 (1826) | 0 (1893) | 5 (3646) | 9 (3529) | 6 (1136) |
| 40-45 | 0 (2201) | 2 (1310) | 1 (1386) | 2 (1334) | 12 (2411) | 11 (2424) | 10 (924) |
| 45-50 | 0 (1421) | 0 (927) | 2 (988) | 2 (849) | 9 (1567) | 10 (1409) | 7 (556) |
| 50-55 | 0 (1121) | 3 (710) | 4 (684) | 2 (470) | 7 (857) | 5 (663) | 4 (255) |
| >55 | 2 (826) | 0 (606) | 3 (449) | 5 (280) | 7 (416) | 3 (284) | 1 (104) |

8. Model absences from class where:

• School: school 1 or school 2
• Gender: female is 1, male is 2
• Ethnicity: categories 1 through 6
• Math Test: score
• Language Test: score
• Bilingual: categories 1 through 4

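| School | Gender | Ethnicity | Math Score | Lang. Score | Bilingual Status | Days Absent |
|---|---|---|---|---|---|---|
| 1 | 2 | 4 | 56.98 | 42.45 | 2 | 4 |
| 1 | 2 | 4 | 37.09 | 46.82 | 2 | 4 |
| 2 | 1 | 4 | 32.37 | 43.57 | 2 | 2 |
| 1 | 1 | 4 | 29.06 | 43.57 | 2 | 3 |
| 2 | 1 | 4 | 6.75 | 27.25 | 3 | 3 |
| 1 | 1 | 4 | 61.65 | 48.41 | 0 | 13 |
| 1 | 1 | 4 | 56.99 | 40.74 | 2 | 11 |
| 2 | 2 | 4 | 10.39 | 15.36 | 2 | 7 |
| 1 | 2 | 4 | 50.52 | 51.12 | 2 | 10 |
| 1 | 2 | 6 | 49.47 | 42.45 | 0 | 9 |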

Projects

Project 1. Fit, analyze, and interpret your results for the nonlinear model $y = a t^b$ with the data provided below. Produce fit plots and residual graphs with your analysis.

Project 2. Fit, analyze, and interpret your results for an appropriate model with the data provided below. Produce fit plots and residual graphs with your analysis.

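| Year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Quantity | 15 | 150 | 250 | 275 | 270 | 280 | 290 | 650 | 1200 | 1550 | 2750 |

| t | 7 | 14 | 21 | 28 | 35 | 42 |
|---|---|---|---|---|---|---|
| y | 8 | 41 | 133 | 250 | 280 | 297 |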

Project 3. Fit, analyze, and interpret your results for the nonlinear model $y = a t^b$ with the data provided by executing the Maple code below. Produce fit plots and residual graphs with your analysis. Use your phone number (no dashes or parentheses) for PN.

• See David L. Smith, Less Than Human: Why We Demean, Enslave, and Exterminate Others.
• E. Melander, M. Oberg, and J. Hall, “The ‘New Wars’ Debate Revisited: An Empirical Evaluation of the Atrociousness of ‘New Wars’,” Uppsala Univ. Press, Uppsala, 2006. Available at www.pcr.uu.se/digitalAssets/654/c_654444-l_l-k_uprp_no_9.pdf.
• J. Kreutz, “How and When Armed Conflicts End: Introducing the UCDP Conflict Termination Dataset,” J. Peace Research, 47(2), 2010, 243-250.
• Homoscedasticity: All random variables have the same finite variance.
• Adapted from “Research Methods II: Multivariate Analysis,” J. Trop. Pediatrics, Online Feature, (2009), pp. 136-143. Originally at: www.oxfordjournals.org/our_journals/tropej/online/ma_chapl3.pdf.
• Data sources: National Statistics Office (Manila, Philippines) and the Archives of the Armed Forces of the Philippines.