Advanced Regression Techniques with Examples
In this section, we will consider sine regression, one-predictor logistic regression, and one-predictor Poisson regression. First, consider data that has an oscillating component.
Nonlinear Regression
Example 6.4. Model Shipping by Month.
Management is asking for a model that explains the behavior of tons of material shipped over time so that predictions might be made concerning future allocation of resources. Table 6.4 shows logistical supply train information collected over 20 months.
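TABLE 6.4: Total Shipping Weight vs. Month
| Month | Shipped (tons) | Month | Shipped (tons) |
|---|---|---|---|
| 1 | 20 | 11 | 19 |
| 2 | 15 | 12 | 25 |
| 3 | 10 | 13 | 32 |
| 4 | 18 | 14 | 26 |
| 5 | 28 | 15 | 21 |
| 6 | 18 | 16 | 29 |
| 7 | 13 | 17 | 35 |
| 8 | 21 | 18 | 28 |
| 9 | 28 | 19 | 22 |
| 10 | 22 | 20 | 32 |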
First, we find the correlation coefficient.
According to our rules of thumb, 0.67 is a moderate to strong value for linear correlation. So is the model to use linear? Plot the data, looking for trends and patterns. Figure 6.4a shows the data as a scatterplot, while 6.4b “connects the dots.”
FIGURE 6.4: Shipping Data Graphs
Although linear regression can be used here, it will not capture the seasonal trends. There appears to be an oscillating pattern with a linear upward trend. For purposes of comparison, find a linear model.
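As a cross-check on the correlation and linear fit, here is a minimal Python sketch (not the book's Maple session) that computes the correlation coefficient, the least-squares line, and R^{2} directly from the Table 6.4 data. The variable names are illustrative only.

```python
import math

# Table 6.4: tons shipped in months 1..20
months = list(range(1, 21))
tons = [20, 15, 10, 18, 28, 18, 13, 21, 28, 22,
        19, 25, 32, 26, 21, 29, 35, 28, 22, 32]

n = len(months)
x_bar = sum(months) / n
y_bar = sum(tons) / n

# Centered sums of squares and cross-products
sxx = sum((x - x_bar) ** 2 for x in months)
syy = sum((y - y_bar) ** 2 for y in tons)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, tons))

r = sxy / math.sqrt(sxx * syy)       # correlation coefficient
slope = sxy / sxx                    # least-squares slope
intercept = y_bar - slope * x_bar    # least-squares intercept
r_squared = r ** 2                   # for a one-predictor line, R^2 = r^2
sse_linear = syy - slope * sxy       # sum of squared residuals of the line

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")
print(f"tons = {intercept:.2f} + {slope:.3f} * month, SSE = {sse_linear:.1f}")
```

The linear SSE printed here gives a baseline against which any oscillating model can be compared.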
The R^{2} value of 0.45 does not indicate a strong fit to the data, as we expected. Since we need to represent oscillations with a slight linear upward trend, we’ll try a sine model with a linear component, y = a_{0} + a_{1}x + a_{2} sin(a_{3}x + a_{4}).
As noted before, good estimates of the parameters a_{i} are necessary for obtaining a good fit. Check Maple’s fit with default parameter values! Use the linear fit from above to estimate a_{0} and a_{1}; use your knowledge of trigonometry to estimate the other parameters.
Use NonlinearRegression from the PSMv2 package.
The coefficients’ p-values look very good, save for the phase shift a_{4}. Plot the model with the data.
This model captures the oscillations and upward trend nicely. The sum of squared error is only SSE = 21.8. The new SSE is quite a bit smaller than that of the linear model. Clearly the model based on sine+linear regression does a much better job in predicting the trends than just using a simple linear regression.
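The sine-plus-linear fit can be sketched without a nonlinear optimizer. The Python below is not the PSMv2 NonlinearRegression routine; it swaps in a different technique: for a fixed angular frequency w, the model c0 + c1 x + c2 sin(wx) + c3 cos(wx) is linear in its coefficients, so we scan w over a grid and solve an ordinary least-squares problem at each value. The grid range (1.3 to 1.9 rad/month, bracketing a 4-month period) and all function names are assumptions made for illustration.

```python
import math

months = list(range(1, 21))
tons = [20, 15, 10, 18, 28, 18, 13, 21, 28, 22,
        19, 25, 32, 26, 21, 29, 35, 28, 22, 32]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_at_frequency(w):
    """Linear least squares for c0 + c1*x + c2*sin(w*x) + c3*cos(w*x) at a fixed w."""
    rows = [[1.0, float(x), math.sin(w * x), math.cos(w * x)] for x in months]
    # Normal equations (X^T X) c = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * y for r, y in zip(rows, tons)) for i in range(4)]
    coeffs = solve(xtx, xty)
    sse = sum((y - sum(c * v for c, v in zip(coeffs, r))) ** 2
              for r, y in zip(rows, tons))
    return sse, coeffs


# Scan angular frequencies around 2*pi/4 (the data appear to repeat about every 4 months).
best_sse, best_w, best_coeffs = float("inf"), None, None
for k in range(601):
    w = 1.3 + 0.001 * k
    sse, coeffs = fit_at_frequency(w)
    if sse < best_sse:
        best_sse, best_w, best_coeffs = sse, w, coeffs

c0, c1, c2, c3 = best_coeffs
amplitude = math.hypot(c2, c3)   # c2*sin(wx) + c3*cos(wx) = A*sin(wx + phase)
phase = math.atan2(c3, c2)

print(f"frequency = {best_w:.3f}, amplitude = {amplitude:.2f}, phase = {phase:.2f}")
print(f"intercept = {c0:.2f}, slope = {c1:.3f}, SSE = {best_sse:.1f}")
```

Writing the sine with a phase shift as a sine-plus-cosine pair is what makes the inner problem linear; the amplitude and phase are recovered afterward from c2 and c3.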
Example 6.5. Modeling Casualties in Afghanistan.
In a January 2010 news report, General Barry McCaffrey, USA, Retired, stated that the situation in Afghanistan would be getting much worse.^{6} General McCaffrey claimed casualties would double over the next year. The problem is to analyze the data to determine whether it supports his assertion.
The data that Gen. McCaffrey used for his analysis were the available 2001-2009 figures shown in Table 6.5. The table also shows casualties for 2010 and part of 2011, which were not available at the time.
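TABLE 6.5: Casualties in Afghanistan by Month
| Month | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 12 | 10 | 25 | 6 | 7 | 21 | 19 | 83 | 199 | 308 |
| 2 | | 13 | 9 | 17 | 5 | 17 | 39 | 18 | 52 | 247 | 245 |
| 3 | | 53 | 14 | 12 | 16 | 7 | 26 | 53 | 78 | 346 | 345 |
| 4 | | 8 | 13 | 11 | 29 | 13 | 61 | 37 | 60 | 307 | 411 |
| 5 | | 2 | 8 | 31 | 34 | 39 | 87 | 117 | 156 | 443 | |
| 6 | | 3 | 4 | 34 | 60 | 68 | 100 | 167 | 213 | 583 | |
| 7 | | 6 | 10 | 25 | 38 | 59 | 100 | 151 | 394 | 667 | |
| 8 | | 2 | 13 | 22 | 72 | 56 | 103 | 167 | 493 | 631 | |
| 9 | | 5 | 19 | 34 | 47 | 70 | 88 | 122 | 390 | 674 | |
| 10 | 5 | 8 | 5 | 38 | 27 | 68 | 131 | 90 | 348 | 631 | |
| 11 | 10 | 4 | 27 | 18 | 12 | 51 | 75 | 37 | 214 | 605 | |
| 12 | 28 | 6 | 12 | 9 | 20 | 23 | 46 | 50 | 168 | 359 | |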
First, do a quick “reasonability model.” Sum the numbers across the years that Gen. McCaffrey had data for (Table 6.6) and graph a scatterplot.
TABLE 6.6: Casualties in Afghanistan by Year
| Year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|---|---|---|
| Casualties | 122 | 144 | 276 | 366 | 478 | 877 | 1028 | 2649 |
The scatterplot’s shape suggests that we use a parabola as our “reasonability model.”
The model’s prediction, while much smaller than the actual 2010 value, is not a doubling. However, the model does suggest further analysis is required.
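To see what a parabola fitted to the annual totals of Table 6.6 predicts, here is a small Python sketch. It is not the book's Maple computation. Because the years are equally spaced, it uses centered, mutually orthogonal contrasts (a standard shortcut for polynomial least squares) so that each coefficient is a simple ratio; the variable names are illustrative.

```python
years = list(range(2002, 2010))
casualties = [122, 144, 276, 366, 478, 877, 1028, 2649]   # Table 6.6

n = len(years)
year_mean = sum(years) / n
u = [yr - year_mean for yr in years]      # centered years
u2_mean = sum(v * v for v in u) / n
q = [v * v - u2_mean for v in u]          # quadratic contrast, orthogonal to 1 and u

# With orthogonal regressors, least-squares coefficients decouple.
b0 = sum(casualties) / n
b1 = sum(v * y for v, y in zip(u, casualties)) / sum(v * v for v in u)
b2 = sum(v * y for v, y in zip(q, casualties)) / sum(v * v for v in q)


def parabola(year):
    """Fitted least-squares parabola evaluated at a calendar year."""
    v = year - year_mean
    return b0 + b1 * v + b2 * (v * v - u2_mean)


pred_2010 = parabola(2010)
ratio_to_2009 = pred_2010 / casualties[-1]
print(f"predicted 2010 total = {pred_2010:.0f}, ratio to 2009 = {ratio_to_2009:.2f}")
```

A ratio near 2 would correspond to the claimed doubling; the fitted parabola gives a noticeably smaller ratio.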
We will focus on the four years before 2010, that is 2006 to 2009, and ask, do we expect the casualties in Afghanistan to double over the next year, 2010, based on those casualty figures?
In the same fashion as before, plot both a scatterplot and a line plot of the data available to Gen. McCaffrey over that period. See Figure 6.5. The line plot may better show trends in the data, such as an upward tendency or oscillations that are not apparent in the scatterplot. However, a line plot can be very difficult to read or interpret when a large number of data points are connected. Good graphing is always a balancing act. After modeling the data from 2006 to 2009, we can use the 2010 values to test our model for goodness of prediction.
Two trends are apparent from the graphs. First, the data oscillates seasonally. This time, however, the oscillations grow in magnitude. We will try to capture that with a sine term multiplied by x, so that the amplitude grows with x. Second, the data appears to have an overall upward trend. We will attempt to capture that feature with a linear component. The nonlinear model we choose is
Using the techniques described in the previous example, we fit the nonlinear model: a growing-amplitude sine plus a linear trend. We estimate the parameters from the scatterplot:
FIGURE 6.5: Afghanistan Casualties Graphs
Use our NonlinearRegression program.
The p-values for all the parameters, except the constant term, are quite good. Plot a graph to see the model capturing the oscillations and linear growth fairly well. Does the model also show the increase in amplitude? Considering the residuals will be our next diagnostic.
Now graph the residuals looking for patterns and warning signs.
The residual plot shows no clear pattern, suggesting the model is adequate, although we note that the model did not “keep up” with the change in amplitude of the oscillations.
What does the model predict for 2010 in relation to 2009?
This model does not show a doubling effect from year four. Thus, the model does not support General McCaffrey’s hypothesis.
Consider the ratios of casualties for each month of 2009 to 2008 and then 2010 to 2009. How would this information affect your conclusions?
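The ratios asked for above can be tabulated with a few lines of Python. The monthly figures below are read from the 2008, 2009, and 2010 columns of Table 6.5, assuming the column alignment under which the 2008 and 2009 columns sum to the Table 6.6 totals (1028 and 2649); the variable names are illustrative.

```python
# Monthly casualties, January through December, read from Table 6.5
y2008 = [19, 18, 53, 37, 117, 167, 151, 167, 122, 90, 37, 50]
y2009 = [83, 52, 78, 60, 156, 213, 394, 493, 390, 348, 214, 168]
y2010 = [199, 247, 346, 307, 443, 583, 667, 631, 674, 631, 605, 359]

# Month-by-month ratios of each year to the year before
monthly_09_08 = [later / earlier for earlier, later in zip(y2008, y2009)]
monthly_10_09 = [later / earlier for earlier, later in zip(y2009, y2010)]

# Ratios of annual totals
annual_09_08 = sum(y2009) / sum(y2008)
annual_10_09 = sum(y2010) / sum(y2009)

# How many months of 2010 were at least double the same month of 2009?
months_doubled_2010 = sum(1 for r in monthly_10_09 if r >= 2.0)

for month, (r1, r2) in enumerate(zip(monthly_09_08, monthly_10_09), start=1):
    print(f"month {month:2d}: 2009/2008 = {r1:5.2f}, 2010/2009 = {r2:5.2f}")
print(f"annual: 2009/2008 = {annual_09_08:.2f}, 2010/2009 = {annual_10_09:.2f}")
```

Comparing these observed ratios with the model's prediction is one way to judge how well the fitted trend anticipated 2010.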
Logistic Regression and Poisson Regression
Often the dependent variable has special characteristics. Here we examine two notable cases: (a) logistic regression, also known as a logit model, where the dependent variable is binary, and (b) Poisson regression where the dependent variable measures integer counts that follow a Poisson distribution.
One-Predictor Logistic Regression
We begin with three one-predictor logistic regression examples in which the dependent variable is binary, i.e., takes values in {0,1}. The logistic regression model form that we will use is
The logistic function, approximating a unit step function, gave the name logistic regression. The most general form handles dependent variables with a finite number of states.
Example 6.6. Damages versus Flight Time.
After a number of hours of flight time, equipment is either damaged or not. Let the dependent variable y be a binary variable with
and let t be the flight time in hours.
Over a reporting period, the data of Table 6.7 has been collected.
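TABLE 6.7: Damage vs. Flight Time
| t | 4 | 2 | 4 | 3 | 9 | 6 | 2 | 11 | 6 | 7 | 3 | 2 | 5 | 3 | 3 | 8 |
| y | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |

| t | 10 | 5 | 13 | 7 | 3 | 4 | 2 | 3 | 2 | 5 | 6 | 6 | 3 | 4 | 10 |
| y | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |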
Calculate a logistic regression for damage. Now, the fit.
The analyst must decide, using the logistic S-curve from the fit, over which intervals of x the predicted probability is called y = 1 and over which it is called y = 0.
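For readers without Maple, the following Python sketch fits a one-predictor logistic model to the Table 6.7 data by maximum likelihood using Newton's method. It assumes the common parameterization p(t) = 1 / (1 + exp(-(a + b t))), which may differ in form from the model displayed in the book, and a zero starting point; all names are illustrative.

```python
import math

# Table 6.7: flight time t (hours) and damage indicator y
t = [4, 2, 4, 3, 9, 6, 2, 11, 6, 7, 3, 2, 5, 3, 3, 8,
     10, 5, 13, 7, 3, 4, 2, 3, 2, 5, 6, 6, 3, 4, 10]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
     0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0]


def prob(a, b, x):
    """Assumed logistic form: probability that y = 1 at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


a, b = 0.0, 0.0
for _ in range(50):
    p = [prob(a, b, x) for x in t]
    # Gradient of the log-likelihood
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, t))
    # Information matrix (negative Hessian of the log-likelihood)
    w = [pi * (1.0 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, t))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, t))
    det = h00 * h11 - h01 * h01
    # Newton step: solve the 2x2 system (information) * step = gradient
    da = (h11 * g0 - h01 * g1) / det
    db = (h00 * g1 - h01 * g0) / det
    a, b = a + da, b + db
    if abs(da) + abs(db) < 1e-10:
        break

t_half = -a / b   # flight time at which the fitted probability equals 0.5
print(f"a = {a:.3f}, b = {b:.3f}, p = 0.5 at t = {t_half:.2f}")
```

One natural (but not the only) classification rule calls y = 1 where the fitted probability exceeds 0.5, that is, on one side of t_half.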
We switch from times to time differentials in the next example.
Example 6.7. Damages vs. Time Differentials.
Replace the times in the previous example with time differentials given in Table 6.8.
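TABLE 6.8: Damage vs. Time Differentials (TD)
| TD | 19.2 | 24.1 | 7.1 | 3.9 | 4.5 | 10.6 | 3 | 16.2 |
| y | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |

| TD | 72.8 | 28.7 | 11.5 | 56.3 | 0.5 | 1.3 | 12.9 | 34.1 |
| y | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

| TD | 6.6 | 2.5 | 24.2 | 2.3 | 36.9 | 11.7 | 2.1 | 10.4 |
| y | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |

| TD | 9.1 | 2 | 12.6 | 18 | 1.5 | 27.3 | 8.4 |
| y | 0 | 0 | 0 | 1 | 0 | 1 | 0 |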
Repeat the procedure of the previous example.
Once again, the analyst must decide, using the logistic S-curve shown above, over which intervals of x the predicted probability is called y = 1 and over which it is called y = 0.
Dehumanization is not a new phenomenon in human conflict. Societies have dehumanized their adversaries since the beginnings of civilization in order to allow them to seize, coerce, maim, or ultimately to kill while avoiding the pain of conscience for committing these extreme, violent actions. By taking away the human traits of these opponents, adversaries are made to be objects deserving of wrath and meriting the violence as justice.^{[1]} Dehumanization still occurs today in both developed and underdeveloped societies. The next example analyzes the impact that dehumanization has in its various forms on the outcome of a state’s ability to win a conflict.
Example 6.8. Conflict and Dehumanization.
To examine dehumanization as a quantitative statistic, we combine a data set of 25 conflicts from Erik Melander, Magnus Oberg, and Jonathan Hall’s “Uppsala Peace and Conflict,” (Table 1, pg. 25)^{[2]} with Joakim Kreutz’s “How and When Armed Conflicts End: Introducing the UCDP Conflict Termination Dataset”^{[3]} to have a designated binary “win-lose” assessment for each conflict. We will use civilian casualties as a proxy indicator of the degree of dehumanization during the conflict. The conflicts in Table 6.9 run the gamut from high- to low-intensity and include both inter- and intrastate hostilities. Therefore, the data is a reasonably general representation.
TABLE 6.9: Top 25 Worst Conflicts Estimated by War-Related Deaths
| Year | Side A | Side B | Side A: Win = 1, Lose = 0 | Civilian (1,000s) | Military (1,000s) | Percentage Civilian Deaths |
|---|---|---|---|---|---|---|
| 1946-48 | India | CPI | 1 | 800 | 0 | 100.0 |
| 1949-62 | Colombia | Mil. Junta | 1 | 200 | 100 | 66.67 |
| 1950-51 | China | Taiwan | 1 | 1,000 | * | 100.0 |
| 1950-53 | Korea | South Korea | 0 | 1,000 | 1,889 | 34.60 |
| 1954-62 | Algeria/France | FLN | 0 | 82 | 18 | 82.00 |
| 1956-59 | China | Tibet | 1 | 60 | 40 | 60.00 |
| 1956-65 | Rwanda/Tutsi | Hutu | 0 | 102 | 3 | 97.14 |
| 1961-70 | Iraq | KDP | 1 | 100 | 5 | 95.24 |
| 1963-72 | Sudan | Anya Nya | 1 | 250 | 250 | 50.00 |
| 1965-66 | Indonesia | OPM | 1 | 500 | * | 100.0 |
| 1965-75 | N. Vietnam | S. Vietnam | 1 | 1,000 | 1,058 | 48.59 |
| 1966-87 | Guatemala | FAR | 1 | 100 | 38 | 72.46 |
| 1967-70 | Nigeria | Rep. Biafra | 1 | 1,000 | 1,000 | 50.00 |
| 1967-70 | Egypt | Israel | 0 | 50 | 25 | 66.67 |
| 1971-71 | Bangladesh | JSS/SB | 1 | 1,000 | 500 | 66.67 |
| 1971-78 | Uganda | Military Fact. | 1 | 300 | 0 | 100.0 |
| 1972-72 | Burundi | Military Fact. | 1 | 80 | 20 | 80.00 |
| 1974-87 | Ethiopia | OLF | 1 | 500 | 46 | 91.58 |
| 1975-90 | Lebanon | LNM | 1 | 76 | 25 | 75.25 |
| 1975-78 | Cambodia | Khmer Rouge | 0 | 1,500 | 500 | 75.00 |
| 1975-87 | Angola | FNLA | 1 | 200 | 13 | 93.90 |
| 1978-87 | Afghanistan | USSR | 1 | 50 | 50 | 50.00 |
| 1979-87 | El Salvador | FMLN | 1 | 50 | 15 | 76.92 |
| 1981-87 | Uganda | Kikosi Maalum | 1 | 100 | 2 | 98.04 |
| 1981-87 | Mozambique | Renamo | 1 | 350 | 51 | 87.28 |
* denotes missing values.
By including the ratio of civilian casualties to total casualties in Table 6.9, we are able to determine what percentage of casualties in each conflict is civilian. This ratio provides a quantifiable variable to analyze.
Binary logistic regression analysis is the first method to choose to analyze the interrelation of dehumanization’s effects (shown by proxy through higher percentages of civilian casualties) on the outcome of conflict as a win (1) or a loss (0). This type of regression model will allow us to infer whether or not the independent variable, civilian casualties percentage, has a statistically significant impact on the conflict’s outcome, win or lose. Using the data from Table 6.9, we assign the civilian casualty percentages to be the independent variable and Side A’s win/loss outcome of the conflict to be the binary dependent variable, then develop a binary logistic regression model. Use Maple to derive the logistic regression statistics from the model as follows.
We derive estimates of the parameters from the data. (See, e.g., Bauldry [B1997] for simple methods.) Take a = -1.9 and b = 0.05 initially.
This result does not pass the common sense test. Ask Maple for more information by increasing infolevel.
Maple’s NonlinearFit could not optimize the regression. Let’s try our NonlinearRegression.
This logistic model result appears much better at first look. However, the coefficients’ p-values tell us to have no confidence in the model. Graph the model with the data!
Analysis Interpretation: The conclusion from our analysis is that the civilian casualty percentages are not significantly correlated with whether the conflict leads to a win or a loss for Side A. Therefore, from this initial study, we can loosely conclude that dehumanization does not have a significant effect on the outcome of a state’s ability to win or lose a conflict. Further investigation will be necessary.
One-Predictor Poisson Regression
According to Devore [D2012], the simple linear regression model is defined by:
There exist parameters β_{0}, β_{1}, and σ^{2}, such that for any fixed input value of x, the dependent variable is a random variable related to x through the model equation Y = β_{0} + β_{1}x + ε. The quantity ε in the model equation is the “error,” a random variable assumed to be normally distributed with mean 0 and variance σ^{2}.
We expand this definition to the case where the response variable y is assumed to have a normal distribution with mean μ_{y} and variance σ^{2}. We found that the mean could be modeled as a function of our multiple predictor variables, x_{1}, x_{2}, ..., x_{k}, using the linear function Y = β_{0} + β_{1}x_{1} + β_{2}x_{2} + ... + β_{k}x_{k}. The key assumptions for least squares are
 • the relationship between dependent and independent variables is linear,
 • errors are independent and normally distributed, and
 • homoscedasticity^{[4]} of the errors.
If any assumption is not satisfied, the model’s adequacy is questioned. In first courses, patterns seen or not seen in residual plots are used to gain information about a model’s adequacy. (See [AA1979], [D2012]).
Normality Assumption Lost
In logistic and Poisson regression, the response variable’s probability lies between 0 and 1. According to Neter [NKNW1996], this constraint loses both the normality and the constant variance assumptions listed above. Without these assumptions, the F and t tests cannot be used for analyzing the regression model. When this happens, transform the model and the data with a logistic transformation of the probability p, called logit p, to map the interval [0,1] to (-∞, +∞), eliminating the 0-1 constraint: logit p = ln(p / (1 - p)) = β_{0} + β_{1}x_{1} + ... + β_{k}x_{k}.
The βs can now be interpreted as increasing or decreasing the “log odds” of an event, and exp(β) (the “odds multiplier”) can be used as the odds ratio for a unit increase or decrease in the associated explanatory variable.
When the response variable is in the form of a count, we face yet another constraint. Counts are nonnegative integers, often corresponding to rare events. Thus, a Poisson distribution (rather than a normal distribution) is more appropriate, since the Poisson has a mean greater than 0 and takes only nonnegative integer values. Recall that the Poisson distribution gives the probability of y events occurring in time period t as
Then the logarithm of the response variable is linked to a linear function of explanatory variables.
Thus
In other words, a Poisson regression model expresses the “log outcome rate” as a linear function of the predictors, sometimes called “exposure variables.”
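The passage above can be made concrete with a short Python sketch. It assumes the standard Poisson probability mass function with mean mu, P(Y = y) = e^{-mu} mu^y / y!, and a log link mu = exp(b0 + b1 x); the coefficient values are hypothetical and chosen only for illustration. The sketch also checks numerically that the mean and variance of the distribution agree, which is the property Poisson regression relies on.

```python
import math


def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson random variable with mean mu (y a nonnegative integer)."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def log_link_mean(b0, b1, x):
    """Mean response under a log link: ln(mu) = b0 + b1*x."""
    return math.exp(b0 + b1 * x)


# Hypothetical coefficients and predictor value, for illustration only
mu = log_link_mean(1.0, 0.5, 2.0)

# Probabilities for counts 0..59 (the tail beyond that is negligible for this mu)
probs = [poisson_pmf(k, mu) for k in range(60)]
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))

print(f"mu = {mu:.3f}, total probability = {sum(probs):.6f}")
print(f"mean = {mean:.3f}, variance = {variance:.3f}")
```

Because ln(mu) is linear in x, a one-unit increase in x multiplies the mean count by exp(b1).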
Assumptions in Poisson Regression
There are several key assumptions in Poisson regression that are different from those in the simple linear regression model. These assumptions include that the logarithm of the dependent variable changes linearly with equal incremental increases in the exposure variable; i.e., the relationship between the logarithm of the dependent variable and the independent variables is linear. For example, if we measure risk in exposure per unit time with one group as counts per month, while another is counts per year, we can convert all exposures to strictly counts. We find that changes in the rate from combined effects of different exposures are multiplicative; i.e., changes in the log of the rate from combined effects of different exposures are additive. We find for each level of the covariates, the number of cases has variance equal to the mean, making it follow a Poisson distribution. Further, we assume the observations are independent.
Here, too, we use diagnostic methods to identify violations of the assumptions. To determine whether variances are too large or too small, plot residuals versus the mean at different levels of the predictor variables. Recall that in simple linear regression, one diagnostic of the model used plots of residuals against fits (fitted values). We will look for patterns in the residual or deviation plots as our main diagnostic tool for Poisson regression.
Poisson Regression Model
The basic model for Poisson regression is
The ith case mean response is denoted by μ_{i}, where μ_{i} can be one of many defined functions (Neter [NKNW1996]). We will only use the form
We assume that the Y_{i} are independent Poisson random variables with expected value μ_{i}.
In order to apply regression techniques, we will use the likelihood function L (see [AA1979, D2012]) given by
Maximizing this function is intrinsically quite difficult. Instead, maximize the logarithm of the likelihood function shown below.
Numerical techniques are used to maximize ln(L) to obtain the best estimates for the coefficients of the model. Often, “good” starting points are required to obtain convergence to the maximum ([Fox2012]).
The deviations or residuals will be used to analyze the model. In Poisson regression, the deviance is given by
where μ_{i} is the fitted model; whenever Y_{i} = 0, we set Y_{i} · ln(Y_{i}/μ_{i}) = 0.
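A small Python function illustrates the deviance calculation, including the zero-count convention just described. It assumes the standard Poisson deviance, D = 2 Σ [Y_i ln(Y_i/μ_i) - (Y_i - μ_i)], which is what the displayed formula is presumed to be; the function name is illustrative.

```python
import math


def poisson_deviance(y, mu):
    """Standard Poisson deviance for observed counts y and fitted means mu.

    Uses the convention that y_i * ln(y_i / mu_i) is 0 whenever y_i == 0.
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


# A perfect fit has zero deviance; a zero count still contributes through mu.
print(poisson_deviance([3, 0, 5], [3.0, 0.0001, 5.0]))
print(poisson_deviance([0], [1.0]))
print(poisson_deviance([2], [1.0]))
```

Without the zero-count convention, the logarithm would be undefined for any observation with Y_i = 0, which is common in count data.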
Diagnostic testing of the coefficients is carried out in the same fashion as for logistic regression. To estimate the variance-covariance matrix, use the Hessian matrix H(X), the matrix of second partial derivatives of the log-likelihood function ln(L) of (6.4). Then the approximated variance-covariance matrix is VC(X, B) = -H(X)^{-1} evaluated at B, the final estimates of the coefficients. The main diagonal elements of VC are estimates for the variance; the estimated standard deviations se_{B} are the square roots of the main diagonal elements. Then perform hypothesis tests on the coefficients using t-tests. Two examples using the Hessian follow.
Example 6.9. Hessian-based Modeling.
Consider the model y_{i} = exp(b_{0} + b_{1}x_{i}) for i = 1, 2, ..., n.
Put this model into (6.4) to obtain
The Hessian H = [h_{ij}] comes from
which gives the estimate of the variance-covariance matrix VC = -H^{-1} evaluated at B. For the two-parameter model (b_{0} and b_{1}), the Hessian is
Change the model slightly, adding a second independent variable with a third parameter. The model becomes y_{i} = exp(b_{0} + b_{1}x_{1i} + b_{2}x_{2i}) for i = 1, 2, ..., n.
Compute the new Hessian and carefully note the similarities.
The pattern in the matrix is easily extended to obtain the Hessian for a model with n independent variables.
Let y_{i} = exp(b_{0} + b_{1}x_{1i} + b_{2}x_{2i} + ... + b_{n}x_{ni}). The general Poisson model Hessian is
Replace the formulas with numerical values from the data. The resulting symmetric square matrix should be nonsingular. Compute the inverse of the negative of the Hessian matrix to find the variance-covariance matrix VC. The main diagonal entries of VC are the (approximate) variances of the estimated coefficients b_{i}. The square roots of the entries on the main diagonal are the estimates of se(b_{i}), the standard error for b_{i}, to be used in the hypothesis testing with t* = b_{i}/se(b_{i}).
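For reference, the Hessian pattern can be written out explicitly. The sketch below assumes that (6.4) is the usual Poisson log-likelihood with a log link; the symbol m for the number of predictors and the convention x_{0i} = 1 are introduced here only for compactness and are not the book's notation.

```latex
% Assumed Poisson log-likelihood with linear predictor \eta_i
\ln L = \sum_{i=1}^{n} \left[ y_i \eta_i - e^{\eta_i} - \ln(y_i!) \right],
\qquad \eta_i = b_0 + b_1 x_{1i} + \cdots + b_m x_{mi}

% Second partial derivatives (entries of the Hessian), with x_{0i} = 1
\frac{\partial^2 \ln L}{\partial b_j \, \partial b_k}
  = -\sum_{i=1}^{n} x_{ji}\, x_{ki}\, \hat{\mu}_i,
\qquad \hat{\mu}_i = e^{\eta_i}

% Two-parameter case, y_i = \exp(b_0 + b_1 x_i)
H = -\begin{pmatrix}
  \sum \hat{\mu}_i       & \sum x_i \hat{\mu}_i \\
  \sum x_i \hat{\mu}_i   & \sum x_i^2 \hat{\mu}_i
\end{pmatrix}
```

Each added predictor contributes one more row and column of the same form, which is why the pattern extends so easily.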
We now have all the information we need to build the tables for a Poisson regression that are similar to a regression program’s output.
Estimating the Regression Coefficients: Summary
The number of predictor variables plus one (for the constant term) gives the number of coefficients in the model y_{i} = exp(b_{0} + b_{1}x_{1i} + b_{2}x_{2i} + ... + b_{n}x_{ni}).
Estimates of the b_{i} are the final values from the numerical search method (if it converged) used to maximize the log-likelihood function ln(L) of (6.4). The values of se(b_{i}), the standard error estimate for b_{i}, are the square roots of the main diagonal of the variance-covariance matrix VC = -H(X)^{-1}. The test statistic is t* = b_{i}/se(b_{i}), and the p-value is the probability P(T > t*). In the summary table of Poisson regression analysis below, let m be the number of variables in the model, and let k be the number of data elements of y, the dependent variable. A summary appears in Table 6.10.
TABLE 6.10: Poisson Regression Variables Summary
| | Degrees of Freedom (df) | Deviance | Mean Deviance (MDev) | Ratio |
|---|---|---|---|---|
| Regression | | | | |
| Residual | | D_{res} = result from the full model with m predictors | | |
| Total | | D_{t} = result from the reduced model y = e^{b_{0}} | | |
Note that a prerequisite for using Poisson regression is that the dependent variable Y must be discrete counts with large numbers being a rare event.
We have chosen two data sets that have published solutions to be our basic examples. First, an outline of the procedure:
Step 0. Enter the data for X and Y.
Step 1. For Y:
 (a) generate a histogram, and
 (b) perform a chi-squared goodness-of-fit test for a Poisson distribution.^{11}
If Y follows a Poisson distribution, then continue. If Y is “count data,” use Poisson regression regardless of the chi-squared test.
Step 2. Compute the value of b_{0} in the constant model y = exp(b_{0}) that minimizes (6.5); i.e., minimize two times the deviations.
Step 3. Compute the values of b_{0} and b_{1} in the model y = exp(b_{0} + b_{1}x) that minimize the deviation (6.5).
Step 4. Interpret the results and the odds ratio.
We’ll step through an example following the outline above.
Example 6.10. Hospital Surgeries.
A group of hospitals has collected data on the numbers of Caesarean surgeries vs. the total number of births (see Table 6.11).^{[5]}
TABLE 6.11: Total Births vs. Caesarean Surgeries
| Total | 3246 | 2750 | 2507 | 2371 | 1904 | 1501 | 1272 | 1080 | 1027 | 970 |
| Special | 26 | 24 | 21 | 21 | 21 | 20 | 19 | 18 | 18 | 17 |

| Total | 739 | 679 | 502 | 236 | 357 | 309 | 192 | 138 | 100 | 95 |
| Special | 17 | 16 | 16 | 16 | 16 | 15 | 14 | 14 | 13 | 13 |
Use the hospitals’ data set to perform a Poisson regression following the steps listed above.
Step 0. Enter the data.
Step 1. Plot a histogram, and then perform a chi-squared goodness-of-fit test on yhc, if appropriate.
(Note: Maple’s Histogram function is in the Statistics package. There are a large number of options for binning the data; we will use frequencyscale = absolute to have the heights of the bars equal to the frequency of entries in the associated bin. Collect the bin counts with TallyInto.)
Now for the chi-squared test. First, generate the predicted values from an estimated Poisson distribution.
We are ready to use Maple’s chi-squared test, ChiSquareGoodnessOfFitTest, with a significance level of 0.05. Use the summarize = embed option, as it produces the most readable output. The command is terminated with a colon: “embedding the output” makes it unnecessary to return a result.
The chi-squared test indicates that a Poisson distribution is reasonable.
Step 2. Find the best constant model y = exp(b_{0}).
Let’s use Maple’s LinearFit on the function Y = ln(y) = b_{0}.
Step 3. Find the best exponential model y = exp(b_{0} + b_{1}x). Let’s use Maple’s ExponentialFit to find the model.
Step 4. Conclude by calculating the odds ratio.
Use the odds multiplier exp(β_{1}) as the approximate odds ratio, often called the risk ratio for Poisson regression.
OR represents the potential increase resulting from one unit increase in x. (How does this concept relate to “opportunity cost” in linear programming and “marginal revenue” in economics?)
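As an alternative view of Steps 3 and 4, the Python sketch below fits y = exp(b0 + b1 x) to the Table 6.11 data by maximizing the Poisson log-likelihood with Newton's method, then reads standard errors off the inverse of the negative Hessian. This is a different route from Maple's ExponentialFit (which works on a log-transformed response), so the coefficients need not match the book's. It assumes x is total births and y is the number of Caesarean surgeries, and it rescales births to thousands purely for numerical convenience; all names are illustrative.

```python
import math

# Table 6.11
births = [3246, 2750, 2507, 2371, 1904, 1501, 1272, 1080, 1027, 970,
          739, 679, 502, 236, 357, 309, 192, 138, 100, 95]
caesareans = [26, 24, 21, 21, 21, 20, 19, 18, 18, 17,
              17, 16, 16, 16, 16, 15, 14, 14, 13, 13]

x = [b / 1000.0 for b in births]   # births in thousands (a rescaling assumption)


def score_and_information(b0, b1):
    """Gradient of ln(L) and the negative Hessian for y_i ~ Poisson(exp(b0 + b1*x_i))."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    g0 = sum(yi - mi for yi, mi in zip(caesareans, mu))
    g1 = sum((yi - mi) * xi for yi, mi, xi in zip(caesareans, mu, x))
    h00 = sum(mu)
    h01 = sum(mi * xi for mi, xi in zip(mu, x))
    h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
    return (g0, g1), (h00, h01, h11)


# Start from the constant model: b0 = ln(mean count), b1 = 0
b0, b1 = math.log(sum(caesareans) / len(caesareans)), 0.0
for _ in range(50):
    (g0, g1), (h00, h01, h11) = score_and_information(b0, b1)
    det = h00 * h11 - h01 * h01
    step0 = (h11 * g0 - h01 * g1) / det
    step1 = (h00 * g1 - h01 * g0) / det
    b0, b1 = b0 + step0, b1 + step1
    if abs(step0) + abs(step1) < 1e-12:
        break

# Variance-covariance matrix = inverse of the negative Hessian at the final estimates
(g0, g1), (h00, h01, h11) = score_and_information(b0, b1)
det = h00 * h11 - h01 * h01
se_b0 = math.sqrt(h11 / det)
se_b1 = math.sqrt(h00 / det)
t_b1 = b1 / se_b1
multiplier_per_1000 = math.exp(b1)   # multiplier per additional 1000 births

print(f"b0 = {b0:.4f} (se {se_b0:.4f}), b1 = {b1:.4f} (se {se_b1:.4f}), t* = {t_b1:.2f}")
print(f"multiplier per 1000 additional births = {multiplier_per_1000:.3f}")
```

Because of the rescaling, exp(b1) here is the multiplier for 1000 additional births; the per-birth multiplier is exp(b1 / 1000).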
Return to the Philippines example relating literacy and violence described in the opening of this chapter.
Example 6.11. Violence in the Philippines.
The number of significant acts of violence, SigActs in Table 6.12, are integer counts.^{[6]}
TABLE 6.12: Literacy Rate (Lit) vs. Significant Acts of Violence (SigActs), Philippines, 2008.
| Province | Lit | SigActs | Province | Lit | SigActs |
|---|---|---|---|---|---|
| Basilan | 71.6 | 29 | Dinagat Islands | 85.7 | 0 |
| Lanao del Sur | 71.6 | 30 | Surigao del Norte | 85.7 | 10 |
| Maguindanao | 71.6 | 122 | Surigao del Sur | 85.7 | 31 |
| Sulu | 71.6 | 26 | Bukidnon | 85.9 | 14 |
| Tawi-Tawi | 71.6 | 1 | Camiguin | 85.9 | 0 |
| Biliran | 72.9 | 0 | Lanao del Norte | 85.9 | 57 |
| Eastern Samar | 72.9 | 11 | Misamis Occidental | 85.9 | 8 |
| Leyte | 72.9 | 2 | Misamis Oriental | 85.9 | 7 |
| Northern Samar | 72.9 | 23 | Batanes | 86.1 | 0 |
| Southern Leyte | 72.9 | 0 | Cagayan | 86.1 | 15 |
| Western Samar | 72.9 | 64 | Isabela | 86.1 | 4 |
| North Cotabato | 78.3 | 125 | Nueva Vizcaya | 86.1 | 3 |
| Sarangani | 78.3 | 23 | Quirino | 86.1 | 0 |
| South Cotabato | 78.3 | 5 | Bohol | 86.6 | 2 |
| Sultan Kudarat | 78.3 | 18 | Cebu | 86.6 | 0 |
| Zamboanga del Norte | 79.6 | 8 | Negros Oriental | 86.6 | 27 |
| Zamboanga del Sur | 79.6 | 10 | Siquijor | 86.6 | 0 |
| Zamboanga Sibugay | 79.6 | 3 | Abra | 89.2 | 11 |
| Albay | 79.9 | 35 | Apayao | 89.2 | 0 |
| Camarines Norte | 79.9 | 12 | Benguet | 89.2 | 0 |
| Camarines Sur | 79.9 | 44 | Ifugao | 89.2 | 0 |
| Catanduanes | 79.9 | 9 | Kalinga | 89.2 | 11 |
| Masbate | 79.9 | 42 | Mountain Province | 89.2 | 0 |
| Sorsogon | 79.9 | 52 | Ilocos Norte | 91.3 | 0 |
| Compostela Valley | 81.7 | 126 | Ilocos Sur | 91.3 | 2 |
| Davao del Norte | 81.7 | 35 | La Union | 91.3 | 0 |
| Davao del Sur | 81.7 | 64 | Pangasinan | 91.3 | 0 |
| Davao Oriental | 81.7 | 40 | Aurora | 92.1 | 10 |
| Aklan | 82.6 | 0 | Bataan | 92.1 | 1 |
| Antique | 82.6 | 1 | Bulacan | 92.1 | 6 |
| Capiz | 82.6 | 8 | Nueva Ecija | 92.1 | 4 |
| Guimaras | 82.6 | 0 | Pampanga | 92.1 | 3 |
| Iloilo | 82.6 | 8 | Tarlac | 92.1 | 4 |
| Negros Occidental | 82.6 | 26 | Zambales | 92.1 | 6 |
| Marinduque | 83.9 | 0 | Batangas | 93.5 | 5 |
| Occidental Mindoro | 83.9 | 5 | Cavite | 93.5 | 0 |
| Oriental Mindoro | 83.9 | 7 | Laguna | 93.5 | 4 |
| Palawan | 83.9 | 2 | Quezon | 93.5 | 28 |
| Romblon | 83.9 | 0 | Rizal | 93.5 | 3 |
| Agusan del Norte | 85.7 | 13 | Metropolitan Manila | 94 | 1 |
| Agusan del Sur | 85.7 | 33 | | | |
The literacy data has been defined as L, the SigActs as V. Examine the histogram in Figure 6.6 to see that the data appears to follow a Poisson distribution. A goodness-of-fit test (left as an exercise) confirms the data follows a Poisson distribution.
FIGURE 6.6: Histogram of SigActs Data
Use Maple to fit the data. First, remove the three outlier data points with values well over 100, as there are other much more significant generators of violence beyond literacy levels in those regions. We cannot use Maple’s ExponentialFit, as it attempts a logtransformation of SigActs which fails due to 0 values.
Plot the fit.
We accept that the fit looks pretty good.
The odds multiplier, e^{b_{1}}, for our fit is e^{-0.055437} ≈ 0.946, which means that for every 1 unit increase in literacy we expect violence to go down ≈ 5.4%. This value suggests improving literacy will help ameliorate the violence.
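The arithmetic behind the odds multiplier is easy to check in Python. The slope value below is taken from the fit quoted above, reading the exponent as -0.055437 (the reading consistent with the stated multiplier of about 0.946); the 10-point scenario is a hypothetical added for illustration.

```python
import math

b1 = -0.055437   # literacy coefficient as read from the quoted fit (an assumption)

odds_multiplier = math.exp(b1)                 # effect of a 1-point literacy increase
percent_change = (odds_multiplier - 1.0) * 100

# Effects compound multiplicatively: a 10-point increase multiplies by exp(10 * b1)
multiplier_10_points = math.exp(10 * b1)

print(f"multiplier per literacy point: {odds_multiplier:.3f} ({percent_change:.1f}%)")
print(f"multiplier per 10 literacy points: {multiplier_10_points:.3f}")
```

Note that the effect of several units is the single-unit multiplier raised to a power, not a sum of percentages.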
Poisson Regression with Multiple Predictor Variables in Maple
Often, there are many variables that influence the outcome under study. We’ll add a second predictor to the Hospital Births problem.
Example 6.12. Hospital Births Redux.
Revisit Example 6.10 with an additional predictor: the type of hospital, rural (0) or urban (1). The new data appears in Table 6.13.
TABLE 6.13: Total Births vs. Caesarean Surgeries and Hospital Type
| Total | 3246 | 2750 | 2507 | 2371 | 1904 | 1501 | 1272 | 1080 | 1027 | 970 |
| Special | 26 | 24 | 21 | 21 | 21 | 20 | 19 | 18 | 18 | 17 |
| Type | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

| Total | 739 | 679 | 502 | 236 | 357 | 309 | 192 | 138 | 100 | 95 |
| Special | 17 | 16 | 16 | 16 | 16 | 15 | 14 | 14 | 13 | 13 |
| Type | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
The data has been entered as B: Total, C: Special, and T: Type. After loading the Statistics package, define the model.
Collect the data and use NonlinearFit to fit the model.
Finishing the statistical analysis of the model is left as an exercise.
Exercises
 1. Adjust the nonlinear model for Afghanistan casualties, Example 6.5, to increase the amplitude of the sine term more quickly. How does the conclusion change, if at all?
 2. Investigate the action of parameters in the logistic function by executing the Maple statements below using the Explore command to make an interactive graph.
3. For the data in Table 6.14 (a) plot the data and (b) state the type of regression that should be used to model the data.
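TABLE 6.14: Tire Tread Data
| Number | Hours | Tread (cm) |
|---|---|---|
| 1 | 2 | 5.4 |
| 2 | 5 | 5.0 |
| 3 | 7 | 4.5 |
| 4 | 10 | 3.7 |
| 5 | 14 | 3.5 |
| 6 | 19 | 2.5 |
| 7 | 26 | 2.0 |
| 8 | 31 | 1.6 |
| 9 | 34 | 1.8 |
| 10 | 38 | 1.3 |
| 11 | 45 | 0.8 |
| 12 | 52 | 1.1 |
| 13 | 53 | 0.8 |
| 14 | 60 | 0.4 |
| 15 | 65 | 0.6 |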
4. Assume the suspected nonlinear model for the data of Table 6.15 is If we use a loglog transformation, we obtain
Use regression techniques to estimate the parameters a, b, and c, and statistically analyze the resulting coefficients.
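TABLE 6.15: Nonlinear Data
| X | Y | Z |
|---|---|---|
| 101 | 15 | 0.788 |
| 73 | 3 | 304.149 |
| 122 | 5 | 98.245 |
| 56 | 20 | 0.051 |
| 107 | 20 | 0.270 |
| 77 | 5 | 30.485 |
| 140 | 15 | 1.653 |
| 66 | 16 | 0.192 |
| 109 | 5 | 159.918 |
| 103 | 14 | 1.109 |
| 93 | 3 | 699.447 |
| 98 | 4 | 281.184 |
| 76 | 14 | 0.476 |
| 83 | 5 | 54.468 |
| 113 | 12 | 2.810 |
| 167 | 6 | 144.923 |
| 82 | 5 | 79.733 |
| 85 | 6 | 21.821 |
| 103 | 20 | 0.223 |
| 86 | 11 | 1.899 |
| 67 | 8 | 5.180 |
| 104 | 13 | 1.334 |
| 114 | 5 | 110.378 |
| 118 | 21 | 0.274 |
| 94 | 5 | 81.304 |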
 5. Using the basic linear model y = β_{0} + β_{1}x, fit the following data sets. Provide the model, the analysis of variance information, the value of R^{2}, and a residual plot.
 (a)
| x | 100 | 125 | 125 | 150 | 150 | 200 | 200 |
| y | 150 | 140 | 180 | 210 | 190 | 320 | 280 |

| x | 250 | 250 | 300 | 300 | 350 | 400 | 400 |
| y | 400 | 430 | 440 | 390 | 600 | 610 | 670 |
 (b) The following data represents change in growth where x is body weight and y is normalized metabolic rate for 13 animals.
| x | 110 | 115 | 120 | 230 | 235 | 240 | 360 |
| y | 198 | 173 | 174 | 149 | 124 | 115 | 130 |

| x | 362 | 363 | 500 | 505 | 510 | 515 |
| y | 102 | 95 | 122 | 112 | 98 | 96 |
6. Use an appropriate multivariable model for the following ten observations of applications to graduate school, each recording GRE score, high school GPA, whether the college was highly selective, and whether the student was admitted. 1 indicates “Yes” and 0 indicates “No.”
| GPA | GRE | Selective | Admitted |
|---|---|---|---|
| 3.61 | 380 | 0 | 1 |
| 3.67 | 660 | 1 | 0 |
| 4.00 | 800 | 1 | 0 |
| 3.19 | 640 | 0 | 0 |
| 2.93 | 520 | 0 | 1 |
| 3.00 | 760 | 0 | 0 |
| 2.98 | 560 | 0 | 0 |
| 3.08 | 400 | 0 | 1 |
| 3.39 | 540 | 0 | 0 |
| 3.92 | 700 | 1 | 1 |
7. The data set for lung cancer in relation to cigarette smoking in Table 6.16 is from Frome, Biometrics 39, 1983, pp. 665-674. The number of person years, in parentheses, is broken down by age and daily cigarette consumption. Find and analyze an appropriate multivariate model.
TABLE 6.16: Lung Cancer Rates for Smokers and Nonsmokers
Columns after Age give the number smoked per day.
| Age | Nonsmokers | 1-9 | 10-14 | 15-19 | 20-24 | 25-34 | > 35 |
|---|---|---|---|---|---|---|---|
| 15-20 | 1 (10366) | 0 (3121) | 0 (3577) | 0 (4319) | 0 (5683) | 0 (3042) | 0 (670) |
| 20-25 | 0 (8162) | 0 (2397) | 1 (3286) | 0 (4214) | 1 (6385) | 1 (4050) | 0 (1166) |
| 25-30 | 0 (5969) | 0 (2288) | 1 (2546) | 0 (3185) | 1 (5483) | 4 (4290) | 0 (1482) |
| 30-35 | 0 (4496) | 0 (2015) | 2 (2219) | 4 (2560) | 6 (4687) | 9 (4268) | 4 (1580) |
| 35-40 | 0 (3152) | 1 (1648) | 0 (1826) | 0 (1893) | 5 (3646) | 9 (3529) | 6 (1136) |
| 40-45 | 0 (2201) | 2 (1310) | 1 (1386) | 2 (1334) | 12 (2411) | 11 (2424) | 10 (924) |
| 45-50 | 0 (1421) | 0 (927) | 2 (988) | 2 (849) | 9 (1567) | 10 (1409) | 7 (556) |
| 50-55 | 0 (1121) | 3 (710) | 4 (684) | 2 (470) | 7 (857) | 5 (663) | 4 (255) |
| > 55 | 2 (826) | 0 (606) | 3 (449) | 5 (280) | 7 (416) | 3 (284) | 1 (104) |
8. Model absences from class where:
School: school 1 or school 2
Gender: female is 1, male is 2
Ethnicity: categories 1 through 6
Math Test: score
Language Test: score
Bilingual: categories 1 through 4
| School | Gender | Ethnicity | Math Score | Lang. Score | Bilingual Status | Days Absent |
|---|---|---|---|---|---|---|
| 1 | 2 | 4 | 56.98 | 42.45 | 2 | 4 |
| 1 | 2 | 4 | 37.09 | 46.82 | 2 | 4 |
| 2 | 1 | 4 | 32.37 | 43.57 | 2 | 2 |
| 1 | 1 | 4 | 29.06 | 43.57 | 2 | 3 |
| 2 | 1 | 4 | 6.75 | 27.25 | 3 | 3 |
| 1 | 1 | 4 | 61.65 | 48.41 | 0 | 13 |
| 1 | 1 | 4 | 56.99 | 40.74 | 2 | 11 |
| 2 | 2 | 4 | 10.39 | 15.36 | 2 | 7 |
| 1 | 2 | 4 | 50.52 | 51.12 | 2 | 10 |
| 1 | 2 | 6 | 49.47 | 42.45 | 0 | 9 |
Projects
Project 1. Fit, analyze, and interpret your results for the nonlinear model y = at^{b} with the data provided below. Produce fit plots and residual graphs with your analysis.
Project 2. Fit, analyze, and interpret your results for an appropriate model with the data provided below. Produce fit plots and residual graphs with your analysis.
| Year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Quantity | 15 | 150 | 250 | 275 | 270 | 280 | 290 | 650 | 1200 | 1550 | 2750 |

| t | 7 | 14 | 21 | 28 | 35 | 42 |
| y | 8 | 41 | 133 | 250 | 280 | 297 |
Project 3. Fit, analyze, and interpret your results for the nonlinear model у = at^{b} with the data provided by executing the Maple code below. Produce fit plots and residual graphs with your analysis. Use your phone number (no dashes or parentheses) for PN.
 [1] See David L. Smith, Less Than Human: Why We Demean, Enslave, and Exterminate Others.
 [2] E. Melander, M. Oberg, and J. Hall, “The ‘New Wars’ Debate Revisited: An Empirical Evaluation of the Atrociousness of ‘New Wars’,” Uppsala Univ. Press, Uppsala, 2006. Available at www.pcr.uu.se/digitalAssets/654/c_654444l_lk_uprp_no_9.pdf.
 [3] J. Kreutz, “How and When Armed Conflicts End: Introducing the UCDP Conflict Termination Dataset,” J. Peace Research, 47(2), 2010, 243-250.
 [4] Homoscedasticity: All random variables have the same finite variance.
 [5] Adapted from “Research Methods II: Multivariate Analysis,” J. Trop. Pediatrics, Online Feature, (2009), pp. 136-143. Originally at: www.oxfordjournals.org/our_journals/tropej/online/ma_chapl3.pdf.
 [6] Data sources: National Statistics Office (Manila, Philippines) and the Archives of the Armed Forces of the Philippines.