Modeling, Correlation, and Regression
Return to the Philippines scenario introduced at the beginning of the chapter. The researcher actually began with linear regression. For example, in 2008, an analysis of the literacy index per region versus the number of significant acts of violence (by terrorists or insurgents) produced the linear model
with correlation coefficient −0.3742, so R^{2} = 0.14 (R^{2} is the square of the correlation coefficient). The model is plotted in Figure 6.2.
FIGURE 6.2: Philippines: Literacy v. Significant Acts of Violence, 2008
Within all the data, Figure 6.2 represents one of the “best” resulting overall models; note that R^{2} is only 0.14, which indicates a very poor linear fit, i.e., a very weak linear relationship. We will return to this scenario later in the chapter, but suffice it to say, linear regression is not useful to explain, predict, or answer the questions the Philippine government had posed.
Linear, Nonlinear, and Multiple Regression
Many applications are not linear. For example, a pendulum is modeled by the differential equation θ'' + ω^{2} sin(θ) = 0; this equation can be “linearized” by replacing sin(θ) with θ, a good linear approximation for small values of θ. Two widely used rules of thumb for relating correlation to the likelihood of a linear relationship come from Devore [D2012] for math, science, and engineering data, shown in Table 6.1, and a more liberal interpretation from Johnson [J2012] for non-math, non-science, and non-engineering data, given in Table 6.2.
TABLE 6.1: Linearity Likelihood for Math, Science, and Engineering Data; Devore
| Correlation ρ | Relationship |
| 0.8 < ρ < 1.0 | Strong linear relationship |
| 0.5 < ρ < 0.8 | Moderate linear relationship |
| 0.0 < ρ < 0.5 | Weak linear relationship |
TABLE 6.2: Linearity Likelihood for Non-Math, Non-Science, and Non-Engineering Data; Johnson
| Correlation ρ | Relationship |
| 0.5 < ρ < 1.0 | Strong linear relationship |
| 0.3 < ρ < 0.5 | Moderate linear relationship |
| 0.1 < ρ < 0.3 | Weak linear relationship |
| 0.0 < ρ < 0.1 | No linear relationship |
In our modeling efforts, we emphasize the interpretation of ρ ≈ 0. A very small correlation coefficient can be interpreted as indicating either no linear relationship or the existence of a nonlinear relationship. Many neophyte analysts fail to pick up on the importance of potential nonlinear relationships in their interpretation.
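To see why, consider a small illustration (not from the text): data generated by y = x^{2} over a symmetric range has a perfect nonlinear relationship, yet its linear correlation is exactly zero. A sketch in Python (the chapter's own computations use Maple):

```python
# Pearson correlation for perfectly quadratic data, standard library only.
# The relationship y = x^2 is deterministic, yet the *linear* correlation
# over a symmetric range comes out exactly zero.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]          # strong nonlinear relationship
print(pearson(xs, ys))             # 0.0 -- "no linear relationship"
```

A scatterplot would reveal the parabola instantly, which is exactly why the residual and scatterplot diagnostics below matter more than ρ alone.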
Let’s look at an example where we attempt to fit a linear model to data that may not be linear.
Example 6.1. Exponential Decay.
Build a mathematical model to predict the degree of recovery after discharge for orthopedic surgical patients. There are two variables: time t in days in the hospital, and a medical prognostic index for recovery y, with larger values indicating a better prognosis. The data in Table 6.3 come from Neter et al. [NKNW1996, p. 469].
TABLE 6.3: Hospital Stay vs. Recovery Index Data
| t | 2 | 5 | 7 | 10 | 14 | 19 | 26 | 31 | 34 | 38 | 45 | 52 | 53 | 60 | 65 |
| y | 54 | 50 | 45 | 37 | 35 | 25 | 20 | 16 | 18 | 13 | 8 | 11 | 8 | 4 | 6 |
First, a scatterplot of the data.
The scatterplot shows a clear negative trend.
Check the correlation ρ.
Using either rule of thumb, the correlation coefficient ρ = −0.94 indicates a strong linear relationship. With this value in hand, looking at the scatterplot leads us to think we will have an excellent linear regression model.
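The correlation is easy to verify outside Maple; a Python sketch using only the standard library, with the Table 6.3 data:

```python
# Pearson correlation for the hospital-stay data of Table 6.3.
from math import sqrt

t = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

n = len(t)
mt, my = sum(t) / n, sum(y) / n
sxy = sum((a - mt) * (b - my) for a, b in zip(t, y))
sxx = sum((a - mt) ** 2 for a in t)
syy = sum((b - my) ** 2 for b in y)

rho = sxy / sqrt(sxx * syy)
print(round(rho, 2))   # -0.94
```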
Let’s determine the linear model, calling it RI.
It’s time for diagnostics. Calculate the residuals and percent relative errors, then plot the residuals.
Look at the worst percent relative error values.
We’ve obtained the model y = f(t) = 46.4601 − 0.7525t. The sum of squared error is 451.1945, the correlation is −0.94, and R^{2}, the coefficient of determination (the square of the correlation coefficient), is 0.886. These are all indicators of a “good” linear model.
But...
Although some percent relative errors are small, others are quite large, with eight of the fifteen over 20%. The largest two positive errors are ≈ 67% and 140%, and the largest two negative errors are ≈ −45% and −57%. How much confidence would you have in making predictions with this model? The residual plot clearly shows a curved pattern. The residuals and percent relative errors show that we do not have an adequate model. Advanced courses in statistical regression show how to attempt to correct this inadequacy.
Further, suppose we need to predict the index at 100 days.
A negative value is clearly unacceptable and makes no sense in the context of our problem since the index is always positive. This model does not pass the common sense test.
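The whole linear-regression workflow above, the fit, SSE, R^{2}, and the failed prediction at t = 100, can be reproduced in a few lines; a stdlib-Python sketch (the text itself uses Maple):

```python
# Least-squares line for the Table 6.3 data, with diagnostics.
t = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

n = len(t)
mt, my = sum(t) / n, sum(y) / n
sxx = sum((a - mt) ** 2 for a in t)
sxy = sum((a - mt) * (b - my) for a, b in zip(t, y))

b = sxy / sxx                      # slope
a = my - b * mt                    # intercept

f = lambda x: a + b * x
sse = sum((yi - f(ti)) ** 2 for ti, yi in zip(t, y))
sst = sum((yi - my) ** 2 for yi in y)
r2 = 1 - sse / sst

print(round(a, 4), round(b, 4))    # 46.4604 -0.7525
print(round(sse, 2), round(r2, 3)) # 451.19 0.886
print(f(100))                      # about -28.8: fails the common-sense test
```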
With a strong correlation of −0.94, what went wrong? The residual plot diagnostic shows a curved pattern. In many regression analysis books, the suggested first-attempt cure for a curved residual plot is adding a nonlinear term that is missing from the model. Since we only have (t, y) data points, we’ll add a nonlinear term a_{2}x^{2} to the model.
Example 6.2. Multiple Linear Regression.
Fit a parabolic model f(x) = a_{0} + a_{1}x + a_{2}x^{2} to the Hospital Stay vs. Recovery Index Data of Table 6.3.
This model, although a parabola, is linear in the “variables” a_{0}, a_{1}, and a_{2}.
Let’s find the fit with Maple.
Use the formula R^{2} = 1 − SSE/SST to compute R^{2}, the coefficient of determination, for the quadratic fit.
These diagnostics look quite good. The sum of the squares of the errors is much smaller. The coefficient of determination R^{2} is larger, and the correlation has increased to 0.99.
Let’s plot the new residuals to see if there is still a curved pattern.
Use the model to predict the index at 100 days. We find QF(100) = 32.92. The answer is now positive, but again does not pass the common-sense test: the quadratic function is curving upward toward positive infinity. Graph it! An upward curve is an unacceptable outcome; an increasing curve for the index would mean that staying in the hospital for an extended time leads to a better prognosis. We certainly cannot use the model to predict values of the index for t beyond 60.
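Because the parabola is linear in its coefficients, the quadratic fit reduces to solving a 3×3 system of normal equations; a stdlib-Python sketch (the last digits may differ slightly from Maple's output):

```python
# Quadratic least squares y ~ a0 + a1*t + a2*t^2 for the Table 6.3 data,
# via the 3x3 normal equations solved by Gaussian elimination.
t = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

# Normal equations A @ [a0, a1, a2] = rhs, with A[i][j] = sum(t^(i+j)).
A = [[sum(ti ** (i + j) for ti in t) for j in range(3)] for i in range(3)]
rhs = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(3)]

# Gaussian elimination with partial pivoting.
for col in range(3):
    piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    rhs[col], rhs[piv] = rhs[piv], rhs[col]
    for r in range(col + 1, 3):
        m = A[r][col] / A[col][col]
        for c in range(col, 3):
            A[r][c] -= m * A[col][c]
        rhs[r] -= m * rhs[col]

coef = [0.0, 0.0, 0.0]
for r in (2, 1, 0):    # back substitution
    coef[r] = (rhs[r] - sum(A[r][c] * coef[c] for c in range(r + 1, 3))) / A[r][r]

a0, a1, a2 = coef
qf = lambda x: a0 + a1 * x + a2 * x ** 2
sse = sum((yi - qf(ti)) ** 2 for ti, yi in zip(t, y))
my = sum(y) / len(y)
sst = sum((yi - my) ** 2 for yi in y)
print(round(1 - sse / sst, 2))  # R^2 of about 0.98
print(qf(100))                  # positive, and the parabola opens upward (a2 > 0)
```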
We must try a different model. Looking at the original scatterplot leads us to attempt fitting an exponential decay curve.
Example 6.3. Nonlinear Regression: Exponential Decay.
Fit an exponential decay model f(x) = a_{0}e^{a_{1}x} to the Hospital Stay vs. Recovery Index Data of Table 6.3.^{[1]}
Maple’s NonlinearFit command from the Statistics package does not do well unless we specify reasonable estimates for the parameters of our model, here a_{0} and a_{1}.
Test the fit at t = 100.
A ridiculous value!
Care must be taken with the selection of the initial values for the unknown parameters (see, e.g., Fox [Fox2012]). Maple yields good models when the calculations are based upon good input parameters. How can we get reasonable approximations to a_{0} and a_{1}? Let's use a standard technique: transform the model to make it linear. Apply logarithms to our model.
Taking logs gives ln(y) = ln(a_{0}) + a_{1}x. Relabel ln(y) = Y and ln(a_{0}) = A_{0} to obtain Y = A_{0} + a_{1}x, a model that is linear in the parameters A_{0} and a_{1}. Now use Maple to fit the transformed model.
Good estimates to use for the parameters are
This prediction is much closer to what we expect. A common technique used to test the result is to recompute the model with different, but close, initial values. Try it!
Note how close the nonlinear-fitted parameter values are to these estimates. The difference comes from where least squares is applied in the computation: to the logarithms of the data points rather than to the points themselves. Write out the general least squares formulas for the model and the transformed model to see the difference. Also, compare the model from Maple’s ExponentialFit to the one we have found.
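The log-transform estimation step is itself just a linear fit to (t, ln y); a stdlib-Python sketch that recovers starting estimates for a_{0} and a_{1}:

```python
# Initial-parameter estimates for y = a0 * exp(a1 * t) via the
# log transform Y = ln(y) = ln(a0) + a1 * t and ordinary least squares.
from math import log, exp

t = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

Y = [log(v) for v in y]
n = len(t)
mt, mY = sum(t) / n, sum(Y) / n
a1 = sum((p - mt) * (q - mY) for p, q in zip(t, Y)) / \
     sum((p - mt) ** 2 for p in t)          # slope of the transformed fit
A0 = mY - a1 * mt
a0 = exp(A0)                                # undo the log on the intercept

print(round(a0, 1), round(a1, 4))           # roughly 56.7 and -0.038
```

These values then serve as the initial guesses handed to the nonlinear fitter.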
For a nonlinear model, linear correlation and R^{2} have no meaning. Computing the sum of squared error gives SSE = 45.495. This value is substantially smaller than SSE = 451.1945 obtained by the linear model. The nonlinear model appears reasonable. Now check the nonlinear model’s residual plot for patterns.
Figure 6.3 shows the exponential decay model overlaid on the data.
FIGURE 6.3: Exponential Decay Fit to Hospital Recovery Data
The percent relative errors are also much improved—only four are larger than 20%, and none are beyond 36.3%. Verify this!
Using this model to predict the index at t = 100 gives y = 1.12. This new result passes the common-sense test. This model would be our recommendation for the hospital’s data.
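As a cross-check on the whole procedure, the nonlinear fit can be carried out with a hand-rolled Gauss-Newton iteration started from the log-transform estimates; this is an illustrative stdlib-Python sketch, not the text's Maple session:

```python
# Gauss-Newton fit of y ~ a * exp(b * t) to the Table 6.3 data,
# started from the log-transform estimates, standard library only.
from math import exp, log

t = [2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65]
y = [54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6]

# Starting values from fitting a line to (t, ln y).
n = len(t)
Y = [log(v) for v in y]
mt, mY = sum(t) / n, sum(Y) / n
b = sum((p - mt) * (q - mY) for p, q in zip(t, Y)) / sum((p - mt) ** 2 for p in t)
a = exp(mY - b * mt)

for _ in range(50):                                     # Gauss-Newton iterations
    r = [yi - a * exp(b * ti) for ti, yi in zip(t, y)]  # residuals
    Ja = [exp(b * ti) for ti in t]                      # d f / d a
    Jb = [a * ti * exp(b * ti) for ti in t]             # d f / d b
    # Solve the 2x2 normal equations (J^T J) delta = J^T r in closed form.
    saa = sum(u * u for u in Ja)
    sab = sum(u * v for u, v in zip(Ja, Jb))
    sbb = sum(v * v for v in Jb)
    ga = sum(u * ri for u, ri in zip(Ja, r))
    gb = sum(v * ri for v, ri in zip(Jb, r))
    det = saa * sbb - sab * sab
    da = (sbb * ga - sab * gb) / det
    db = (saa * gb - sab * ga) / det
    a += da
    b += db
    if abs(da) + abs(db) < 1e-12:
        break

sse = sum((yi - a * exp(b * ti)) ** 2 for ti, yi in zip(t, y))
print(round(a, 2), round(b, 4))    # close to Maple's NonlinearFit values
print(round(sse, 2))               # far below the linear model's 451.19
print(round(a * exp(b * 100), 2))  # index prediction at t = 100, about 1.1
```

The 2×2 normal equations are solved in closed form each step; with more parameters a general linear solver would take their place.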
The text’s PSMv2 library contains Nonlinear Regression, a program we wrote for nonlinear regression in Maple that prints a diagnostic report, and then returns a fit of the specified model.
We see that both coefficients are statistically significant (P-values < 0.05).
Just for illustration, change the model to a + b·exp(c·x) and rerun Nonlinear Regression.
Select the model with:
The new constant term a is not statistically significant (P-value ≫ 0.05).
Exercises
In Exercises 1 through 4 compute the correlation coefficient and then fit the following data with the given models using linear regression.
1.
| x | 1 | 2 | 3 | 4 | 5 |
| y | 1 | 1 | 2 | 2 | 4 |
- (a) y = a + bx
- (b) y = ax^{2}
- 2. Data from stretching a spring:
| x (×10^{-3}) | 5 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| y (×10^{5}) | 0 | 19 | 57 | 94 | 134 | 173 | 216 | 256 | 297 | 343 | 390 |
- (a) y = ax
- (b) y = a + bx
- (c) y = ax^{2}
- 3. Data for the ponderosa pine:
| x | 17 | 19 | 20 | 22 | 23 | 25 | 28 | 31 | 32 | 33 | 36 | 37 | 39 | 42 |
| y | 19 | 25 | 32 | 51 | 57 | 71 | 113 | 140 | 153 | 187 | 192 | 205 | 250 | 260 |
- (a) y = ax
- (b) y = a + bx
- (c) y = ax^{2}
- (d) y = ax^{3}
- (e) y = a_{3}x^{3} + a_{2}x^{2} + a_{1}x + a_{0}
- 4. Fit the model y = ax^{3/2} to Kepler’s planetary data:
| Body | Period (s) | Distance from sun (m) |
| Mercury | 7.60 × 10^{6} | 5.79 × 10^{10} |
| Venus | 1.94 × 10^{7} | 1.08 × 10^{11} |
| Earth | 3.16 × 10^{7} | 1.5 × 10^{11} |
| Mars | 5.94 × 10^{7} | 2.28 × 10^{11} |
| Jupiter | 3.74 × 10^{8} | 7.79 × 10^{11} |
| Saturn | 9.35 × 10^{8} | 1.43 × 10^{12} |
| Uranus | 2.64 × 10^{9} | 2.87 × 10^{12} |
| Neptune | 5.22 × 10^{9} | 4.5 × 10^{12} |
5. Fit the following data, Crew Experience vs. Total Flight Time, with a nonlinear exponential model y = ae^{bx}.
| Crew Experience (months) | Total Flight Time | Crew Experience (months) | Total Flight Time |
| 91.5 | 23.79812835 | 116.1 | 33.85966964 |
| 84 | 22.32110752 | 100.6 | 25.21871384 |
| 76.5 | 18.59206246 | 85 | 25.63021418 |
| 69 | 15.06492527 | 69.4 | 18.52772164 |
| 61.5 | 14.9960869 | 53.9 | 12.88023798 |
| 80 | 24.63492506 | 112.3 | 31.42422451 |
| 72.5 | 18.87937085 | 96.7 | 26.299381 |
| 65 | 18.60416326 | 81.1 | 19.6670982 |
| 57.5 | 13.6173644 | 65.6 | 15.21024716 |
| 50 | 15.62588379 | 50 | 11.977759 |
| 103 | 28.46073261 | 120 | 36.33722835 |
| 95.5 | 30.39886433 | 104.4 | 29.35965197 |
| 88 | 25.06898139 | 88.9 | 21.78073392 |
| 80.5 | 21.30460092 | 73.7 | 21.72030963 |
| 73 | 21.98724383 | 57.8 | 16.19643014 |
Projects
Project 6.1. Write a program without using statistical programs/software to
- (a) compute the correlation between the data, and
- (b) find the least squares fit for a general proportionality model y proportional to x^{n}.
Project 6.2. Write a program without using statistical programs/software to
- (a) compute the correlation between the data, and
- (b) find the least squares fit for a general polynomial model y = a_{0} + a_{1}x + a_{2}x^{2} + ... + a_{n}x^{n}.
- [1] Adapted from Neter et al. [NKNW1996] and Fox [Fox2012].