Interpreting Simple Linear Regression Results
Once you’ve created a regression model with your data, the next step is to assess how good the estimates actually are. In our simple linear regression example, we want to know whether the relationship between street width and traffic speed is merely due to chance, or whether it can be generalized as a relationship for all residential streets and thus used as a planning tool.
The basic fit of the model is evaluated using three statistics: R-squared, F-statistic, and t-statistic(s). These evaluate the performance of the overall model (R-squared and F-statistic) and the parameters within the model (t-statistic). When you read research articles that employ regression analysis, R-squared, F-statistic, and t-statistic(s) are the values you usually see reported. We explain each statistic here, in general terms, before illustrating with specific calculations from our hypothetical dataset. Note that whatever statistical software package you use for conducting regression analysis will calculate these for you (as we show using SPSS or R later), but it is important to know where these values actually come from.
R-squared: Goodness of Fit
The R-squared statistic measures the proportion of the variation in the dependent variable explained by the independent variable(s). It is equal to one minus the ratio of the SSE to the sum of squared deviations about the mean of the dependent variable (this is the measure of the total variation of the dependent variable):
R^{2} = 1 - \frac{SSE}{\sum (Y_{i} - \bar{Y})^{2}}
A high value of R-squared suggests that the model explains the variation in the dependent variable well. This is very important when using a model for predictive or forecasting purposes but is less important when searching for evidence that one variable influences another (hypothesis testing).
Regressions with low R-squared values will often, but not always, yield parameter estimates with small t-statistics for any null hypothesis (see the t-Statistic(s) section below). A low R-squared value may indicate that important factors have been omitted from the regression model, which raises the concern of omitted variable bias (see data-related problems later in the chapter).
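The definition of R-squared above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `r_squared` is our own, and the observed and fitted values are taken from the hypothetical dataset in Table 12.3 later in this section.

```python
def r_squared(observed, predicted):
    """Return 1 - SSE/SST for paired observed and predicted values."""
    mean_y = sum(observed) / len(observed)
    # SSE: sum of squared residuals (observed minus predicted)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    # SST: sum of squared deviations about the mean of the dependent variable
    sst = sum((y - mean_y) ** 2 for y in observed)
    return 1 - sse / sst


# Traffic speeds (Y) and fitted values from Table 12.3
speed = [25, 26, 28, 35, 39]
speed_hat = [23.2, 26.9, 30.6, 34.3, 38.0]

r2 = r_squared(speed, speed_hat)
print(round(r2, 3))
```

With these values the SSE is 12.3 and the total sum of squares is 149.2, so the result is a little under 0.92.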
F-statistic
The F-statistic, or F-ratio, tells you the significance of the model as a whole. The F-statistic evaluates all of the coefficients in the model jointly, indicating whether they collectively are different from zero. Along with the R-squared value, this is a way to tell how well the regression equation fits the data.
F = \frac{R^{2} / (p - 1)}{(1 - R^{2}) / (n - p)}
where
R^{2} = the model’s R-squared value
n = the number of observations
p = the number of parameters in the regression equation (number of coefficients plus the constant)
Intuitively, the F-statistic formula compares the accuracy of the model (R^{2}) to the inaccuracy (1 - R^{2}), adjusting for the degrees of freedom (df). For the numerator of the preceding equation, R^{2}, df equals p - 1. For the denominator of the preceding equation, (1 - R^{2}), df equals n - p.
The F-statistic relies on the F distribution, a probability distribution with a table of critical values (see Figure 12.3). The critical value at any significance level depends upon the degrees of freedom of the model (df_{1}, df_{2}). If the computed F-statistic is greater than the critical value from the F table for the appropriate degrees of freedom, then you can reject the null hypothesis that your regression coefficients are jointly equal to zero, or equivalently reject the null hypothesis that the model with its independent variable(s) is no better than the null model with a constant term only.
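As a rough sketch of the calculation and the comparison against a table value, the snippet below computes the F-statistic from R-squared, n, and p. The function name `f_statistic` is our own. The inputs are illustrative (an R-squared of about 0.9176 with five observations and two parameters, as in the hypothetical dataset), and the critical value is the tabled 0.01 value for 1 and 3 degrees of freedom quoted later in this section.

```python
def f_statistic(r2, n, p):
    """Return (F, df1, df2) where F = (R^2 / (p - 1)) / ((1 - R^2) / (n - p))."""
    df1 = p - 1  # numerator degrees of freedom
    df2 = n - p  # denominator degrees of freedom
    f_value = (r2 / df1) / ((1 - r2) / df2)
    return f_value, df1, df2


# Illustrative inputs: R-squared of about 0.9176, n = 5 observations, p = 2 parameters
f_value, df1, df2 = f_statistic(0.9176, n=5, p=2)

# Critical value copied from an F table for (1, 3) df at the 0.01 level
critical_f_01 = 34.116

print(round(f_value, 1), (df1, df2))
print("exceeds 0.01 critical value:", f_value > critical_f_01)
```

Here the F-statistic comes out at about 33.4 on (1, 3) degrees of freedom, just short of the 0.01 critical value.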
t-Statistic(s)
The t-statistic, also called the t-ratio, is defined by the following equation:
t = \frac{b_{1}}{SE_{b_{1}}}
where b_{1} is the slope of the regression line and SE_{b_{1}} is the standard error of the slope. This value tells you the significance of the coefficient estimate b_{1}.
The significance of the t-statistic is based on the probability distribution known as the Student’s t distribution (see Chapter 10 and Figure 12.4). The value against which the t-statistic is compared comes from a t table and depends upon degrees of freedom, n - p. If the absolute value of the t-statistic calculated from your model is greater than the critical t value for the appropriate degrees of freedom, then you can reject the null hypothesis that the coefficient b_{1} is equal to zero, which is equivalent to rejecting the claim that there is no relationship between the independent variable and the dependent variable.
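The t-ratio can be sketched as follows. The standard error of the slope is not spelled out in the text above, so this sketch assumes the usual formula for simple linear regression: the square root of the mean squared error, SSE / (n - p), divided by the sum of squared deviations of X about its mean. The function name `slope_t_statistic` is our own, and the slope of 0.74 is the one implied by the fitted values in Table 12.3.

```python
import math


def slope_t_statistic(x, residuals, slope, p=2):
    """Return (t, SE of slope) for a simple linear regression.

    Assumes SE_b1 = sqrt((SSE / (n - p)) / sum((x - mean_x)^2)).
    """
    n = len(x)
    mean_x = sum(x) / n
    sse = sum(e ** 2 for e in residuals)        # sum of squared residuals
    sxx = sum((xi - mean_x) ** 2 for xi in x)   # squared deviations of X
    se_slope = math.sqrt((sse / (n - p)) / sxx)
    return slope / se_slope, se_slope


# Street widths and residuals from Table 12.3; slope implied by the fitted values
width = [20, 25, 30, 35, 40]
residuals = [1.8, -0.9, -2.6, 0.7, 1.0]

t_value, se_b1 = slope_t_statistic(width, residuals, slope=0.74)
print(round(se_b1, 3), round(t_value, 2))
```

The degrees of freedom for the comparison against a t table are n - p, which is 3 for this dataset.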
Figure 12.3 Probability Distribution Curves for F-Statistic With Different Degrees of Freedom
Figure 12.4 Probability Distribution Curves for t-Statistic With Different Degrees of Freedom
It is standard convention to adopt the 95 percent confidence level, or 0.05 significance level, in such comparisons. At the 0.05 significance level, only 5 percent of the area under the probability distribution curve lies beyond the critical value of F or t. Thus, while it is possible to obtain a value of F or t larger than this critical value by chance, it is unlikely. In other words, if there were truly no relationship, a value of F or t this large would occur less than 5 percent of the time, so you can conclude with 95 percent confidence that the observed relationship between X_{1} and Y is due not to chance but to a real association between the variables.
For our hypothetical dataset, expanded with values of the error term, we can calculate these three statistics (Table 12.3).
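From the residuals and the deviations about the mean in Table 12.3, the SSE is 12.3 and the total sum of squared deviations is 149.2, so
R^{2} = 1 - \frac{12.3}{149.2} \approx 0.918
Our equation explains almost 92 percent of the variation in Y, or equivalently, street width explains almost 92 percent of the variation in traffic speed.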
Table 12.3 Calculating Error Terms
| Street Width (X) | X - \bar{X} | Traffic Speed (Y) | Predicted Value of Y (\hat{Y}) | Residual (Y - \hat{Y}) | Y - \bar{Y} |
| --- | --- | --- | --- | --- | --- |
| 20 | -10 | 25 | 23.2 | 1.8 | -5.6 |
| 25 | -5 | 26 | 26.9 | -0.9 | -4.6 |
| 30 | 0 | 28 | 30.6 | -2.6 | -2.6 |
| 35 | 5 | 35 | 34.3 | 0.7 | 4.4 |
| 40 | 10 | 39 | 38 | 1 | 8.4 |
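The columns of Table 12.3 can be reproduced with a short script. This is a minimal sketch that assumes the line was fitted by ordinary least squares; the function name `fit_line` and the variable names are our own.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Cross-products and squared deviations about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


width = [20, 25, 30, 35, 40]   # street width (X)
speed = [25, 26, 28, 35, 39]   # traffic speed (Y)

intercept, slope = fit_line(width, speed)
predicted = [intercept + slope * w for w in width]
residuals = [s - s_hat for s, s_hat in zip(speed, predicted)]

# One row per street: X, Y, predicted Y, residual
for w, s, s_hat, e in zip(width, speed, predicted, residuals):
    print(w, s, round(s_hat, 1), round(e, 1))
```

Under that assumption the fitted line has an intercept of about 8.4 and a slope of about 0.74, and the rounded predictions and residuals match the table.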
With an R-squared of about 0.9176, five observations, and two parameters, the F-statistic for our model is
F = \frac{0.9176 / 1}{(1 - 0.9176) / 3} \approx 33.4
The critical F-statistic for 1 and 3 degrees of freedom is 34.116 at the 0.01 significance level. Since our F-ratio is slightly smaller than this, but close, we look to the 0.05 significance level. The critical value for 1 and 3 degrees of freedom at the 0.05 significance level is 10.128, well below the F-statistic of the model. We therefore conclude that our model as a whole is significant at approximately the p = 0.01 level.
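The t-statistic for the slope is
t = \frac{b_{1}}{SE_{b_{1}}}
where
SE_{b_{1}} = \sqrt{\frac{\sum (Y_{i} - \hat{Y}_{i})^{2} / (n - p)}{\sum (X_{i} - \bar{X})^{2}}}
The symbol \sqrt{} represents the square root; thus, with a slope of b_{1} = 0.74 (the change in predicted speed per unit of street width in Table 12.3),
SE_{b_{1}} = \sqrt{\frac{12.3 / 3}{250}} \approx 0.128
t = \frac{0.74}{0.128} \approx 5.78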
The critical t-value for 3 degrees of freedom (five observations minus two parameters) at the 0.01 significance level (one-tailed) is 4.5407. Since our t-statistic is larger than the critical value at this significance level, we can conclude that the coefficient estimate b_{1} is significant at the 0.01 level or beyond. There is less than one chance in 100 that an association this strong between X_{1} and Y would arise by chance alone.
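Instead of bracketing the result between table entries, the tail probability can be computed directly. The sketch below relies on a closed-form expression for the Student's t distribution that holds only for exactly 3 degrees of freedom; for other degrees of freedom a statistical package would be needed. The function name `t_cdf_3df` is our own, and the t-statistic of about 5.78 is the value implied by Table 12.3 (a slope of 0.74 divided by a standard error of about 0.128).

```python
import math


def t_cdf_3df(t):
    """CDF of Student's t with exactly 3 degrees of freedom (closed form).

    F(t) = 1/2 + (theta + sin(theta) * cos(theta)) / pi, theta = atan(t / sqrt(3)).
    """
    theta = math.atan(t / math.sqrt(3))
    return 0.5 + (theta + math.sin(theta) * math.cos(theta)) / math.pi


t_value = 5.78  # slope t-statistic implied by Table 12.3

one_tailed_p = 1 - t_cdf_3df(t_value)   # area in the upper tail only
two_tailed_p = 2 * one_tailed_p         # area in both tails

print(round(one_tailed_p, 4), round(two_tailed_p, 4))
```

The one-tailed probability falls below 0.01, in line with the comparison against 4.5407, while the two-tailed probability sits almost exactly at 0.01, which matches the F-test conclusion that the model is significant at approximately the p = 0.01 level.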