Multiple Linear Regression

Whereas simple linear regression uses only two variables—one dependent and one independent—multiple regression involves multiple independent variables. Planners primarily use multiple regression rather than simple regression. The world is usually not so simple as to have the variance in one single variable completely explain the variance in another. Thus, if researchers do not include multiple variables in their models, they run the risk of under-specifying their regression equation, which may produce biased regression coefficients.

In Chapter 5, you were introduced to several types of variables: determinants, correlates, confounders, moderators, and mediators. All may find their way into multiple regression analyses (and more sophisticated models we consider subsequently). In multiple regression, we particularly want to control for confounding variables. These variables, as you may recall, are correlated with determinants and causally related to outcome variables. A failure to control for them, through multiple regression, will cause researchers to attribute some of the effects of confounding variables to other determinants. We momentarily return to our example of street width versus traffic speed to illustrate this phenomenon.

With multiple variables, we must think in terms of multiple dimensions. With three variables, for example—one dependent variable (Y) and two independent variables (X1 and X2)—we would be estimating a best-fit plane (see Figure 12.5). Just as with


Figure 12.5 A Best-Fit Plane of a Regression With Two Independent Variables

the best-fit line, the best-fit plane minimizes the distance between the actual values of Y and the predicted values of Y. The constant is the intercept of the plane with the Y-axis when both X variables equal zero.
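To make the best-fit plane concrete, here is a minimal sketch in Python that fits a plane by ordinary least squares. The street widths, setbacks, and speeds below are invented for illustration; they are not the values from Table 12.4.

```python
import numpy as np

# Hypothetical data (illustrative only, not the book's Table 12.4):
# five street segments with width (ft), average setback (ft), and speed (mph).
width   = np.array([24.0, 30.0, 36.0, 40.0, 44.0])
setback = np.array([10.0, 15.0, 12.0, 20.0, 25.0])
speed   = np.array([22.0, 26.0, 29.0, 31.0, 34.0])

# Design matrix with a column of ones for the constant (the plane's Y-intercept).
X = np.column_stack([np.ones_like(width), width, setback])

# Ordinary least squares: minimizes the sum of squared vertical distances
# between the actual and predicted values of Y, i.e., finds the best-fit plane.
coeffs, *_ = np.linalg.lstsq(X, speed, rcond=None)
a, b1, b2 = coeffs
print(f"intercept a = {a:.3f}, b1 (width) = {b1:.3f}, b2 (setback) = {b2:.3f}")
```

The constant `a` is where the fitted plane crosses the Y-axis when both X variables equal zero, as described above.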

For example, in our hypothetical study, factors other than street width likely affect traffic speed. One possibility could be that the average setback of buildings from the street edge affects traffic speed. It has long been speculated that small building setbacks reduce the perceived width of streets, thus calming traffic. If you wanted to evaluate both setbacks and street widths together, you would use a multiple regression equation with these two independent variables. The Y-axis/dependent variable would still be traffic speed, but now you would have two X-axes—one for street width and one for average setback.

A data table for our new example might look like Table 12.4.

The generic regression equation for this example would be:

Y = a + b1X1 + b2X2 + e

If you perform calculations like those shown in the preceding simple regression section, you would find that the best-fit regression equation is:

Multiple regression estimates the marginal effect of each independent variable, holding the other independent variables constant. This equation suggests that, controlling for average building setback, a one-foot increase in street width is associated with a 0.56 mph increase in traffic speed. This is less than the 0.74 mph marginal increase we estimated using simple linear regression with only one independent variable, average street width. This means that average building setback confounds the relationship between average street width and traffic speed: it is correlated with street width, and it accounts for some of street width's apparent effect when setback is not controlled.
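The confounding effect can be demonstrated with simulated data. The sketch below (all numbers are hypothetical, not the book's example) generates a setback variable correlated with street width, then compares the width coefficient from a simple regression against the width coefficient from a multiple regression that controls for setback:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data-generating process (illustrative, not from the text):
# setback is correlated with width, and both independently raise speed.
width = rng.normal(30, 5, n)
setback = 0.5 * width + rng.normal(0, 2, n)           # confounder: correlated with width
speed = 10 + 0.5 * width + 0.3 * setback + rng.normal(0, 1, n)

def ols(y, *xs):
    """Least-squares coefficients (constant first) for y on the given X columns."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_simple = ols(speed, width)[1]            # omits the confounder
b_multi  = ols(speed, width, setback)[1]   # controls for it

# The simple-regression slope absorbs part of setback's effect (here roughly
# 0.5 + 0.3 * 0.5 = 0.65), while the multiple-regression slope recovers the
# true marginal effect of width (about 0.5).
print(f"simple: {b_simple:.2f}, multiple: {b_multi:.2f}")
```

Failing to control for the confounder attributes some of setback's effect to street width, just as the text describes.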

The statistics for interpreting multiple regression models are the same as for simple linear regression, except that each independent variable in the multiple regression model has its own t-statistic. Each t-statistic evaluates whether or not a particular variable’s coefficient (bn) is significantly different from zero.

Again, if you perform calculations like those shown in the preceding simple regression section, you should find the following values:

Table 12.4 Hypothetical Data of Street Width, Average Setback, and Traffic Speed

Street Width (X1)    Average Setback (X2)    Traffic Speed (Y)
t-statistic for b1 = 2.016

t-statistic for b2 = 0.728

A helpful way to understand these results is to compare them to the simple linear regression. The R-squared is higher, as it must be with an additional independent variable, so the model fits the data better. However, the t-statistics for both coefficient estimates are lower in the multiple regression model, as is the F-statistic.

First, consider the F-statistic. Recall that the degrees of freedom are, respectively, p-1 for the numerator, or 2, and n-p for the denominator, also 2. As the numerator degrees of freedom increase with more independent variables, the denominator degrees of freedom decrease. The critical F-value at the 0.05 significance level for these degrees of freedom is 19.0. Since our F-statistic is less than the critical value, we conclude that our model as a whole is not significant at the 0.05 level. The probability that our results are due to chance is greater than 5 percent, but only slightly (6.5 percent). If we adopt the 0.10 significance level instead, the F-statistic becomes significant.
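The critical F-values in question can be checked with scipy (assuming it is available): for 2 numerator and 2 denominator degrees of freedom, the 0.05-level cutoff is 19.0 and the 0.10-level cutoff is 9.0.

```python
from scipy.stats import f

# Critical F-values for 2 numerator and 2 denominator degrees of freedom.
crit_05 = f.ppf(0.95, dfn=2, dfd=2)   # 0.05 significance level
crit_10 = f.ppf(0.90, dfn=2, dfd=2)   # 0.10 significance level
print(round(crit_05, 1), round(crit_10, 1))  # 19.0 9.0
```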

Now consider the t-statistics. Since we've added a variable to the equation, we now have one fewer degree of freedom (n-p = 2). For 2 degrees of freedom at the 0.05 significance level, the critical t-value is 2.92. Since our t-statistics (2.016 and 0.728, respectively) are both smaller than this critical value, we cannot conclude that either coefficient is significantly different from zero.
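The 2.92 cutoff can likewise be reproduced with scipy's t distribution. Note that 2.92 is the one-tailed critical value at the 0.05 level with 2 degrees of freedom; the corresponding two-tailed value would be about 4.30.

```python
from scipy.stats import t

crit_one_tailed = t.ppf(0.95, df=2)    # one-tailed test, 0.05 level
crit_two_tailed = t.ppf(0.975, df=2)   # two-tailed test, 0.05 level
print(round(crit_one_tailed, 2), round(crit_two_tailed, 2))  # 2.92 4.3
```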

Three things are going on here that distinguish this case from the single-independent-variable case, where we established statistical significance. Recall that the standard error of a coefficient is computed from the residual variance, SSE/(n-p), so it grows as degrees of freedom are lost.

Controlling for average building setback has reduced the apparent strength of the relationship between average street width and traffic speed, reducing the value of b1. At the same time, losing a degree of freedom has increased the value of SEb1 (see preceding formula) and increased the critical t-values for statistical significance. The critical values for t and F fall dramatically once df gets past 4 or so. We kept the sample size this small only to simplify the hand calculations in the simple regression section. With seven or eight street segments, the results could have been very different.

This book does not cover Bayesian statistics, but a Bayesian perspective is relevant here. It is not that street width is unrelated to traffic speed; there is ample evidence that the two are related. It is just that we cannot be confident (at a conventional level) of a relationship based on such a small sample of streets. Given a larger sample, the same value of b1 might have been statistically significant, and we could have asserted a relationship between the two variables with some confidence. Likewise, having reviewed previous studies of street width versus traffic speed, we would not give too much weight to the negative finding in this case (Ewing, 2012).

Overall, we would conclude that while the model as a whole performs fairly well (it is significant at the 0.10 level), the individual independent variables do not perform well and are not easily interpreted.

As you might imagine, multiple regression generally involves more than two independent variables. In generic terms, we want to estimate an equation with k independent variables that minimizes the sum of squared errors between the observed values of Y and the predicted values of Y on a hyperplane in k dimensions. The equation for a regression model with k independent variables is:


Y = a + b1X1 + b2X2 + ... + bkXk + e

where:

Y = dependent variable
X1 to Xk = independent variables
b1 to bk = associated parameters
e = random error term

The Y-intercept is still the constant term (the value of Y when all independent variables are zero), and the coefficient bk for each variable Xk still represents the slope of Y with respect to Xk, holding the other independent variables constant. The R-squared value still represents how much variation in Y is explained by the independent variables collectively. The F-statistic still determines whether or not the model as a whole is significant, and the t-statistic for each independent variable still determines whether or not that variable's coefficient can be considered significantly different from zero.
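All of these statistics can be computed directly from the least-squares formulas. The following sketch (the function name and data are our own, not from the text) fits a k-variable model and returns the coefficients, R-squared, the F-statistic, and a t-statistic for each coefficient:

```python
import numpy as np

def ols_summary(y, x_vars):
    """Least-squares fit with k independent variables.

    Returns coefficients (constant first), R-squared, the F-statistic
    for the model as a whole, and a t-statistic per coefficient."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n)] + [np.asarray(x, dtype=float) for x in x_vars])
    k = X.shape[1] - 1            # number of independent variables
    p = k + 1                     # parameters, including the constant

    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sse = resid @ resid                          # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)            # total sum of squares
    r2 = 1 - sse / sst

    f_stat = (r2 / k) / ((1 - r2) / (n - p))     # tests the model as a whole
    se = np.sqrt(sse / (n - p) * np.diag(np.linalg.inv(X.T @ X)))
    t_stats = b / se                             # one t-statistic per coefficient
    return b, r2, f_stat, t_stats
```

Passing the Y values and a list of X columns returns everything needed to apply the significance tests described above.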
