History
Linear regression has a twofold conceptual and mathematical history. The concept behind regression is commonly traced to Sir Francis Galton and his studies of heredity. Galton’s interest focused on how strongly the characteristics of one generation influenced the characteristics of the next generation. Using data he collected on sweet pea plants as well as the heights of fathers and sons, Galton plotted two-dimensional diagrams that illustrated the basic ideas of linear regression (see text box below on the term “regression”). Yet his perceptions were focused on the biological aspects of regression, rather than the mathematics, and others were left to expand it into the robust statistical technique we know today.
The mathematical proofs behind correlation and regression are usually attributed to Karl Pearson, who developed what is known as the Pearson Product Moment Correlation (PPMC). This is commonly referred to as the Pearson correlation coefficient r, and is used to describe the strength of the relationship between two variables without inferring causality (see Chapter 9). The values of r range from -1 to +1.
The square of this correlation coefficient, R², known as the coefficient of determination (which you will learn more about in this chapter), has a more intuitive meaning. It represents the portion of the variation in a dependent variable that can be explained by an independent variable or variables in a regression equation. Since r is squared, R² values range from 0 to 1.
The term “regression,” in the statistical sense, comes from Sir Francis Galton’s work on genetics, where he compared the heights of fathers and their sons. He found that extremely tall fathers had tall sons but usually not as tall as they (the fathers) were. Similarly, extremely short fathers had short sons but usually not as short as they were. In other words, the sons’ heights regressed or returned towards the mean or average height.
Mechanics
Ordinary least squares (OLS) is the most basic tool of linear regression. OLS can be applied as simple linear regression (with just one independent variable) or as multiple linear regression (with two or more independent variables).
Simple Linear Regression
In its most simple form, OLS regression evaluates the relationship between two variables: a continuous dependent variable and one (usually continuous) independent
224 Keunhyun Park et al.
variable, with the dependent variable expressed as a linear function of the independent variable. In other words, a change in the independent variable produces a constant change in the dependent variable. When this is the case, the relationship can be expressed using the equation:

Y = a + bX
where
a = intercept with the Y-axis when X = 0
b = the slope of the line (ΔY/ΔX)
X = value of the independent variable
Y = value of the dependent variable
For example, if a = 8 and b = 0.75, then the equation for your line is:

Y = 8 + 0.75(X)
and some corresponding (X, Y) values that fit directly on this line would be:

X    Y
0    8.00
1    8.75
2    9.50
3    10.25
As you can see, the relationship is constant. For every one-unit increase in X, Y increases by 0.75.
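This constant relationship is easy to verify in a few lines of code. A minimal Python sketch (the function name `predict` is ours, not the chapter's):

```python
# The example line Y = 8 + 0.75 * X from the text.
def predict(x, a=8.0, b=0.75):
    """Return the value of Y on the line Y = a + b * X."""
    return a + b * x

# Each one-unit increase in X raises Y by exactly b = 0.75.
for x in range(5):
    print(x, predict(x))  # 0 8.0, 1 8.75, 2 9.5, 3 10.25, 4 11.0
```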
Plotted, the line Y = 8 + 0.75(X) looks like the one shown in Figure 12.1.
In research, your data will never fall exactly on a straight line. For this reason, the standard regression equation describing the relationship between variables for a population also contains an error term ε:

Y = α + βX + ε

The error term ε is also known as the disturbance or remainder term. It plays a very important role in regression analysis.
With regression, you can determine an equation for the line that best represents (or fits) the relationships between the variables in your dataset. These are sometimes described as best-fit equations. Such equations are defined by the form and values of their parameters: α (the intercept) and β (the slope). These, again, apply to a population.
Like Daisa and Peers (1997), let’s say you want to explore the theory that traffic speeds increase with street width in residential areas. As a city planner, you have collected data on the traffic speed for five streets in your city, and you know the width of each street. Using this data, you can estimate the relationship between street width and traffic speed using linear regression. You will then be able to predict how a change in the built environment (street width) will impact human behavior (traffic speed) and to use this information for policy decisions about residential street standards in your city. (Note: We start with this simple hypothetical dataset to introduce the mathematics of linear regression. Following the multiple regression in the Step by Step section, we use a large dataset to illustrate realistic regression modeling in SPSS and R.)
The data you have collected is shown in Table 12.1.
Table 12.1 Hypothetical Data of Street Width and Traffic Speed
Street Width (X)    Traffic Speed (Y)
20                  25
25                  26
30                  28
35                  35
40                  39
Using simple linear regression, we will determine the straight-line equation that provides the best fit to the data:

Y = a + bX + e
Best fit means that our goal is to minimize the estimated error e, the difference between the true value of Y for each observation (each actual data point) and the estimated or predicted value of Y based on the linear equation. The constant a and coefficient b apply to a sample. The estimated error e is also known as the residual.
The square of the estimated errors for each data point, added all together, is called the sum of squared errors (SSE). The line with the lowest possible SSE best describes the relationship between the independent and dependent variables (hence the term least squares). Note that the errors must be squared because some will be positive (where the observed data point falls above the regression line), while others will be negative (where the observed data point falls below the line). Large errors on both sides could cancel out, suggesting a good fit to the data when the actual fit is very poor. Squaring the errors eliminates the negatives, thus avoiding this problem.
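To make the cancellation point concrete, here is a small Python sketch using the Table 12.1 data and an arbitrary candidate line (a = 10, b = 0.7; the function name `sse` is ours):

```python
# Street widths (X) and traffic speeds (Y) from Table 12.1.
widths = [20, 25, 30, 35, 40]
speeds = [25, 26, 28, 35, 39]

def sse(a, b, xs, ys):
    """Sum of squared errors for the line Y = a + b * X.
    Squaring keeps positive and negative errors from cancelling."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Raw (unsquared) errors partly cancel even when individual errors
# are large, so their plain sum is a poor measure of fit.
raw_errors = [y - (10 + 0.7 * x) for x, y in zip(widths, speeds)]
print(sum(raw_errors))               # about -2: errors offset each other
print(sse(10, 0.7, widths, speeds))  # about 13.5: every error counts
```

Here the raw errors (1, -1.5, -3, 0.5, 1) nearly cancel, while the SSE registers each one.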
Linear regression is the only form of regression where the best-fit equation can be computed directly. The other regression methods rely on numerical methods, iterating to best-fit values. In linear regression, we use two equations to determine the
best-fit line for a dataset. First, we calculate the value of b (the slope), which can be represented as:

b = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)²
where
X̄ = sample mean of X
Ȳ = sample mean of Y
i = subscript denoting the Xi, Yi values for each observation, and
Σ = the sum over every case (X, Y)
Second, we calculate the value of a (the intercept), which can be represented as:

a = Ȳ - bX̄
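The slope and intercept formulas can be sketched directly in Python, applied to the Table 12.1 data (the function name `ols_fit` is ours):

```python
widths = [20, 25, 30, 35, 40]   # X: street width (feet)
speeds = [25, 26, 28, 35, 39]   # Y: traffic speed (mph)

def ols_fit(xs, ys):
    """Return (a, b): the OLS intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared X deviations.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: a equals Y-bar minus b times X-bar.
    a = y_bar - b * x_bar
    return a, b

a, b = ols_fit(widths, speeds)
print(round(a, 2), round(b, 2))  # 8.4 0.74
```

The computed slope matches the coefficient of 0.74 used in the text.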
For our hypothetical example, therefore, we can calculate the best-fit line as follows. As shown in Table 12.2, Σ(Xi - X̄)(Yi - Ȳ) = 185 and Σ(Xi - X̄)² = 250; thus,

b = 185 / 250 = 0.74

and

a = 30.6 - 0.74(30) = 8.4

So, the best-fit regression line is:

Y = 8.4 + 0.74X
By plugging in different values of X (street width), you can now predict different values of Y (traffic speed). You know that the predicted value of Y will increase by 0.74 for each one-unit increase in X. That is, traffic speed is predicted to increase by 0.74
Table 12.2 Calculating Deviance
Street Width (X)    Traffic Speed (Y)    X - X̄    Y - Ȳ
20                  25                   -10       -5.6
25                  26                   -5        -4.6
30                  28                   0         -2.6
35                  35                   5         4.4
40                  39                   10        8.4
mph for each one-foot increase in street width. You must also understand, however, that your predicted values will be different from the actual values. There is a degree of error incorporated in the regression model, which becomes clear when we plot the estimated line along with the observed data (Figure 12.2).
It is important to emphasize that the regression model is not reality; it is a model of the world as we see it. Importantly, the model applies to the range of our experience, or maybe slightly beyond it. Notice in this figure that the model seems to imply that if the street width were zero, the traffic speed would be about 8 mph. The problem is that the model is only meant to apply to street widths from roughly 20 to 40 feet.
In regression equations, the notations for the parameters a and b are often switched out for b's, such that

Y = b₀ + b₁X + e
where
Y = dependent variable
X= independent variable
b = parameters (b₀ is the constant, b₁ the coefficient)
e = random error term, the difference between the observed value of Y and the predicted value of Y
We will use this notation for the remainder of the chapter.