The Sums of the Squared Distances to the Mean
Each dot in figure 21.4 is physically distant from the dotted mean line by a certain amount. The sum of the squares of these distances to the mean line is the smallest sum possible (that is, the smallest cumulative prediction error you could make), given that you only know the mean of the dependent variable. The distances from the dots above the line to the mean are positive; the distances from the dots below the line to the mean are negative. The sum of the actual distances is zero. Squaring the distances gets rid of the negative numbers.
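To see both claims numerically, here is a minimal Python sketch. The five scores are hypothetical (they are not the TFR data from the text), and the helper names are made up for illustration. It checks that the signed distances to the mean sum to zero and that the sum of squared distances grows when the reference line is moved away from the mean.

```python
# Hypothetical scores for illustration only; not the data from the text.
scores = [2.0, 3.0, 5.0, 6.0, 9.0]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sum_of_squares(values, center):
    """Sum of squared distances from each value to `center`."""
    return sum((v - center) ** 2 for v in values)


m = mean(scores)

# Signed distances: positive above the mean, negative below it.
deviations = [v - m for v in scores]
signed_sum = sum(deviations)

# Squared distances measured from the mean, and from a line one unit higher.
ss_at_mean = sum_of_squares(scores, m)
ss_elsewhere = sum_of_squares(scores, m + 1.0)

print("mean:", m)
print("sum of signed distances:", signed_sum)
print("sum of squares about the mean:", ss_at_mean)
print("sum of squares about mean + 1:", ss_elsewhere)
```

Any reference value other than the mean should give a larger sum of squares, which is what "smallest sum possible" means here.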
But suppose you do know the data in table 21.14 regarding the infant mortality rate for each of those 10 countries. Can you reduce the prediction error in guessing the TFR for those countries? Could you draw another line through figure 21.4 that “fits” the dots better and reduces the sum of the squared distances from the dots to the line?
You bet you can. The solid line that runs diagonally through the graph in figure 21.4 minimizes the prediction error for these data. This line is called the best fitting line, or the least squares line, or the regression line. When you understand how this regression line is derived, you’ll understand how correlation works.
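The formula for the regression line is

$$y = a + bx$$

where y is the value of the dependent variable, a and b are constants (which you’ll learn how to derive in a moment), and x is the value of the independent variable. The constant a is computed as

$$a = \frac{\sum y - b\sum x}{N}$$

and b is computed as

$$b = \frac{N\sum xy - \sum x \sum y}{N\sum x^2 - \left(\sum x\right)^2}$$

where N is the number of pairs of scores. Because a depends on b, compute b first.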
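As a rough sketch of the arithmetic, the Python function below computes the two constants from the column sums, assuming the usual least-squares computing formulas: b = (NΣxy − ΣxΣy) / (NΣx² − (Σx)²) and a = (Σy − bΣx) / N. The function name `regression_constants` and the five data pairs are hypothetical; they are not the infant mortality and TFR figures from the text.

```python
def regression_constants(x, y):
    """Return (a, b) for the least-squares line y = a + b*x.

    Uses the raw-sum computing formulas, so it only needs the
    column totals: sum of x, sum of y, sum of x*y, and sum of x squared.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of scores")
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = (sum_y - b * sum_x) / n                     # intercept, needs b first
    return a, b


# Hypothetical pairs of scores for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = regression_constants(x, y)
print("a =", round(a, 4), "b =", round(b, 4))
```

For these made-up pairs the sums are Σx = 15, Σy = 20, Σxy = 66, and Σx² = 55, so b works out to 30/50 = 0.6 and a to about 2.2.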
Table 21.15 shows the data needed for finding the regression equation for the raw data in table 21.14. At the bottom of table 21.15 you’ll find a shortcut formula (formula 21.22) for calculating Pearson’s r directly from the data in the table.
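The same column totals are enough to get Pearson’s r. The sketch below assumes the standard computing formula for r, which may differ in layout from the shortcut formula in the table: r = (NΣxy − ΣxΣy) / √([NΣx² − (Σx)²][NΣy² − (Σy)²]). The function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's r from raw column sums (standard computing formula)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of scores")
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when x or y has no variation")
    return numerator / denominator


# Hypothetical pairs of scores for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
print("r =", round(r, 4))
```

Note that the numerator is the same one used for the constant b, which is one way to see how closely the regression line and the correlation are tied together.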
The constant b is:
and the constant a is then:
The regression equation for any pair of scores on infant mortality (x) and TFR (y), then, is:
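Once a and b are in hand, using the regression equation is just a matter of plugging in x. The sketch below uses hypothetical data and the constants that least squares gives for those data (a = 2.2, b = 0.6); these are not the constants for the infant mortality and TFR data. It compares the squared prediction error from always guessing the mean of y with the error from guessing a + bx.

```python
# Hypothetical pairs of scores for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares constants for the hypothetical data above (assumed here,
# not taken from the text).
a = 2.2
b = 0.6


def predict(x_value):
    """Predicted y for a given x on the line y = a + b*x."""
    return a + b * x_value


mean_y = sum(y) / len(y)

# Squared error when the mean of y is the guess for every case.
sse_mean = sum((yi - mean_y) ** 2 for yi in y)

# Squared error when the regression line supplies the guess.
sse_line = sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y))

print("error guessing the mean:", sse_mean)
print("error guessing a + bx: ", round(sse_line, 4))
```

The line’s sum of squared errors comes out smaller than the mean’s, which is the sense in which knowing x reduces the prediction error.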