Multicollinearity and diagnostics
Multicollinearity refers to a situation in which the explanatory variables in a multiple regression model are highly correlated. For obvious reasons it can never appear in the simple regression model, which has only one explanatory variable. In chapter 8 we briefly described the consequences of including the full exhaustive set of dummy variables created from a categorical variable with several categories. We referred to that as falling into the dummy variable trap. By including the full set of dummy variables, one ends up with a perfect linear relation between the set of dummies and the constant term. When that happens we have what is called perfect multicollinearity. In this chapter we discuss multicollinearity in more detail and focus on what is sometimes called imperfect multicollinearity, which refers to the case where a set of variables is highly, but not perfectly, correlated.
Multicollinearity is the lack of independence among the explanatory variables in a data set. It is a sample problem and a state of nature that results in relatively large standard errors for the estimated regression coefficients, but not in biased estimates.
The consequences of perfect correlation among the explanatory variables are most easily explained by an example. Assume that we would like to estimate the parameters of the following model:

$$Y = B_0 + B_1 X_1 + B_2 X_2 + U \tag{11.1}$$
where X2 is assumed to be an exact linear function of X1 in the following way:

$$X_2 = a + b X_1 \tag{11.2}$$
and where a and b are two arbitrary constants. If we substitute (11.2) into (11.1) we obtain:

$$Y = (B_0 + aB_2) + (B_1 + bB_2) X_1 + U \tag{11.3}$$
Since (11.1) and (11.2) imply (11.3), we can only obtain estimates of (B0 + aB2) and (B1 + bB2). But since these two expressions contain three unknown parameters, there is no way we can obtain estimates for all three parameters in (11.1). We simply need more information, which is not available. Hence, with perfect multicollinearity it is impossible to obtain separate estimates of the intercept and the slope coefficients.
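The identification problem can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available and using hypothetical values for a, b and the coefficients (none of them come from the text). It shows that the design matrix loses a rank when X2 is an exact linear function of X1, and that two different coefficient vectors produce identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
a, b = 2.0, 3.0              # hypothetical constants in X2 = a + b*X1
x1 = rng.normal(size=n)
x2 = a + b * x1              # perfect linear dependence on X1

# Design matrix with a constant, X1 and X2.
X = np.column_stack([np.ones(n), x1, x2])
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Two different (hypothetical) parameter vectors that share the same values of
# (B0 + a*B2) and (B1 + b*B2), and are therefore indistinguishable in the data.
c = 1.5
beta_one = np.array([1.0, 0.5, 2.0])
beta_two = beta_one + np.array([a * c, b * c, -c])
same_fit = np.allclose(X @ beta_one, X @ beta_two)
print("identical fitted values:", same_fit)
```

Because the rank is below the number of columns, the normal equations have no unique solution, which is the numerical counterpart of the argument above.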
This was an example of the extreme case of perfect multicollinearity, which is not very likely to happen in practice, other than when we end up in a dummy variable trap or a similar situation. It is more interesting to investigate the consequences for the parameter estimates and their standard errors when high correlation is present. We will start this discussion with the sample estimator of the slope coefficient B1 in (11.1) under the assumption that X1 and X2 are highly, but not perfectly, correlated. The situation for the sample estimator of B2 is identical to that of B1, so it is not necessary to look at both. The sample estimator for B1 is given by:

$$b_1 = \frac{r_{Y1} - r_{12}\, r_{Y2}}{1 - r_{12}^2} \cdot \frac{S_Y}{S_1} \tag{11.4}$$
The estimator b1 is a function of rY1, the correlation between Y and X1; r12, the correlation between X1 and X2; rY2, the correlation between Y and X2; and SY and S1, the standard deviations of Y and X1 respectively.
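As a check on how these quantities combine, the sketch below computes b1 from the correlations and standard deviations, using the standard two-regressor expression b1 = (rY1 - r12·rY2) / (1 - r12²) · SY/S1, and compares it with an ordinary least squares fit. The simulated data and the function name `slope_from_correlations` are assumptions made for illustration only.

```python
import numpy as np

def slope_from_correlations(r_y1, r_y2, r_12, s_y, s_1):
    """Slope on X1 in a regression of Y on a constant, X1 and X2."""
    return (r_y1 - r_12 * r_y2) / (1.0 - r_12 ** 2) * (s_y / s_1)

# Simulated data with correlated regressors (hypothetical values).
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Sample correlations: rows of the input are treated as variables.
r = np.corrcoef([y, x1, x2])
b1_formula = slope_from_correlations(
    r[0, 1], r[0, 2], r[1, 2], y.std(ddof=1), x1.std(ddof=1)
)

# The same slope from a direct least squares fit.
X = np.column_stack([np.ones(n), x1, x2])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("b1 from correlations:", b1_formula)
print("b1 from least squares:", b_ols[1])
```

The two numbers agree up to floating point error, since the correlation form is only a rewriting of the least squares solution.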
The first thing to observe is that r12 appears in both the numerator and the denominator, but that it is squared in the denominator and makes the denominator zero in the case of perfect correlation. With a strong correlation, the small denominator inflates the size of the expression, but since the correlation coefficient also appears in the numerator with a negative sign, it is difficult to say how the size of the estimate will change without further assumptions. However, it can be shown that the OLS estimators remain unbiased and consistent, which means that the estimated coefficients in repeated sampling will still center around the population coefficient. On the other hand, this property says nothing about how the estimator will behave in a specific sample. Therefore we will go through an example in order to shed some light on this issue.
Consider the following regression model:

$$Y = B_0 + B_1 X_1 + U$$
We would like to know how the estimate of B1 changes when we include another variable X2 that is highly correlated with X1. Using a random sample of 20 observations we calculate the following statistics.
For the simple regression case we obtain:
For the multiple regression case, when including both X1 and X2, we obtain:
Hence, when including an additional variable the estimated coefficient decreased in size as a result of the correlation between the two variables. Is it possible to find an example where the estimator increases in size in absolute terms? Well, consider the case where X2 is even more correlated with X1, let's say that r12 = 0.99. That would generate a negative estimate, and the small number in the denominator would make the estimate larger in absolute terms. It is also possible to make up examples where the estimator moves in the other direction. Hence, the estimated slope coefficient could move in any direction as a result of multicollinearity.
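To see how the direction and size of the change depend on r12, the sketch below evaluates the two-regressor slope expression for several values of r12. All numbers are hypothetical and are not the sample statistics of the example above: rY1 = 0.80 and rY2 = 0.85 are held fixed, and both standard deviations are set to one so that the simple regression slope equals rY1.

```python
def slope_b1(r_y1, r_y2, r_12, s_y=1.0, s_1=1.0):
    """Multiple regression slope on X1 expressed through correlations."""
    return (r_y1 - r_12 * r_y2) / (1.0 - r_12 ** 2) * (s_y / s_1)

def is_valid_correlation_set(r_y1, r_y2, r_12):
    """True if the three correlations can occur together, that is, the
    3x3 correlation matrix has a non-negative determinant."""
    det = 1.0 + 2.0 * r_y1 * r_y2 * r_12 - r_y1**2 - r_y2**2 - r_12**2
    return det >= 0.0

# Hypothetical correlations, chosen for illustration only.
R_Y1, R_Y2 = 0.80, 0.85
b1_simple = R_Y1             # simple regression slope when S_Y = S_1 = 1

b1_by_r12 = {}
for r_12 in (0.50, 0.90, 0.99):
    if not is_valid_correlation_set(R_Y1, R_Y2, r_12):
        raise ValueError(f"r12 = {r_12} is inconsistent with rY1 and rY2")
    b1_by_r12[r_12] = slope_b1(R_Y1, R_Y2, r_12)
    print(f"r12 = {r_12:.2f}: b1 = {b1_by_r12[r_12]:.4f}")

print(f"simple regression: b1 = {b1_simple:.4f}")
```

With these assumed values the slope falls from 0.80 in the simple regression to about 0.50 and 0.18, and then turns negative and larger than 0.80 in absolute value at r12 = 0.99. Other choices of rY1 and rY2 would give a different pattern, which is the point made in the text.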
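In order to analyze how the variance of the parameter estimates changes, it is informative to look at the equation for the variance. The variance of (11.4) is given by the following expression:

$$V(b_1) = \frac{1}{1 - r_{12}^2} \cdot \frac{\sigma^2}{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)^2} \tag{11.5}$$

where σ² denotes the variance of the error term U.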
When the correlation between X1 and X2 equals zero, the variance of the multiple regression coefficient coincides with the variance of the coefficient in the simple regression model. However, when the correlation equals 1 or -1, the variance given by (11.5) is undefined, just as the estimated slope coefficient is. In sum, the greater the degree of multicollinearity, the less precise the estimates of the parameters will be, which means that the estimated coefficients will vary a lot from sample to sample. But make no mistake: multicollinearity does not destroy the nice property of minimum variance among linear unbiased estimators. OLS still has minimum variance, but minimum variance does not mean that the variance will be small.
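Both properties, unbiasedness and the loss of precision, can be made visible with a small Monte Carlo experiment. This is a sketch under assumed settings (sample size 50, 2000 replications, true slope B1 = 2, population correlation either 0 or 0.95), not a result from the text.

```python
import numpy as np

def simulate_b1(rho, n=50, reps=2000, seed=0):
    """Return the OLS estimates of B1 over repeated samples in which
    X1 and X2 have population correlation rho."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for i in range(reps):
        x1 = rng.normal(size=n)
        # X2 has unit variance and correlation rho with X1.
        x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.normal(size=n)
        # Hypothetical population model with B1 = 2.
        y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[i] = b[1]
    return estimates

low = simulate_b1(rho=0.0)
high = simulate_b1(rho=0.95)

print("rho = 0.00: mean", low.mean(), "std", low.std(ddof=1))
print("rho = 0.95: mean", high.mean(), "std", high.std(ddof=1))
print("factor 1/(1 - rho^2) at rho = 0.95:", 1.0 / (1.0 - 0.95 ** 2))
```

In both settings the estimates should center close to 2, while the spread is markedly larger under high correlation, roughly in line with the square root of the factor 1/(1 - r12²).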
It seems that both the level of the estimated parameter and its standard error are affected by multicollinearity. But how does this affect the ratio between them, the t-value? It can be shown that the computed t-value will in general decrease, since the standard error is affected more strongly than the coefficient. This will usually result in non-significant parameter estimates.
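The effect on the standard error and the t-value can be inspected in a single sample. The sketch below, with assumed coefficients and simulated data, fits the same model twice using the same X1 and the same error draws, once with an uncorrelated X2 and once with a highly correlated X2. The helper `ols_with_t` is a hypothetical name introduced here; it applies the usual OLS formulas.

```python
import numpy as np

def ols_with_t(X, y):
    """OLS coefficients, standard errors and t-values for y = X b + u."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - k)          # estimated error variance
    se = np.sqrt(s2 * np.diag(xtx_inv))   # standard errors
    return b, se, b / se

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
z = rng.normal(size=n)   # independent component used to build X2
u = rng.normal(size=n)   # error term, shared by both designs

results = {}
for rho in (0.0, 0.95):
    x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * z
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + u      # hypothetical coefficients
    X = np.column_stack([np.ones(n), x1, x2])
    b, se, t = ols_with_t(X, y)
    results[rho] = (b[1], se[1], t[1])
    print(f"rho = {rho:.2f}: b1 = {b[1]:.3f}, se = {se[1]:.3f}, t = {t[1]:.2f}")
```

The standard error of b1 is clearly larger in the highly correlated design. The t-value in any one sample depends on the draw, so a single run illustrates the tendency rather than proving it.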
Another problem with multicollinearity is that the estimates will be very sensitive to changes in specification. This is a consequence of the fact that there is very little unique variation left to explain the dependent variable, since most of the variation is common to the two explanatory variables. Hence, the parameter estimates are very unstable, and sometimes this can even result in wrong signs for the regression coefficients, despite the fact that the estimator is unbiased. A wrong sign is a sign that is unexpected according to the underlying theoretical model, or according to prior beliefs based on common sense. However, sometimes we are dealing with inferior goods, which means that we have to be careful with what we call a "wrong" sign. Unexpected signs usually require more analysis to understand where they come from.