Measuring the degree of multicollinearity
Three measures of the degree of multicollinearity are often suggested in the literature: the correlation matrix, the Variance Inflation Factor (VIF), and the tolerance measure. All statistical measures have their limitations, so it is useful to consult several of them when investigating the statistical properties of a data set.
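Assume that we would like to estimate the parameters of the following model, with dependent variable Y and the three explanatory variables X1, X2 and X3:
$$Y = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + U \qquad (11.6)$$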
We suspect that the explanatory variables are highly correlated and would like to investigate the matter. A natural starting point is a simple correlation matrix, which shows the pairwise correlations between the variables and is easily produced with any statistical software. Using a random sample of 20 observations we obtained the following results:
Table 11.1 A correlation matrix for the explanatory variables in (11.6)
As can be seen from Table 11.1, some of the variables are highly correlated with each other, such as X1 and X2. X1 is also correlated with X3, but to a much lower degree, and the correlation between X2 and X3 is basically zero. From the results of the table we should expect B1 and B2 to be difficult to estimate with good precision, since the correlation between X1 and X2 will inflate their standard errors.
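As a rough sketch of how such a correlation matrix can be produced, the Python snippet below uses NumPy on simulated data. The data, the seed and the variable names are assumptions made for illustration only; they are not the 20 observations behind Table 11.1.

```python
import numpy as np

# Simulated data only: NOT the 20 observations behind Table 11.1.
rng = np.random.default_rng(0)
n = 20
z = rng.normal(size=n)                  # common component shared by x1 and x2
x1 = z + 0.3 * rng.normal(size=n)
x2 = z + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)                 # generated independently of x1 and x2

X = np.column_stack([x1, x2, x3])       # one column per explanatory variable
corr = np.corrcoef(X, rowvar=False)     # 3x3 matrix of pairwise correlations
print(np.round(corr, 2))
```

The matrix is symmetric with ones on the diagonal, so only the entries below (or above) the diagonal need to be read.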
To further analyze the multicollinearity we turn to the next measure, the Variance Inflation Factor (VIF). It is defined in the following way:
$$\mathrm{VIF} = \frac{1}{1 - R^2}$$
where R2 is the squared multiple-correlation coefficient. For a specific explanatory variable, the squared multiple-correlation coefficient measures the strength of the linear relationship between that variable and the rest of the explanatory variables in the model. It is nothing other than the coefficient of determination from an auxiliary regression of that variable on the other explanatory variables. That is, each explanatory variable is in turn regressed on an intercept and all the other explanatory variables (for example X1 on X2 and X3), and the coefficient of determination from that regression is used to compute the VIF of that variable.
When the model contains only two explanatory variables, the squared multiple correlation coincides with the squared bivariate correlation coefficient between the two variables. If we look at (11.5) we see that the variance inflation factor appears in that expression as the factor that multiplies the variance of the coefficient in the simple regression model. Hence it takes the case of no correlation as its reference and measures how much the variance is inflated by the correlation. The variance in the case of more than two explanatory variables has a similar expression, but with the squared multiple-correlation coefficient in place of the squared bivariate correlation.
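The auxiliary regressions described above can be sketched in a few lines of Python with NumPy. The function name `vif` and the simulated data are assumptions made for this illustration; statistical packages provide their own routines.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (n observations by k variables)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]                                    # variable being explained
        others = np.delete(X, j, axis=1)               # all remaining variables
        A = np.column_stack([np.ones(n), others])      # auxiliary regression with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)                      # VIF = 1 / (1 - R2)
    return out

# Simulated two-variable example (not data from the text).
rng = np.random.default_rng(0)
z = rng.normal(size=50)
X2 = np.column_stack([z + 0.5 * rng.normal(size=50),
                      z + 0.5 * rng.normal(size=50)])
r = np.corrcoef(X2, rowvar=False)[0, 1]
print(vif(X2), 1.0 / (1.0 - r ** 2))
```

With only two explanatory variables both VIF values equal 1/(1 - r^2), where r is the bivariate correlation, which is the special case mentioned above. Under perfect collinearity R2 equals one and the ratio is undefined, so this sketch assumes no exact linear dependence.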
VIF takes values from 1 upward, without any upper bound. The closer the multiple-correlation coefficient is to one, the larger the value of VIF. The other multicollinearity statistic, the tolerance, is part of the definition of VIF: it is the denominator of the VIF expression, that is, 1 minus the squared multiple-correlation coefficient. Since the squared multiple-correlation coefficient is a coefficient of determination, we can interpret it as such: it measures how large a share of the variation in one variable is explained by the group of other variables. Hence, 1 minus this share is the part of the variation that is unique to the specific variable and can be used to explain the dependent variable.
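As a small worked illustration of the relation between the two statistics, consider a few values of the coefficient of determination from the auxiliary regression of variable j, written R_j^2 here. The values are chosen for the arithmetic only and are not taken from Table 11.2.

```latex
\[
\text{Tolerance}_j = 1 - R_j^2, \qquad
\text{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\text{Tolerance}_j}
\]
\[
R_j^2 = 0 \;\Rightarrow\; \text{Tolerance}_j = 1,\ \text{VIF}_j = 1; \qquad
R_j^2 = 0.90 \;\Rightarrow\; \text{Tolerance}_j = 0.10,\ \text{VIF}_j = 10; \qquad
R_j^2 = 0.99 \;\Rightarrow\; \text{Tolerance}_j = 0.01,\ \text{VIF}_j = 100
\]
```

The two statistics therefore carry the same information; one is simply the reciprocal of the other.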
When do VIF and the tolerance indicate a multicollinearity problem? We can shed some light on that question with an example. Let us go back to model (11.6) and check the VIF and the tolerance for its variables. Most statistical software has routines for this, so we should not need to run the auxiliary regressions ourselves. Using one such routine in SPSS we obtained the following regression results and collinearity statistics:
Table 11.2 Regression results for (11.6)
From Table 11.2 we see that none of the coefficients is significantly different from zero. We also observe that the coefficient of determination is above 80 percent, which implies that more than 80 percent of the variation in the dependent variable is explained by the three explanatory variables included in the model. Furthermore, the test of overall significance of the model is highly significant, which is in line with the measure of fit. This picture is a good example of the consequences of high correlation between the explanatory variables: it inflates the standard errors of the coefficients even though the model as a whole has explanatory power.
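The statistics discussed here (coefficients, standard errors, t-values, the coefficient of determination and the overall F-statistic) can be computed with a short NumPy sketch. The data below are simulated with two nearly collinear regressors and hypothetical true coefficients; they do not reproduce Table 11.2, and the exact numbers depend on the random draw.

```python
import numpy as np

# Simulated data (not the data behind Table 11.2).
rng = np.random.default_rng(2)
n = 20
z = rng.normal(size=n)
x1 = z + 0.05 * rng.normal(size=n)             # x1 and x2 are nearly collinear
x2 = z + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + x1 + x2 + x3 + rng.normal(size=n)    # hypothetical true coefficients, all 1

X = np.column_stack([np.ones(n), x1, x2, x3])  # intercept plus three regressors
k = X.shape[1] - 1                             # number of slope coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates
resid = y - X @ beta
s2 = resid @ resid / (n - k - 1)               # estimated error variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se                                  # t-values for H0: coefficient = 0
r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
F = (r2 / k) / ((1.0 - r2) / (n - k - 1))      # test of overall significance

print("t-values:", np.round(t, 2))
print("R2:", round(r2, 3), "F:", round(F, 2))
```

In a design like this the standard errors of the coefficients on x1 and x2 are many times larger than the standard error of the coefficient on x3, so their t-values tend to be small even when R2 and the F-statistic are large.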
When we face a situation like this, we should go on and calculate multicollinearity statistics such as VIF and the tolerance for each variable. In the table we see that VIF takes large values for all variables compared with the case of no correlation, which results in VIF=1. From the analysis of the pairwise correlations we know that the relatively large VIF for X3 is mainly due to the correlation with X2, since the correlation with X1 is very weak. Even so, VIF for X3 is 104 and the corresponding tolerance is as low as 0.009, which means that X1 and X2 leave only 0.9 percent of the total variation in X3 as unique variation that can be used to explain the variation in Y. The remaining variation is shared with the other two variables and hence cannot be attributed to X3 alone. One should observe that even though the pairwise correlations are relatively low, the multiple correlation is much higher, which emphasizes the shortcomings of looking only at pairwise correlations.
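That modest pairwise correlations can coexist with a very large VIF can be illustrated with a constructed example. In the sketch below x3 is, by assumption, close to the difference between two highly correlated variables; this construction and the simulated data are hypothetical and are not claimed to be the mechanism behind Table 11.2.

```python
import numpy as np

# Large simulated sample so the pattern is stable; not the data from the text.
rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)
x1 = z + 0.3 * rng.normal(size=n)
x2 = z + 0.3 * rng.normal(size=n)
x3 = x1 - x2 + 0.03 * rng.normal(size=n)       # almost an exact linear combination

r13 = np.corrcoef(x1, x3)[0, 1]                # pairwise correlations with x3
r23 = np.corrcoef(x2, x3)[0, 1]

# Auxiliary regression of x3 on an intercept, x1 and x2.
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, x3, rcond=None)
resid = x3 - A @ coef
r2_aux = 1.0 - resid @ resid / np.sum((x3 - x3.mean()) ** 2)
vif3 = 1.0 / (1.0 - r2_aux)

print("pairwise:", round(r13, 2), round(r23, 2))
print("auxiliary R2:", round(r2_aux, 4), "VIF:", round(vif3, 1))
```

Neither pairwise correlation with x3 is large in absolute value, yet x1 and x2 together explain almost all of its variation, so the correlation matrix alone would not reveal the problem.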
With a given specification and data set there is not much one can do about the multicollinearity problem. It could therefore be seen as a state of nature in which the data offer no information about some of the hypotheses that could otherwise have been tested using t-tests on the individual parameters of the model.
Doing nothing is most often not a very attractive alternative. If it is possible to obtain more data, that would be a good solution. It would not remove the multicollinearity, but the small unique variation that exists would be based on more observations, and if the increase in the number of observations is large enough it could help to increase the precision of the estimators. Obtaining more data, and sometimes very much more data, is often costly and/or time consuming and is therefore often not an option.
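The effect of more data can be sketched numerically. The snippet below holds the correlation structure between two regressors fixed and compares the standard error of one slope coefficient for a small and a large sample. The helper name `se_and_vif`, the sample sizes and the assumption of a known error standard deviation `sigma` are all illustrative choices.

```python
import numpy as np

def se_and_vif(n, rng, sigma=1.0):
    """Standard error of the x1 slope (sigma assumed known) and the VIF of x1."""
    z = rng.normal(size=n)
    x1 = z + 0.3 * rng.normal(size=n)          # same correlation structure for any n
    x2 = z + 0.3 * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    se_b1 = sigma * np.sqrt(np.linalg.inv(X.T @ X)[1, 1])
    r = np.corrcoef(x1, x2)[0, 1]
    return se_b1, 1.0 / (1.0 - r ** 2)         # two regressors: VIF = 1 / (1 - r^2)

rng = np.random.default_rng(4)
se_small, vif_small = se_and_vif(20, rng)
se_large, vif_large = se_and_vif(2000, rng)
print("n=20:   se =", round(se_small, 3), " VIF =", round(vif_small, 1))
print("n=2000: se =", round(se_large, 3), " VIF =", round(vif_large, 1))
```

The VIF stays of the same order of magnitude in both samples because the correlation between the regressors is unchanged, while the standard error falls as the number of observations grows.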
Another alternative would be to change the variable specification, for instance by dropping one of the variables. However, if the original specification was economically relevant, we know that dropping a relevant variable makes the estimated parameters biased and inconsistent. Hence, we would only replace one problem with another, and this alternative is therefore in general not very attractive.
A further approach would be to rethink the model so that it can be expressed in a different way, for instance by categorizing one of the problematic variables. In the example discussed above it was problematic to include X1 and X2 in the same regression. If we replace X1, which is a continuous variable, with level indicators (dummy variables), this could weaken the strong correlation with X2 and increase the precision of the estimates. However, that would mean a slightly different model, and we have to be willing to accept that.
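Replacing a continuous variable with level indicators can be sketched as follows. The choice of three levels, the tercile cut points and the simulated values of x1 are hypothetical; in practice the levels should be chosen on substantive grounds.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=20)                       # simulated continuous variable

cuts = np.quantile(x1, [1 / 3, 2 / 3])         # hypothetical cut points: terciles
level = np.digitize(x1, cuts)                  # 0 = low, 1 = middle, 2 = high

# One dummy per level except the reference level ("low"), which is left out
# so that the dummies are not perfectly collinear with the intercept.
dummies = (level[:, None] == np.array([1, 2])).astype(float)
print(dummies[:5])
```

The two dummy columns would then enter the regression in place of x1, and their coefficients are interpreted relative to the omitted reference level.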
The literature describes other, more or less restrictive, methods for handling this problem, but none of them is very convincing in its way of reducing it. We will therefore not go into any of those more advanced techniques here, since they require statistical knowledge beyond the scope of this text. But remember, multicollinearity is a state of nature and is therefore not something that you solve, but something that you have to live with.