# Omission of a relevant variable

In chapter 3 we described how the error term can be seen as a collection of everything that is not accounted for by the observable variables included in the model. Recall also that the first assumption of the regression model is that everything relevant is included in the model. What are the consequences of not including everything that is relevant?

In order to answer that question we need to know the meaning of the word relevance. Unfortunately it has several meanings, and we usually distinguish between statistical and economic relevance. Statistical relevance refers to whether the coefficient is significantly different from zero, that is, whether we are able to reject the null hypothesis. If we are unable to reject the null hypothesis, we say that the variable has no statistical relevance.

Economic relevance is related to the underlying theory that the model is based on. Variables are included in the model because economic theory says they should be. That the coefficients of some variables are not significantly different from zero is not a criterion for exclusion. It is the economic relevance that makes the omission of a relevant variable problematic. To see this, consider the following two specifications:

The correct economic model: $Y = B_0 + B_1 X_1 + B_2 X_2 + U$ (7.14)

The estimated model: $Y = b_0 + b_1 X_1 + e$ (7.15)

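From chapter 3 we know that the sample estimator for the slope coefficient in the simple regression model is given by:

$$b_1 = \frac{\sum_i (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})}{\sum_i (X_{1i} - \bar{X}_1)^2} \qquad (7.16)$$

which, after substituting the correct model (7.14) for $Y$, may be rewritten as

$$b_1 = \frac{\sum_i (X_{1i} - \bar{X}_1)\left[B_1 (X_{1i} - \bar{X}_1) + B_2 (X_{2i} - \bar{X}_2) + (U_i - \bar{U})\right]}{\sum_i (X_{1i} - \bar{X}_1)^2} \qquad (7.17)$$

Simplify and take the expectation of the estimator, treating the explanatory variables as given:

$$E[b_1] = B_1 + B_2 \frac{\sum_i (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum_i (X_{1i} - \bar{X}_1)^2} \qquad (7.18)$$

Hence, the estimator is no longer unbiased. Its expected value is the true population parameter $B_1$ plus the true population parameter $B_2$ times a weight, the sample covariance between $X_1$ and $X_2$ divided by the sample variance of $X_1$, that persists even if the number of observations goes to infinity. Failure to include all relevant variables therefore makes the coefficients of the included variables biased and inconsistent.

However, if the excluded variable is statistically independent of the included variable, that is, if the covariance between $X_1$ and $X_2$ is zero, exclusion will not be a problem, since the second component of (7.18) will equal zero and the estimator will be unbiased. If the model includes several variables and one relevant variable is excluded, the bias will affect all the coefficients whose variables are correlated with the excluded variable.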
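The bias described above can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the parameter values, the normal distributions, the correlation `rho`, and the function name `simulate_omitted_variable` are all hypothetical choices made for illustration. It fits the short regression of $Y$ on $X_1$ and compares the slope with $B_1$ plus $B_2$ times the sample covariance of $X_1$ and $X_2$ over the sample variance of $X_1$.

```python
import numpy as np


def simulate_omitted_variable(n=100_000, B0=1.0, B1=2.0, B2=3.0, rho=0.5, seed=0):
    """Fit Y on X1 only when the true model also contains X2 (illustrative values)."""
    rng = np.random.default_rng(seed)
    # Hypothetical design: X1 and X2 have unit variance and correlation rho.
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    u = rng.normal(size=n)
    y = B0 + B1 * x1 + B2 * x2 + u

    # Slope of the simple regression of Y on X1, with X2 omitted.
    d1 = x1 - x1.mean()
    b1_short = np.sum(d1 * (y - y.mean())) / np.sum(d1**2)

    # B1 plus the bias term: B2 * sample Cov(X1, X2) / sample Var(X1).
    predicted = B1 + B2 * np.sum(d1 * (x2 - x2.mean())) / np.sum(d1**2)
    return b1_short, predicted


b1_short, predicted = simulate_omitted_variable()
print(f"short-regression slope: {b1_short:.3f}, B1 + bias term: {predicted:.3f}")

# With uncorrelated regressors the bias term vanishes.
b1_indep, _ = simulate_omitted_variable(rho=0.0)
print(f"slope when X1 and X2 are uncorrelated: {b1_indep:.3f}")
```

Under these assumed values the short-regression slope should settle near $B_1 + B_2 \rho$ rather than near $B_1$, and increasing `n` does not remove the gap, which is what inconsistency means here.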

A common example of this kind of bias appears in the human capital literature, where researchers try to estimate the return to education on earnings without including a variable for scholastic ability. The problem is common because most data sets do not include such information, while scholastic ability is correlated with both the number of years of schooling and earnings. Since scholastic ability is believed to be positively correlated with schooling as well as with earnings, the rates of return to education are usually overestimated, due to the second component in (7.18).

# Inclusion of an irrelevant variable

Another situation that often appears is associated with adding variables to the equation that are economically irrelevant. The researcher might be keen on avoiding the problem of excluding any relevant variables, and therefore include variables on the basis of their statistical relevance. Some of the included variables could then be economically irrelevant, which has consequences for the estimated coefficients. The important question to ask is what those consequences are. To see what happens when including economically irrelevant variables, we start by defining two equations:

The correct economic model: $Y = B_0 + B_1 X_1 + U$ (7.19)

The estimated model: $Y = b_0 + b_1 X_1 + b_2 X_2 + e$ (7.20)

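The estimated model (7.20) includes two variables, and $X_2$ is assumed to be economically irrelevant, which means that its coefficient is of minor interest. Writing lowercase letters for deviations from sample means ($x_{1i} = X_{1i} - \bar{X}_1$, and similarly for $x_{2i}$, $y_i$ and $u_i$), the OLS estimator of the coefficient for the other variable is given by:

$$b_1 = \frac{\sum x_{1i} y_i \sum x_{2i}^2 - \sum x_{2i} y_i \sum x_{1i} x_{2i}}{\sum x_{1i}^2 \sum x_{2i}^2 - \left(\sum x_{1i} x_{2i}\right)^2}$$

Substitute (7.19) for $Y$, which in deviation form reads $y_i = B_1 x_{1i} + u_i$, and take the expectation to obtain

$$E[b_1] = B_1 + E\left[\frac{\sum x_{1i} u_i \sum x_{2i}^2 - \sum x_{2i} u_i \sum x_{1i} x_{2i}}{\sum x_{1i}^2 \sum x_{2i}^2 - \left(\sum x_{1i} x_{2i}\right)^2}\right] = B_1$$

Hence, the OLS estimator is still unbiased. However, the standard error of the estimator is larger when extra irrelevant variables are included than in the model with only the relevant variables whenever $X_1$ and $X_2$ are correlated, since less independent variation in $X_1$ is left to estimate $B_1$. The price of including irrelevant variables is therefore paid in efficiency, and the estimator is no longer BLUE.

On the other hand, a loss in efficiency is less harmful than biased and inconsistent estimates. Therefore, when one is unsure about a model specification, one is better off including too many variables than too few. This approach is sometimes called kitchen-sink regression.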
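The trade-off between bias and efficiency can be illustrated with a small Monte Carlo sketch. Everything in it is an assumption made for illustration: the sample size, the parameter values, the correlation `rho` between $X_1$ and the irrelevant $X_2$, and the function name `monte_carlo_irrelevant`. It repeatedly draws samples from a model in which $X_2$ has no effect on $Y$ and estimates $b_1$ both with and without $X_2$.

```python
import numpy as np


def monte_carlo_irrelevant(reps=2000, n=50, B0=1.0, B1=2.0, rho=0.8, seed=1):
    """Compare b1 from the correct model and from a model with an irrelevant X2."""
    rng = np.random.default_rng(seed)
    b1_correct = np.empty(reps)
    b1_overfit = np.empty(reps)
    for r in range(reps):
        x1 = rng.normal(size=n)
        # X2 is correlated with X1 but does not enter the true model.
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
        y = B0 + B1 * x1 + rng.normal(size=n)

        X_correct = np.column_stack([np.ones(n), x1])
        X_overfit = np.column_stack([np.ones(n), x1, x2])
        # Index 1 of the least-squares solution is the coefficient on X1.
        b1_correct[r] = np.linalg.lstsq(X_correct, y, rcond=None)[0][1]
        b1_overfit[r] = np.linalg.lstsq(X_overfit, y, rcond=None)[0][1]
    return b1_correct, b1_overfit


b1_correct, b1_overfit = monte_carlo_irrelevant()
print(f"correct model:  mean {b1_correct.mean():.3f}, std {b1_correct.std():.3f}")
print(f"with extra X2:  mean {b1_overfit.mean():.3f}, std {b1_overfit.std():.3f}")
```

Both averages should lie close to the assumed $B_1$, while the spread of the estimates is wider when the irrelevant variable is included. Setting `rho=0.0` should make the two spreads nearly equal.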

# Measurement errors

Until now we have assumed that all variables, dependent as well as independent, have been measured without any errors. That is seldom the case and therefore it is important to understand the consequences it has on the OLS estimator. We are going to consider three cases: measurement error in Y only, measurement error in X only, and measurement error in both X and Y.

In order to analyze the consequences of the first case we have to assume a structure for the error. We assume that the measurement error is random and defined in the following way:

$$Y^* = Y + \varepsilon \qquad (7.22)$$

where $Y^*$ represents the observed variable, $Y$ the true variable, and $\varepsilon$ the random measurement error, which is independent of $Y$ and of $X$, with a mean equal to zero and a fixed variance $\sigma_\varepsilon^2$. Assume the following population model and substitute (7.22) for $Y$:

$$Y = B_0 + B_1 X + U \quad \Rightarrow \quad Y^* = B_0 + B_1 X + (U + \varepsilon) = B_0 + B_1 X + U^*$$

The new error term $U^*$ would still be uncorrelated with the independent variable $X$, so the sample estimators would still be consistent and unbiased. That is, we have

$$Cov(X, U^*) = Cov(X, U) + Cov(X, \varepsilon) = 0$$

However, the new error would have a larger variance than otherwise, that is, $V(\varepsilon + U) = \sigma_\varepsilon^2 + \sigma_U^2$. Remember that the measurement error is random, which implies that the population error term is uncorrelated with the measurement error. Hence the two variances simply add up to a larger total variance, which increases the standard errors of the estimates as well. The conclusion is that random measurement errors in the dependent variable do not matter much in practice.
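A short Monte Carlo sketch can make the first case concrete. The setup is assumed for illustration only: normal draws, the parameter values, the measurement-error standard deviation `sigma_eps`, and the names `slope` and `monte_carlo_error_in_y` are hypothetical. It estimates the slope once with the true $Y$ and once with the mismeasured $Y^*$.

```python
import numpy as np


def slope(x, y):
    """OLS slope of the simple regression of y on x."""
    dx = x - x.mean()
    return np.sum(dx * (y - y.mean())) / np.sum(dx**2)


def monte_carlo_error_in_y(reps=2000, n=50, B0=1.0, B1=2.0, sigma_eps=2.0, seed=2):
    """Slope estimates using the true Y and using Y* = Y + measurement error."""
    rng = np.random.default_rng(seed)
    slopes_true = np.empty(reps)
    slopes_noisy = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        y = B0 + B1 * x + rng.normal(size=n)
        # Random measurement error, drawn independently of X and U.
        y_star = y + sigma_eps * rng.normal(size=n)
        slopes_true[r] = slope(x, y)
        slopes_noisy[r] = slope(x, y_star)
    return slopes_true, slopes_noisy


slopes_true, slopes_noisy = monte_carlo_error_in_y()
print(f"true Y:     mean {slopes_true.mean():.3f}, std {slopes_true.std():.3f}")
print(f"observed Y*: mean {slopes_noisy.mean():.3f}, std {slopes_noisy.std():.3f}")
```

The average slope should stay close to the assumed $B_1$ in both cases, while the spread is wider with $Y^*$, in line with the inflated error variance.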

In the second case the measurement error is attached to the independent variable, still under the assumption that the error is random. Assume that the observed variable is defined in the following way:

$$X^* = X + \varepsilon$$

with an error component $\varepsilon$ that is independent of $X$ and has a mean of zero and a fixed variance $\sigma_\varepsilon^2$. When the observed explanatory variable is defined in this way, the population regression equation is affected as follows. The model we would like to study is defined as

$$Y = B_0 + B_1 X + U$$

but we only observe $X^*$, which implies that the model becomes

$$Y = B_0 + B_1 (X^* - \varepsilon) + U = B_0 + B_1 X^* + (U - B_1 \varepsilon) = B_0 + B_1 X^* + U^*$$

The mean value of the new error term is still zero, and the variance is somewhat inflated compared to the case with no measurement error. That is, $V(U^*) = V(U - B_1 \varepsilon) = \sigma_U^2 + B_1^2 \sigma_\varepsilon^2$. Unfortunately, the new error term is no longer uncorrelated with the explanatory variable. The measurement error creates a correlation different from zero that biases the OLS estimators. That is,

$$Cov(X^*, U^*) = Cov(X + \varepsilon,\; U - B_1 \varepsilon) = -B_1 \sigma_\varepsilon^2$$

Hence, the covariance is different from zero if there is a linear relation between $X$ and $Y$, that is, if $B_1 \neq 0$. The only way to avoid this problem is to force the variance of the measurement error to zero. This is of course difficult in practice.
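The second case can also be sketched in a simulation. All names and values here are assumptions made for illustration (normal draws, `sigma_x`, `sigma_eps`, the function `simulate_error_in_x`). The sketch regresses $Y$ on the mismeasured $X^*$ in one large sample and compares the slope with $B_1 \sigma_X^2 / (\sigma_X^2 + \sigma_\varepsilon^2)$, the large-sample value implied by dividing the covariance of $X^*$ and $Y$ by the variance of $X^*$ under these assumptions. This expression does not appear in the text above and is included only as a check.

```python
import numpy as np


def simulate_error_in_x(n=200_000, B0=1.0, B1=2.0, sigma_x=1.0, sigma_eps=1.0, seed=3):
    """Regress Y on X* = X + eps and inspect the induced error correlation."""
    rng = np.random.default_rng(seed)
    x = sigma_x * rng.normal(size=n)
    u = rng.normal(size=n)
    eps = sigma_eps * rng.normal(size=n)  # random measurement error in X
    y = B0 + B1 * x + u
    x_star = x + eps                      # what the researcher observes

    # OLS slope using the mismeasured regressor.
    dx = x_star - x_star.mean()
    b1_star = np.sum(dx * (y - y.mean())) / np.sum(dx**2)

    # Composite error of the model written in terms of X*.
    u_star = u - B1 * eps
    cov_xstar_ustar = np.cov(x_star, u_star)[0, 1]

    # Large-sample slope implied by Cov(X*, Y) / Var(X*) under these assumptions.
    implied = B1 * sigma_x**2 / (sigma_x**2 + sigma_eps**2)
    return b1_star, cov_xstar_ustar, implied


b1_star, cov_xstar_ustar, implied = simulate_error_in_x()
print(f"slope on X*: {b1_star:.3f} (implied large-sample value {implied:.3f})")
print(f"Cov(X*, U*): {cov_xstar_ustar:.3f} (theory: -B1 * sigma_eps^2)")
```

With the assumed values the slope should be pulled toward zero, well below $B_1$, and the sample covariance between $X^*$ and $U^*$ should be close to $-B_1 \sigma_\varepsilon^2$. Lowering `sigma_eps` toward zero moves the slope back toward $B_1$.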

The third case considers the combined effects of measurement errors in both the dependent and independent variables. This case adds nothing new to the discussion, since the effect is the same as when only the explanatory variable contains measurement error. That means that the OLS estimators are both biased and inconsistent, and the problem derives primarily from the error in the explanatory variable.