The econometric model
We now have an economic model and we know how to interpret its parameters. It is therefore time to formulate the econometric model so that we will be able to estimate the size of the population parameters and test the implied hypothesis. The economic model is linear so we will be able to use linear regression analysis.
The function expressed by (3.1) represents an average individual. Hence, when we collect data, individual observations will typically not fall on the regression line. We might have households with the same disposable income but different levels of food expenditure. It might even be the case that not a single observation is located on the regression line. This is something that we have to deal with. To the observer it might appear that the individual observations are scattered randomly around the regression line. In statistical analysis we therefore control for the individual deviation from the regression line by adding a stochastic term (U) to (3.1), still under the assumption that the average observation will fall on the line. The econometric model is therefore:
$$Y_i = B_0 + B_1 X_{1i} + U_i \qquad (3.2)$$
The formulation of the econometric model will now be true for all households, but the estimated population parameters will refer to the average household that is considered in the economic model. That is explicitly denoted by the subscript i, which appears on Y, X and U but not on the parameters. We call expression (3.2) the population regression equation.
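As a rough illustration of the population regression equation (3.2), the sketch below simulates many households with the same disposable income. The parameter values (B0, B1, SIGMA) and the normal distribution of the error term are assumptions made only for this example; they do not come from the text.

```python
import random

random.seed(0)

# Hypothetical population parameters (illustrative values, not from the text).
B0, B1 = 50.0, 0.2      # intercept and slope of the population regression line
SIGMA = 10.0            # standard deviation of the error term U (assumed)

def simulate_household(x):
    """Return food expenditure Y for disposable income x: Y = B0 + B1*x + U."""
    u = random.gauss(0.0, SIGMA)   # stochastic error term with mean zero
    return B0 + B1 * x + u

# Many households with the same disposable income but different expenditures.
income = 1000.0
expenditures = [simulate_household(income) for _ in range(10_000)]

on_the_line = B0 + B1 * income                      # point on the regression line
average = sum(expenditures) / len(expenditures)     # average observed expenditure
exactly_on_line = sum(1 for y in expenditures if y == on_the_line)

print(f"regression line at X1={income}: {on_the_line}")
print(f"average simulated Y: {average:.2f}")
print(f"observations exactly on the line: {exactly_on_line}")
```

Under these assumptions, individual households deviate from the line, while their average lies close to it, which is the sense in which the line describes the average household.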
Adding a stochastic term may seem arbitrary, but it is in fact very important and comes with a number of assumptions that must be fulfilled. In the literature the name of the stochastic term differs from book to book: it is called the error term, the residual term, the disturbance term, and so on. In this text we will call the stochastic term of the population model the error term, and when talking about the sample model we will refer to it as the residual term.
One important rationale for the error term, already mentioned, is to make the equality in equation (3.2) hold true for all observations. The reason why it does not hold true in the first place could be omitted variables. It is quite reasonable to believe that many other variables are important determinants of household food expenditure, such as family size, age composition of the household, education etc. There might in fact be a large number of factors that completely determine the food expenditure, and some of them might be family specific. To be general we may say that:
$$Y = f(X_1, X_2, \ldots, X_k)$$
with k explanatory factors that completely determine the value of the dependent variable Y, where disposable income ($X_1$) is just one of them. Hence, having access to only one explanatory variable, we may write the complete model in the following way for a given household:
$$Y = B_0 + B_1 X_1 + U$$
where U collects the combined effect of the remaining factors $X_2, \ldots, X_k$.
Hence everything left unaccounted for will be summarized in the term U, which will make the equality hold true. This way of thinking of the error term is very useful. However, even if we have access to all relevant variables, there is still some randomness left since human behavior is not totally predictable or rational. It is seldom the ambition of the researcher to include everything that matters, but just the most relevant factors. As a rule of thumb one should try to have a model that is as simple as possible, and avoid including variables whose combined effect is very small, since they serve little purpose. The model should be a simplified version of reality. The ambition is never to replicate reality with the model, since that would make the model too complicated.
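The idea that U absorbs everything left out of the model can be sketched as follows. The "complete" model below, with family size and a share of children as the omitted factors, is purely hypothetical: its functional form, variable names and coefficients are assumptions chosen for illustration. The omitted factors are centered so that their combined effect averages roughly zero.

```python
import random

random.seed(1)

# The part of the model the researcher writes down (hypothetical values).
B0, B1 = 50.0, 0.2

def complete_model(income, family_size, children_share):
    """Hypothetical complete model: Y is fully determined by three factors.
    The functional form and the coefficients are assumptions for illustration."""
    omitted = 30.0 * (family_size - 3.5) + 40.0 * (children_share - 0.5)
    return B0 + B1 * income + omitted

households = []
for _ in range(5):
    income = random.uniform(500, 3000)
    family_size = random.randint(1, 6)
    children_share = random.random()
    y = complete_model(income, family_size, children_share)
    # U is whatever is needed to make Y = B0 + B1*X1 + U hold for this household.
    u = y - (B0 + B1 * income)
    households.append((income, y, u))
    print(f"X1={income:8.1f}  Y={y:8.1f}  U={u:7.1f}")
```

Each household gets its own value of U, and by construction the equality holds exactly for every one of them even though only income appears as an explanatory variable.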
Sometimes you may receive data that have been rounded off, which makes the observations for the variable less precise. Errors of measurement are therefore yet another source of randomness that the researcher sometimes has no control over. If these measurement errors are made randomly over the sample, they are often a minor problem. But if the size of the error is correlated with the dependent variable it might be problematic. In chapter 7 we will discuss this issue thoroughly.
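A small simulation can suggest why rounding of the dependent variable is often a minor problem. The sketch below assumes hypothetical parameter values and uses the standard least-squares formulas for the intercept and slope, which are brought in here only for illustration. It rounds food expenditure to the nearest 10 and compares the estimated slopes.

```python
import random

random.seed(2)

def ols(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: income X and food expenditure Y with a true slope of 0.2.
x = [random.uniform(500, 3000) for _ in range(2000)]
y_exact = [50.0 + 0.2 * xi + random.gauss(0, 10) for xi in x]

# The same expenditures as they might be delivered: rounded to the nearest 10.
y_rounded = [round(yi, -1) for yi in y_exact]

_, slope_exact = ols(x, y_exact)
_, slope_rounded = ols(x, y_rounded)

print(f"slope with exact Y:   {slope_exact:.4f}")
print(f"slope with rounded Y: {slope_rounded:.4f}")
```

In this setup the rounding error behaves much like extra random noise in Y, so the two slope estimates should be nearly identical. The simulation does not cover the problematic case where the size of the error is related to the variables in the model.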
The assumptions of the simple regression model
The assumptions made on the population regression equation, and on the error term in particular, are important for the properties of the estimated parameters. It is therefore important to have a sound understanding of what the assumptions are and why they are important. The assumptions that we state below are given for a single observation, which means that no subscripts will be used. That is very important to remember! The assumptions must hold for each observation.
Assumption 1: $Y = B_0 + B_1 X_1 + U$
The relation between Y and X is linear and the value of Y is determined for each value of X. This assumption also imposes that the model is complete in the sense that all relevant variables have been included in the model.
Assumption 2: $E[Y \mid X_1] = B_0 + B_1 X_1$, $E[U \mid X_1] = E[U] = 0$
The conditional expectation of the error term is zero. Furthermore, there must not be any relation between the error term and the X variable, which is to say that they are uncorrelated. This means that the variables left unaccounted for in the error term should have no relationship with the variable X included in the model.
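To see why Assumption 2 matters, the following sketch compares two hypothetical situations: one where the omitted factor is unrelated to income, and one where it rises with income. All numbers and variable names are assumptions for illustration, and the least-squares slope formula is used here ahead of its formal treatment.

```python
import random

random.seed(4)

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

N = 5000
income = [random.uniform(500, 3000) for _ in range(N)]

# Case 1: the omitted factor has no relationship with income.
z_unrelated = [random.gauss(0, 1) for _ in range(N)]
# Case 2: the omitted factor tends to rise with income (Assumption 2 fails).
z_related = [(xi - 1750) / 500 + random.gauss(0, 1) for xi in income]

def expenditure(xi, zi):
    """Hypothetical complete model with a true income slope of 0.2."""
    return 50.0 + 0.2 * xi + 30.0 * zi

y_unrelated = [expenditure(xi, zi) for xi, zi in zip(income, z_unrelated)]
y_related = [expenditure(xi, zi) for xi, zi in zip(income, z_related)]

slope_unrelated = ols_slope(income, y_unrelated)
slope_related = ols_slope(income, y_related)

print(f"true slope: 0.2")
print(f"estimated slope, omitted factor unrelated to X: {slope_unrelated:.3f}")
print(f"estimated slope, omitted factor related to X:   {slope_related:.3f}")
```

When the omitted factor moves together with income, the estimated slope picks up part of its effect and drifts away from the true value. When the two are unrelated, the estimate stays close to 0.2.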
Assumption 3: $V[Y] = V[U] = \sigma^2$
The error term is homoscedastic, that is, its variance is constant across observations. Since Y and U differ only by a constant for a given value of X, their variances must be the same.
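The difference between a constant and a non-constant error variance can be sketched by comparing the spread of U among low-income and high-income households. The income range, the cut-off at 1750 and both error-generating rules below are assumptions for illustration only.

```python
import random
import statistics

random.seed(3)

def error_variance_by_group(draw_u, n=20_000):
    """Draw n (income, error) pairs and return the variance of the error
    for low-income and high-income households separately."""
    low, high = [], []
    for _ in range(n):
        x = random.uniform(500, 3000)
        u = draw_u(x)
        (low if x < 1750 else high).append(u)
    return statistics.pvariance(low), statistics.pvariance(high)

# Homoscedastic errors: the same standard deviation for every income level.
homo_low, homo_high = error_variance_by_group(lambda x: random.gauss(0, 10))

# Heteroscedastic errors: the standard deviation grows with income.
hetero_low, hetero_high = error_variance_by_group(lambda x: random.gauss(0, 0.01 * x))

print(f"homoscedastic:   low={homo_low:7.1f}  high={homo_high:7.1f}")
print(f"heteroscedastic: low={hetero_low:7.1f}  high={hetero_high:7.1f}")
```

Under Assumption 3 the two group variances should be roughly equal, as in the first case. In the second case the variance is clearly larger for high-income households, which is what a violation of the assumption looks like.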
Assumption 4: $\operatorname{Cov}(U_i, U_j) = \operatorname{Cov}(Y_i, Y_j) = 0$, $i \neq j$
The covariance between any pair of error terms is zero. When we have access to a randomly drawn sample from a population this will be the case.
Assumption 5: X needs to vary in the sample.
X cannot be a constant within a given sample, since we are interested in how variation in X affects variation in Y. Furthermore, it is a mathematical necessity that X takes at least two different values in the sample. However, we are going to assume that X is fixed from sample to sample. That means that the expected value of X is X itself (like a constant), and the variance of X must be zero when working with the regression model. But within a sample there needs to be variation. This assumption is often imposed to make the mathematics easier to deal with in introductory texts, and fortunately it has no effect on the nice properties of the OLS estimators that will be discussed at the end of this chapter.
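The mathematical necessity of variation in X can be made concrete with the standard least-squares slope formula, in which the sum of squared deviations of X from its mean appears in the denominator. The function name and the small data sets below are hypothetical and serve only as a sketch.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x; fails when x does not vary."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)   # variation of X in the sample
    if sxx == 0:
        raise ValueError("X does not vary in the sample; the slope is undefined")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

# X takes two different values: the slope can be computed.
slope = ols_slope([1000, 1000, 2000, 2000], [250, 260, 450, 460])
print(f"slope with variation in X: {slope}")

# X is constant within the sample: the denominator is zero.
try:
    ols_slope([1000, 1000, 1000, 1000], [250, 260, 240, 255])
    constant_x_failed = False
except ValueError as err:
    constant_x_failed = True
    print(f"no variation in X: {err}")
```

With a constant X there is nothing to divide by, so no line through the data can be singled out. Two distinct values of X are already enough to pin down a slope.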
Assumption 6: U is normally distributed with mean zero and variance $\sigma^2$.
This assumption is necessary in small samples. The assumption affects the distribution of the estimated parameters, and in order to perform tests we need to know their distribution. As a rule of thumb, when the sample is larger than 100 the distribution of the estimated parameters is close to the normal distribution even if the error term is not normally distributed. For that reason this assumption is often treated as optional in textbooks.
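The large-sample argument can be explored with a small simulation. The sketch below assumes hypothetical parameter values, keeps X fixed from sample to sample, and draws clearly non-normal (skewed) errors. It then checks how many of the estimated slopes fall within 1.96 standard deviations of their mean, which would be about 95 percent for a normal distribution.

```python
import random
import statistics

random.seed(5)

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

N, REPS = 200, 2000     # sample size and number of repeated samples (assumed)
x = [random.uniform(500, 3000) for _ in range(N)]   # X is fixed across samples

slopes = []
for _ in range(REPS):
    # Skewed errors with mean zero: an exponential draw minus its mean, scaled.
    u = [10.0 * (random.expovariate(1.0) - 1.0) for _ in range(N)]
    y = [50.0 + 0.2 * xi + ui for xi, ui in zip(x, u)]
    slopes.append(ols_slope(x, y))

mean_b1 = statistics.fmean(slopes)
sd_b1 = statistics.stdev(slopes)
share = sum(abs(b - mean_b1) <= 1.96 * sd_b1 for b in slopes) / REPS

print(f"mean of estimated slopes: {mean_b1:.4f}")
print(f"share within 1.96 standard deviations: {share:.3f}")
```

A share near 0.95 is consistent with an approximately normal distribution of the estimated slope, although a single check of this kind is only suggestive and not a formal test of normality.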
Remember that when we are dealing with a sample, the error term is not observable. That means it is impossible to calculate its mean and variance with certainty, which makes it important to impose assumptions. Furthermore, these assumptions must hold true for each single observation, and hence using only one observation to compute a mean and a variance is meaningless.