# Qualitative variables with several categories

The human capital model described above includes a continuous variable for the number of years of schooling. Including schooling as a continuous variable rests on the belief that hourly wages are determined by the number of years spent in school. An alternative view is that it is the level of schooling, that is, the diploma received, that matters for the wage rate. That calls for a qualitative variable with more than two categories. For instance, a variable D could be coded 1 for primary schooling, 2 for secondary schooling and 3 for post secondary schooling.

In order to include D directly in a regression model we must be willing to assume that the effect on the hourly wage rate of going from primary to secondary schooling is of the same size as the effect of going from secondary to post secondary schooling. If that is not the case, we have to allow the two effects to differ. There are at least two approaches to this problem.

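The first and most basic approach is to create three binary variables, one for each educational level, in the following way:

$$
D_1 = \begin{cases} 1 & \text{primary schooling} \\ 0 & \text{otherwise} \end{cases}
\qquad
D_2 = \begin{cases} 1 & \text{secondary schooling} \\ 0 & \text{otherwise} \end{cases}
\qquad
D_3 = \begin{cases} 1 & \text{post secondary schooling} \\ 0 & \text{otherwise} \end{cases}
$$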

We can now treat D1, D2 and D3 as three explanatory variables and include them in the regression model. However, it is important to avoid the so-called dummy variable trap. The trap arises when the analyst tries to specify and estimate a model, referred to here as (8.14), that contains an intercept together with all three dummy variables.

It is mathematically impossible to estimate the parameters in (8.14), since there is no variation in the sum of the three dummy variables: D1 + D2 + D3 = 1 for every observation in the data set. The sum is therefore perfectly collinear with the intercept. Since the model can contain only one constant, in this case the intercept, we cannot include all three dummy variables. The easiest solution is to exclude one of them and treat the excluded category as the reference category. Re-specifying the model without D1 gives equation (8.15).

That is, if D1 is excluded, the other categories will have D1 as their reference. B2 is therefore interpreted as the wage effect of going from a primary schooling diploma to a secondary schooling diploma, and B3 represents the wage effect of going from a primary schooling diploma to a post secondary schooling diploma. To determine the relative effects you may use the transformation described by (8.8).
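The dummy variable trap can be seen directly in the rank of the design matrix. The sketch below is illustrative only: the education codes are made up, and NumPy is assumed to be available. It shows that an intercept plus all three dummies gives linearly dependent columns, while dropping D1 restores full column rank.

```python
import numpy as np

# Hypothetical education codes: 1 = primary, 2 = secondary, 3 = post secondary
d = np.array([1, 2, 3, 1, 2, 3, 2, 1])

# One binary variable per educational level
D1 = (d == 1).astype(float)
D2 = (d == 2).astype(float)
D3 = (d == 3).astype(float)
const = np.ones_like(D1)

# Intercept plus all three dummies: D1 + D2 + D3 equals the constant column
X_trap = np.column_stack([const, D1, D2, D3])

# Dropping D1 makes primary schooling the reference category
X_ref = np.column_stack([const, D2, D3])

rank_trap = np.linalg.matrix_rank(X_trap)  # smaller than the number of columns
rank_ref = np.linalg.matrix_rank(X_ref)    # equal to the number of columns

print("trap:", rank_trap, "of", X_trap.shape[1], "columns")
print("reference category:", rank_ref, "of", X_ref.shape[1], "columns")
```

Because the first matrix does not have full column rank, the least squares normal equations have no unique solution, which is the impossibility described above.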

An alternative to excluding one of the categories is to exclude the constant term instead. That gives a model, equation (8.16), that contains all three dummy variables but no intercept.

The three dummy variables then work as three intercepts in this model, one for each educational level. The coefficients can therefore not be interpreted as relative changes in this case.
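Written out, the three specifications discussed in this section might look as follows. This is a sketch under stated assumptions rather than a reproduction of the original equations: the dependent variable is assumed to be the log hourly wage, ln Y (suggested by the reference to relative effects), the coefficient names B and C follow Example 8.5, and any further regressors are left out.

```latex
\begin{align}
\ln Y &= B_0 + B_1 D_1 + B_2 D_2 + B_3 D_3 + U \tag{8.14} \\
\ln Y &= B_0 + B_2 D_2 + B_3 D_3 + U \tag{8.15} \\
\ln Y &= C_1 D_1 + C_2 D_2 + C_3 D_3 + U \tag{8.16}
\end{align}
```

Only the first form runs into the dummy variable trap. The second uses primary schooling as the reference category, and the third replaces the single intercept with one intercept per educational level.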

Example 8.5

Estimate the parameters of (8.15) and (8.16) and compare and interpret the results.

The three dummy variables represent three educational levels, and x represents the age of the individual. The first thing to notice is that B0 = C1, B0 + B2 = C2 and B0 + B3 = C3, so the two specifications are closely related. Furthermore, C2 - C1 = B2 and C3 - C1 = B3. Using specification II, we can derive the effect of going from a high school diploma to a college diploma as the difference between C3 and C2, which turns out to equal 0.14, i.e. approximately a 14 percent increase. The same effect could also have been obtained as the difference between B3 and B2. For obvious reasons, the effect of the other variables included in the model (B4 in this example) does not change when switching between specification I and II.
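The relations between the two specifications can be checked numerically. The following sketch uses simulated data, so the wage process, the sample size and all numbers are assumptions made for illustration. Specification I contains an intercept, D2, D3 and age, while specification II contains D1, D2, D3 and age but no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Simulated sample (made-up numbers): education level 1, 2 or 3 and age
level = rng.integers(1, 4, size=n)
age = rng.uniform(20, 60, size=n)
D1, D2, D3 = [(level == k).astype(float) for k in (1, 2, 3)]

# Hypothetical data-generating process for the log hourly wage
ln_y = 4.0 + 0.20 * D2 + 0.35 * D3 + 0.01 * age + rng.normal(0, 0.1, n)

# Specification I: intercept, D2, D3, age (D1 is the reference category)
X_I = np.column_stack([np.ones(n), D2, D3, age])
# Specification II: D1, D2, D3, age and no intercept
X_II = np.column_stack([D1, D2, D3, age])

b, *_ = np.linalg.lstsq(X_I, ln_y, rcond=None)
c, *_ = np.linalg.lstsq(X_II, ln_y, rcond=None)
B0, B2, B3, B4 = b
C1, C2, C3, C4 = c

print("B0 vs C1:", B0, C1)
print("B0 + B2 vs C2:", B0 + B2, C2)
print("B0 + B3 vs C3:", B0 + B3, C3)
print("B3 - B2 vs C3 - C2:", B3 - B2, C3 - C2)
print("age coefficient:", B4, C4)
```

Both design matrices span the same column space, which is why the fitted values, and therefore these combinations of coefficients, coincide up to rounding error.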

# Piecewise linear regression

Dummy variables are also useful when modeling a nonlinear relationship that can be approximated by several linear segments, known as a piecewise linear relationship. Figure 8.1 shows an example of a piecewise linear relationship. A typical example is the income tax, which is often progressive: the more you earn, the larger the share of your income that is paid in tax.

Let us say that we are interested in describing how the income tax paid (Y) is related to gross household income (X), and that we specify the following model:

$$Y = A + BX + U \tag{8.19}$$

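In order to transform (8.19) into a piecewise linear regression we need to define two dummy variables that describe which linear section the household is located on. With X1 and X2 denoting the income levels at which the relationship changes, we define:

$$
D_1 = \begin{cases} 1 & \text{if } X_1 < X \le X_2 \\ 0 & \text{otherwise} \end{cases}
\qquad
D_2 = \begin{cases} 1 & \text{if } X > X_2 \\ 0 & \text{otherwise} \end{cases}
$$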

Figure 8.1 Piecewise linear regression

Next re-specify the intercept and the slope coefficient in (8.19) in the following way:

$$A = A_0 + A_1 D_1 + A_2 D_2 \tag{8.20}$$

$$B = B_0 + B_1 D_1 + B_2 D_2 \tag{8.21}$$

Substitute (8.20) and (8.21) into (8.19) to obtain:

$$Y = (A_0 + A_1 D_1 + A_2 D_2) + (B_0 + B_1 D_1 + B_2 D_2) X + U$$

Multiplying out the parentheses gives the following specification, which can be used in estimation:

$$Y = A_0 + A_1 D_1 + A_2 D_2 + B_0 X + B_1 (D_1 X) + B_2 (D_2 X) + U \tag{8.22}$$

The estimated relations in the three income ranges are therefore given by:

$$
\begin{aligned}
\text{When } X \le X_1: &\quad Y = a_0 + b_0 X \\
\text{When } X_1 < X \le X_2: &\quad Y = (a_0 + a_1) + (b_0 + b_1) X \\
\text{When } X > X_2: &\quad Y = (a_0 + a_2) + (b_0 + b_2) X
\end{aligned}
$$
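The estimation of (8.22) can be sketched with a small numerical example. The thresholds and the tax schedule below are hypothetical and chosen only for illustration. The data are generated without an error term so that the fitted coefficients can be compared directly with the values used to build the schedule.

```python
import numpy as np

# Hypothetical income thresholds X1 and X2 (illustrative numbers only)
x1, x2 = 100.0, 200.0
x = np.linspace(10, 300, 59)  # gross household income

# Dummy variables locating each household on a linear section
D1 = ((x > x1) & (x <= x2)).astype(float)
D2 = (x > x2).astype(float)

# Made-up tax schedule with marginal rates 10, 20 and 30 percent
y = np.where(x <= x1, 0.10 * x,
             np.where(x <= x2, -10.0 + 0.20 * x, -30.0 + 0.30 * x))

# Regressors of (8.22): 1, D1, D2, X, D1*X, D2*X
X = np.column_stack([np.ones_like(x), D1, D2, x, D1 * x, D2 * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, a1, a2, b0, b1, b2 = coef

print("lower range:  intercept", a0, "slope", b0)
print("middle range: intercept", a0 + a1, "slope", b0 + b1)
print("upper range:  intercept", a0 + a2, "slope", b0 + b2)
```

The slope in each range is recovered as b0, b0 + b1 and b0 + b2, which matches the three estimated relations listed above.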

# Test for structural differences

An important application of dummy variables is to test whether the coefficients of the model differ between subgroups of the population, or whether they have changed over time. For instance, assume that we have the following wage equation expressed in semi-logarithmic (log-linear) functional form:

$$\ln Y = B_0 + B_1 X_1 + B_2 X_2 + U \tag{8.23}$$

with Y being the wage rate, X1 the number of years of schooling, and X2 the number of years of working experience. We would like to know whether B1 and B2 differ between men (m) and women (w) simultaneously. That is, we would like to test the following hypothesis:

$$H_0: B_{1m} = B_{1w}, \; B_{2m} = B_{2w}$$

$$H_1: B_{1m} \neq B_{1w} \text{ and/or } B_{2m} \neq B_{2w}$$

In order to carry out this test using dummy variables we need to create an indicator variable D, let's say for men, and then form the following regression model:

$$\ln Y = B_0 + B_1 D + B_2 X_1 + B_3 (X_1 D) + B_4 X_2 + B_5 (X_2 D) + U \tag{8.24}$$

Equation (8.24) represents the unrestricted model, where men and women are allowed to have different coefficients, and equation (8.23) represents the restricted case where men and women have the same coefficients. We now compare the residual sums of squares (RSS) of the two models using the test statistic:

$$F = \frac{(RSS_R - RSS_U)/(df_R - df_U)}{RSS_U / df_U} \tag{8.25}$$

where the subscripts R and U denote the restricted and the unrestricted model, and df denotes the degrees of freedom (n - k) of the corresponding model.

If RSS_R is much larger than RSS_U, the test value will be large and we reject the null hypothesis in favor of the alternative hypothesis. If the two are similar in size, the test value will be small and we cannot reject the null hypothesis that the coefficients are the same for men and women.
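The comparison of the two models can be sketched as follows. The data are simulated, so the sample size, the coefficients and the error variance are assumptions made for illustration. The restricted design matrix follows (8.23) and the unrestricted one follows (8.24), which means the restricted model sets the coefficients on D, X1·D and X2·D to zero.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from a least squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

rng = np.random.default_rng(1)
n = 500

# Simulated sample (made-up numbers)
D = rng.integers(0, 2, size=n).astype(float)  # 1 = man, 0 = woman
X1 = rng.uniform(8, 18, size=n)               # years of schooling
X2 = rng.uniform(0, 30, size=n)               # years of working experience

# Hypothetical wage process in which the coefficients differ between groups
ln_y = (3.0 + 0.10 * D + 0.08 * X1 + 0.02 * X1 * D
        + 0.02 * X2 + 0.005 * X2 * D + rng.normal(0, 0.3, n))

ones = np.ones(n)
X_r = np.column_stack([ones, X1, X2])                      # restricted, (8.23)
X_u = np.column_stack([ones, D, X1, X1 * D, X2, X2 * D])   # unrestricted, (8.24)

rss_r, rss_u = rss(X_r, ln_y), rss(X_u, ln_y)
df_r, df_u = n - X_r.shape[1], n - X_u.shape[1]

# Test value: change in RSS per restriction relative to RSS per degree of freedom
F = ((rss_r - rss_u) / (df_r - df_u)) / (rss_u / df_u)
print("RSS restricted:", rss_r, "RSS unrestricted:", rss_u, "test value:", F)
```

Since the restricted model is nested in the unrestricted one, its RSS can never be smaller, so the test value is non-negative by construction.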

Example 8.6

Assume that we would like to know if the coefficients of equation (8.23) differ between men and women in the population. In order to test the joint hypothesis we need to run two regression models: one restricted model given by (8.23) and one unrestricted model given by (8.24). Using a sample of 1483 randomly selected individuals we obtained the following results:

| | Restricted model | Unrestricted model |
|---|---|---|
| Residual sum of squares (RSS) | 145.603 | 140.265 |
| Degrees of freedom (n - k) | 1483 - 3 = 1480 | 1483 - 6 = 1477 |

Using the results given in Table 8.1 we may calculate the test value using (8.25). The degrees of freedom for the numerator are calculated as the difference between the degrees of freedom of the restricted and the unrestricted model, that is, 1480 - 1477 = 3. Another way to think about the degrees of freedom for the numerator is in terms of the number of restrictions imposed by the restricted model compared to the unrestricted one. The unrestricted model has 6 parameters, while the restricted model has only 3, which means that three parameters have been set to zero in the restricted model. Therefore we have 3 restrictions.

The test value is therefore equal to:

$$F = \frac{(145.603 - 140.265)/3}{140.265/1477} \approx 18.74$$

The test value has to be compared with a critical value. Using a significance level of 5 percent, the critical value equals 2.6. The test value is much larger than the critical value, which means that we can reject the null hypothesis. We can therefore conclude that the coefficients of the regression model differ between men and women.
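The arithmetic of Example 8.6 can be reproduced in a few lines. The function name `f_statistic` is a hypothetical helper introduced for this sketch. The inputs are the values reported in the example, and the critical value of 2.6 is taken from the text rather than computed.

```python
def f_statistic(rss_r, rss_u, df_r, df_u):
    """Test value comparing a restricted and an unrestricted model via their RSS."""
    restrictions = df_r - df_u
    return ((rss_r - rss_u) / restrictions) / (rss_u / df_u)

# Values reported in Example 8.6
f_value = f_statistic(rss_r=145.603, rss_u=140.265, df_r=1480, df_u=1477)

critical_value = 2.6  # 5 percent level, as stated in the text
reject_null = f_value > critical_value

print(round(f_value, 2), reject_null)  # prints: 18.74 True
```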