Properties of the fitted OLS line

Its properties are the following: first, the line passes through the sample means of Y and X; second, the mean value of the estimated Y, $\bar{\hat{Y}}$, equals the mean value of the actual Y, $\bar{Y}$; third, the mean value of the residuals, $\bar{e}$, is zero; fourth, the residuals $e_i$ are uncorrelated with the predicted $\hat{Y}_i$, i.e. $\sum \hat{Y}_i e_i = 0$; fifth, the residuals $e_i$ are uncorrelated with $X_i$, i.e. $\sum e_i X_i = 0$.

Figure 14.4 Fitted Ordinary Least Squares (OLS) line
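The five properties listed above can be checked numerically. The following is a minimal sketch in plain Python; the data values and the helper names (`mean`, `ols_fit`, `checks`) are made up for illustration and do not come from the text.

```python
# Illustrative data (invented for this sketch, not taken from the text).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]


def mean(values):
    return sum(values) / len(values)


def ols_fit(X, Y):
    """Return (b1, b2): intercept and slope of the least-squares line."""
    x_bar, y_bar = mean(X), mean(Y)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
    s_xx = sum((x - x_bar) ** 2 for x in X)
    b2 = s_xy / s_xx
    b1 = y_bar - b2 * x_bar
    return b1, b2


b1, b2 = ols_fit(X, Y)
Y_hat = [b1 + b2 * x for x in X]          # fitted values
e = [y - yh for y, yh in zip(Y, Y_hat)]   # residuals

# Each quantity below should be zero up to floating-point rounding.
checks = {
    "line at X_bar minus Y_bar": b1 + b2 * mean(X) - mean(Y),
    "mean(Y_hat) - mean(Y)": mean(Y_hat) - mean(Y),
    "mean(e)": mean(e),
    "sum(Y_hat * e)": sum(yh * r for yh, r in zip(Y_hat, e)),
    "sum(X * e)": sum(x * r for x, r in zip(X, e)),
}
for name, value in checks.items():
    print(f"{name}: {value:.2e}")
```

Any data set with non-zero variation in X should give the same result, since the properties follow from the least-squares first-order conditions rather than from the particular numbers used.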

The problem of statistical inference

Suppose the sample regression line $\hat{Y}_i = b_1 + b_2 X_i$ is fitted as an estimate of the population line $Y_i = \beta_1 + \beta_2 X_i + \varepsilon_i$. It may be of interest to test whether $\beta_2$, say, takes a particular value, such as $\beta_2 = 1$. Hypothesis testing procedures are needed for this. In hypothesis testing, the following steps are carried out:

(1) Specify the Null ($H_0$): e.g. $\beta_2 = \beta_2^0$, and the Alternative ($H_1$): $\beta_2 \neq \beta_2^0$, or $\beta_2 > \beta_2^0$, or $\beta_2 < \beta_2^0$.

(2) Specify the level of significance $\alpha$ (and the level of confidence, $1 - \alpha$), usually 5% or 1%.

(3) Choose a test statistic, e.g. $z = \frac{b_2 - \beta_2^0}{se(b_2)} \sim N(0,1)$.

(4) Perform calculations to find $z^{obs}$.

(5) Decision (compare $z^{obs}$ with $z^{crit}$). With respect to the latter, for a z test statistic, $z = \frac{b_2 - \beta_2^0}{se(b_2)}$ (that is, when the population variance, $\sigma^2$, is known), testing $H_0: \beta_2 = \beta_2^0$, do not reject the Null:

  • for a two tail test ($H_1: \beta_2 \neq \beta_2^0$), when $-z^{\alpha/2} < z^{obs} < +z^{\alpha/2}$;
  • for a right tail test ($H_1: \beta_2 > \beta_2^0$), when $z^{obs} < +z^{\alpha}$;
  • for a left tail test ($H_1: \beta_2 < \beta_2^0$), when $z^{obs} > -z^{\alpha}$.
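The decision step can be sketched in code. The example below is an illustration only: the function name `z_test`, its parameters and the input numbers are hypothetical, and the critical values come from the standard normal distribution in Python's `statistics` module.

```python
from statistics import NormalDist


def z_test(b, beta0, se, alpha=0.05, tail="two"):
    """Return (z_obs, z_crit, reject) for H0: beta = beta0.

    tail = "two"   -> H1: beta != beta0
    tail = "right" -> H1: beta >  beta0
    tail = "left"  -> H1: beta <  beta0
    """
    z_obs = (b - beta0) / se
    std_normal = NormalDist()  # N(0, 1)
    if tail == "two":
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z_obs) >= z_crit
    elif tail == "right":
        z_crit = std_normal.inv_cdf(1 - alpha)
        reject = z_obs >= z_crit
    elif tail == "left":
        z_crit = std_normal.inv_cdf(1 - alpha)
        reject = z_obs <= -z_crit
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z_obs, z_crit, reject


# Hypothetical numbers: estimated slope 1.3, H0: beta_2 = 1, se(b_2) = 0.2.
z_obs, z_crit, reject = z_test(b=1.3, beta0=1.0, se=0.2, alpha=0.05, tail="two")
print(f"z_obs = {z_obs:.3f}, z_crit = {z_crit:.3f}, reject H0: {reject}")
```

With these made-up inputs the observed statistic lies inside the two-tail acceptance region at the 5% level, so the Null is not rejected.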


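In testing hypotheses about $\beta_1$, we know from the distribution of $b_1$, $b_1 \sim N\left(\beta_1, \sigma^2 \frac{\sum X_i^2}{n \sum x_i^2}\right)$ [1], and the z-transformation yields the test statistic $z = \frac{b_1 - \beta_1}{se(b_1)} \sim N(0,1)$, where $se(b_1) = \sqrt{\sigma^2 \frac{\sum X_i^2}{n \sum x_i^2}}$ and $x_i = X_i - \bar{X}$.

Similarly, since the distribution of $b_2$ is $b_2 \sim N\left(\beta_2, \frac{\sigma^2}{\sum x_i^2}\right)$, the appropriate test statistic is $z = \frac{b_2 - \beta_2}{se(b_2)} \sim N(0,1)$, where $se(b_2) = \sqrt{\frac{\sigma^2}{\sum x_i^2}}$.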
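As a small worked illustration of the standard errors, the sketch below evaluates the standard bivariate expressions $se(b_1) = \sqrt{\sigma^2 \sum X_i^2 / (n \sum x_i^2)}$ and $se(b_2) = \sqrt{\sigma^2 / \sum x_i^2}$, with $x_i = X_i - \bar{X}$. The X values and the "known" error variance are assumptions chosen purely for the example.

```python
import math

# Hypothetical regressor values and an assumed known error variance.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma2 = 4.0

n = len(X)
x_bar = sum(X) / n
sum_x_dev_sq = sum((x - x_bar) ** 2 for x in X)  # sum of squared deviations of X
sum_X_sq = sum(x * x for x in X)                 # sum of squared levels of X

se_b1 = math.sqrt(sigma2 * sum_X_sq / (n * sum_x_dev_sq))  # intercept
se_b2 = math.sqrt(sigma2 / sum_x_dev_sq)                   # slope

print(f"se(b1) = {se_b1:.4f}")
print(f"se(b2) = {se_b2:.4f}")
```

Note that both standard errors shrink as the variation in X (the sum of squared deviations) grows, which is one reason non-zero variation in the independent variable matters for estimation.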

The assumptions underlying the linear regression model are summarised in the following equation:

That is:

  • $X_i$ are exogenous or predetermined, or equivalently the $X_i$ are orthogonal to (independent of) the $\varepsilon_i$. That is, $Cov(X_i, \varepsilon_i) = E(X_i \varepsilon_i) = 0$.
  • Homoskedasticity. That is, the conditional variance of $Y_i$, or equivalently of the error term, is constant (and equal to $\sigma^2$). Mathematically, $E(\varepsilon_i^2) = \sigma^2$.
  • No autocorrelation, or independence of $Y_i$, or equivalently of $\varepsilon_i$, which is a result of random sampling. Mathematically, $E(\varepsilon_i \varepsilon_j) = 0$ for $i \neq j$.
  • Stability of the parameters of interest $\beta_1$ and $\beta_2$ (also of $\sigma^2$, i.e. homoskedasticity) over the period of estimation.
  • The number of observations must be at least equal to the number of estimated coefficients for estimation.
  • Non-zero variation in the independent variable is crucial to enable estimation of the coefficients and their standard errors.
  • Normality of $Y_i$ or equivalently of $\varepsilon_i$.

It is important that the previous assumptions are true if correct inferences are to be made (that is, to be able to place any confidence in our results) from an estimated regression model. Tests exist which enable one to check the validity of these assumptions.

Goodness of fit: $R^2$ – The coefficient of determination

Consider the regression model in mean deviation form: $y_i = b_2 x_i + e_i$. The sum of squared residuals $\sum e_i^2$ could provide a measure of fit (the smaller $\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$ is, the better is the fit; that is, the closer $\hat{Y}_i$ is to $Y_i$). However, this measure is affected by the scaling of the variables.

Figure 14.5 Fitted OLS line and errors

Note: Variation for the $i$-th observation, $Y_i$, is $Y_i - \bar{Y}$. However, when calculating total variation over all observations, i.e. $\sum (Y_i - \bar{Y})$, then $\sum (Y_i - \bar{Y}) = 0$. Thus, use $\sum (Y_i - \bar{Y})^2$ as a measure of variation. From this, for each observation: $y_i^2 = \hat{y}_i^2 + e_i^2 + 2 \hat{y}_i e_i$

Variation over all observations is: $\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2 + 2 \sum \hat{y}_i e_i$.

Since $\sum \hat{y}_i e_i = b_2 \sum x_i e_i = 0$, $\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2 = b_2^2 \sum x_i^2 + \sum e_i^2$

Total Sum of Squares = Explained Sum of Squares + Residual Sum of Squares, i.e., TSS = ESS + RSS

Alternatively: Goodness of fit may be thought of as the variation in $y_i$ around their mean value explained by the regression model (that is, by the variation in $x_i$). Ideally, we would like all the variation in $y_i$ ($= Y_i - \bar{Y}$) to be explained by the fitted values, $\hat{y}_i$; that is, for the actual $y_i$ to be on the line. Thus, for each observation: $y_i = \hat{y}_i + e_i$.

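Define the coefficient of determination $R^2$ as the proportion of the variation in Y explained by the regression line.

That is, $R^2$ is: $R^2 = \frac{ESS}{TSS} = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = \frac{b_2^2 \sum x_i^2}{\sum y_i^2} = 1 - \frac{\sum e_i^2}{\sum y_i^2}$

$R^2$ may also be calculated as: $R^2 = \frac{b_2 \sum x_i y_i}{\sum y_i^2} = \frac{(\sum x_i y_i)^2}{\sum x_i^2 \sum y_i^2}$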
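The decomposition TSS = ESS + RSS and the equivalent expressions for $R^2$ can be verified on a small data set. The sketch below uses invented numbers and hypothetical variable names; it works in mean deviation form, as in the derivation above.

```python
# Illustrative data (invented for this sketch, not taken from the text).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Y = [1.5, 3.9, 4.1, 7.2, 6.8, 9.9, 10.1]

n = len(X)
X_bar = sum(X) / n
Y_bar = sum(Y) / n

# Mean deviation form: x_i = X_i - X_bar, y_i = Y_i - Y_bar.
x = [Xi - X_bar for Xi in X]
y = [Yi - Y_bar for Yi in Y]

sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_xx = sum(xi * xi for xi in x)

b2 = sum_xy / sum_xx                       # OLS slope
y_hat = [b2 * xi for xi in x]              # fitted values (deviation form)
e = [yi - yh for yi, yh in zip(y, y_hat)]  # residuals

TSS = sum(yi * yi for yi in y)
ESS = sum(yh * yh for yh in y_hat)
RSS = sum(ei * ei for ei in e)

r2_from_ess = ESS / TSS
r2_from_rss = 1 - RSS / TSS
r2_from_cross_products = sum_xy ** 2 / (sum_xx * TSS)

print(f"TSS = {TSS:.4f}, ESS + RSS = {ESS + RSS:.4f}")
print(f"R^2 = {r2_from_ess:.4f} = {r2_from_rss:.4f} = {r2_from_cross_products:.4f}")
```

The three values of $R^2$ agree up to rounding because the cross term $\sum \hat{y}_i e_i$ vanishes for the least-squares line.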

Example: Performing a univariate regression analysis in Microsoft Excel is relatively simple: arrange the data for the dependent (Y) and independent (X) variables in two columns; then select the “Data” tab, then Data Analysis, then Regression, and specify Y and X. To illustrate the regression model with shipping market data, we regress monthly growth rates of Capesize five-year second-hand vessel prices (Y) on monthly growth rates of Capesize voyage earnings (X) over the period July 2009 to May 2020, yielding 130 monthly return observations for each time series. The choice of dependent and independent variables is typically based on economic reasoning. Specifically, as discussed in Chapters 1 and 2 of this book, vessel prices are expected to be linked with the income generated from the operation of the vessel (earnings). In this case, in line with the theory, the relationship between the monthly percentage changes of these variables, rather than their levels, is examined.

Table 14.20 presents the output of Microsoft Excel for this regression model, while Figure 14.6 shows the OLS fitted line. As observed in the table, the R-square ($R^2$) statistic is equal to 0.024947 (or 2.50%), which means that the fluctuations of the independent variable (the monthly returns of Capesize voyage earnings) were able to explain only 2.50% of the fluctuations observed in the dependent variable (the monthly returns of the Capesize five-year second-hand price). This is also evident in the figure, which shows a low fit of the regression line, in the sense that the data are widely scattered around the fitted line.

Table 14.20 Excel regression output (regression statistics: Multiple R, R Square, Adjusted R Square, Standard Error; coefficient table including the t Stat column)

Figure 14.6 OLS line fit for Capesize five-year vessel prices (Cape_5yr_Price) on Capesize voyage earnings (Cape_Spot)

The estimated coefficient of the constant term (the intercept with the vertical axis) equals −0.00569, while that of the slope equals 0.013369. The t-statistic of the slope is equal to 1.809685 and the associated p-value is equal to 0.072691. Therefore, at the 10% significance level ($\alpha$), the null hypothesis that the slope coefficient ($\beta_2$) is equal to zero is rejected, since 0.072691 < 0.10. In other words, the coefficient of the variable of interest (the slope) is statistically significant at the 10% significance level. Note that at the 5% significance level, which is typically used in empirical analyses, the $\beta_2$ coefficient is not statistically significant, as 0.072691 > 0.05. The coefficient of the intercept, however, is not statistically significant at any conventional significance level, as its p-value is 0.198231, far larger than the 0.10 and 0.05 significance levels.
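The p-value decision rule used in this paragraph can be written out explicitly. The sketch below takes the figures quoted in the text as inputs; the helper `is_significant` is a hypothetical name, and the implied standard error is backed out from the identity t = b / se(b) under the null of a zero coefficient rather than read from the Excel output.

```python
# Figures quoted in the text for the Capesize regression.
slope = 0.013369
t_slope = 1.809685
p_slope = 0.072691
p_intercept = 0.198231


def is_significant(p_value, alpha):
    """Reject H0: coefficient = 0 when the p-value is below alpha."""
    return p_value < alpha


for alpha in (0.10, 0.05, 0.01):
    print(
        f"alpha = {alpha:.2f}: "
        f"slope significant = {is_significant(p_slope, alpha)}, "
        f"intercept significant = {is_significant(p_intercept, alpha)}"
    )

# Under H0: beta_2 = 0 the t-statistic is b_2 / se(b_2), so se(b_2) = b_2 / t.
se_slope = slope / t_slope
print(f"implied se(slope) = {se_slope:.6f}")
```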

Extension of results to multivariate regression

In practice, a variable of interest is affected by more than one other variable. For example, the demand for electricity does not depend only on the price of electricity. It also depends on the price of gas (since gas is a substitute for electricity in consumption), the consumers’ income (for normal goods demand increases as income rises), and other factors such as the weather (demand is higher in cold weather), the time of the day (lunch time demand is higher than after midnight), etc.

In order to take account of such situations, the results derived in the preceding sections for the bivariate regression model need to be extended to a multivariate framework. In what follows, the emphasis is on understanding how to read and interpret the results of multivariate regressions, rather than on deriving the formulas for the coefficients and their standard errors, or on other related problems. This will help the reader understand and interpret regression results presented in published work. Multivariate regression results should be thought of in the form of an equation like:

where, Y is the dependent variable. It is the variable that we think can be explained in terms of all the Xs. The latter are the explanatory or independent variables. In the previous equation there are k- of them.

The bs are the estimated coefficients of the regression line; $b_0$ is the “constant term” (it is the expected average value of Y if all the Xs are zero). Each of the other bs is referred to by naming the explanatory variable it multiplies. Thus, $b_1$ is “the coefficient of $X_1$”, $b_2$ is “the coefficient of $X_2$” and so on. Each coefficient is like a slope. It measures the effect of the explanatory variable it multiplies on the dependent variable, other things being equal (ceteris paribus). The coefficients are thus the partial derivatives of Y with respect to each of the Xs. Estimating a regression line involves feeding in the data observations (values for Y and the Xs) and getting back values for all the bs. The formulas to calculate the values of the coefficients and their standard errors are much more complicated than in the simple bivariate (two variable) regression case. This is because these formulas recognise that the explanatory variables may be related to one another.
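To make the idea of "feeding in the data and getting back the bs" concrete, here is a minimal sketch with two explanatory variables. It assumes a fitted equation of the form Y = b0 + b1 X1 + b2 X2 and solves the least-squares normal equations by Gaussian elimination; the function names and the data are hypothetical, and this is not how Excel or any particular package is implemented.

```python
def solve_linear_system(A, c):
    """Solve A b = c by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [ci] for row, ci in zip(A, c)]  # augmented matrix [A | c]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    b = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * b[j] for j in range(i + 1, n))
        b[i] = (M[i][n] - s) / M[i][i]
    return b


def ols_multivariate(X_rows, Y):
    """X_rows: one row [X1, X2, ...] per observation. Returns [b0, b1, b2, ...]."""
    Z = [[1.0] + list(row) for row in X_rows]  # prepend the constant term
    k = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    ZtY = [sum(z[i] * y for z, y in zip(Z, Y)) for i in range(k)]
    return solve_linear_system(ZtZ, ZtY)


# Made-up data generated exactly from Y = 2 + 0.5*X1 - 1.5*X2 (no error term),
# so the estimated coefficients should recover 2, 0.5 and -1.5.
X_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
Y = [2 + 0.5 * x1 - 1.5 * x2 for x1, x2 in X_rows]

b0, b1, b2 = ols_multivariate(X_rows, Y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.4f}")
```

Here b1 is read as the change in Y for a one-unit change in X1 with X2 held fixed, which is the ceteris paribus interpretation described above.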

  • [1] Correct model specification of the conditional mean of Y (linearity as in the previous equation). As a consequence, $E(\varepsilon_i) = 0$.