# Measuring causal relationships between variables – simple and multiple regression analysis

Regression analysis is the most widely used statistical technique, with applications across a range of topics and fields. Essentially, regression analysis is an attempt to explain movements in a variable by reference to movements in one or more other variables. Thus, regression analysis aims to establish causality between two variables rather than simply examine whether two or more variables correlate (move together). Ideally, a regression analysis aiming to establish causality between two variables of interest will be based on economic theory, common sense and past experience, as these often help to determine the set of conditioning variables.

For example, Kavussanos and Marcoulis (1997) estimate a regression model to explain stock returns of US listed water transportation companies, Y, with the stock market and a set of micro factors, $X_1, X_2, \ldots, X_k$. If we only concentrate on one explanatory variable, say the stock market return X, then mathematically:

$$ Y = f(X) $$

Such a model incorporates only one feature of the relationship, in the sense that there are likely to be other variables, such as a set of microeconomic factors influencing stock returns (e.g. asset to book ratio, the size of the company, etc.). Y then is likely to be stochastic (random) rather than deterministic. That is, for each individual *i*,

$$ Y_i = f(X_i) + \varepsilon_i $$

where $\varepsilon$ is a random (stochastic) component which is zero on average; that is, it consists of many small factors (other than X), some positive, some negative, but on average 0. Thus, Y is a random variable with a conditional mean:

$$ E(Y_i \mid X_i) = f(X_i) $$

and assumes a conditional variance:

$$ V(Y_i \mid X_i) = \sigma^2 $$

That is, Y is a random variable with a conditional distribution:

$$ Y_i \mid X_i \sim \left( f(X_i),\; \sigma^2 \right) $$

According to this model then, Y is a random variable with a systematic part $f(X)$ and a non-systematic part $\varepsilon$. $f(X)$ and $\varepsilon$ are orthogonal (uncorrelated); mathematically, $E[f(X)\,\varepsilon] = 0$. The relationship written previously is also known as the Statistical Generating Mechanism (SGM); that is, the mathematical form of the function that purports to describe the true Data Generating Process (DGP). Having defined the SGM and having identified its systematic and non-systematic parts, consider more carefully the stochastic component of the equation:

$$ E(\varepsilon_i) = 0, \qquad V(\varepsilon_i) = V\!\left(Y_i - f(X_i)\right) = V(Y_i) + V\!\left(f(X_i)\right) = \sigma^2 $$

Note that $V(f(X)) = 0$, since $f(X)$ is assumed non-stochastic and it is known that a constant has zero variance. Thus, $\varepsilon$ is a stochastic random variable whose mean is 0 and whose variance is $\sigma^2$. In summary notation,

$$ \varepsilon_i \sim (0,\; \sigma^2) $$

which is a result of the correct specification of the SGM. If the $Y_i$ are generated independently from each other, $C(Y_i, Y_j) = 0$ and $C(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$. For estimation purposes the mathematical form of $f(X)$ must be determined. Assuming linearity, the SGM becomes:

$$ Y_i = \beta_1 + \beta_2 X_i + \varepsilon_i $$

The results shown previously remain invariant with respect to the latter assumption; that is, in the linear bivariate regression model:

$$ E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i, \qquad V(Y_i \mid X_i) = \sigma^2 $$

In short hand:

$$ Y_i \mid X_i \sim \left( \beta_1 + \beta_2 X_i,\; \sigma^2 \right) $$

As a result, for the non-systematic part of the model, $\varepsilon$:

$$ E(\varepsilon_i) = 0, \qquad V(\varepsilon_i) = \sigma^2 $$

In short hand: $\varepsilon_i \sim (0, \sigma^2)$ and, due to independence, $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.
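To make the linear SGM concrete, the following sketch simulates data from $Y_i = \beta_1 + \beta_2 X_i + \varepsilon_i$ and checks that the drawn errors average roughly zero with variance roughly $\sigma^2$. It assumes NumPy, and the parameter values ($\beta_1 = 2$, $\beta_2 = 0.5$, $\sigma = 1$) are purely illustrative, not taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (assumed) parameter values for the linear SGM
beta1, beta2, sigma = 2.0, 0.5, 1.0
n = 10_000

X = rng.uniform(0.0, 10.0, size=n)    # explanatory variable
eps = rng.normal(0.0, sigma, size=n)  # non-systematic part, eps ~ (0, sigma^2)
Y = beta1 + beta2 * X + eps           # systematic part plus stochastic error

print(eps.mean())  # close to 0 by construction
print(eps.var())   # close to sigma^2 = 1
```

In a large sample the error terms average out, which is exactly the sense in which $\varepsilon$ consists of many small factors that are zero on average.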

In order to estimate the coefficients of the regression line we need to have matching observations for Y and X. Usually all the observations of the population data are not available, so a representative sample is required. Take a random sample of n observations from the population, i.e. from the distribution of Y. $Y_1, \ldots, Y_n$ constitutes a random sample if the values are drawn independently from the same probability distribution. In short-hand notation: the $Y_i$ are i.i.d. (independently and identically distributed).

## Deriving the OLS (Ordinary Least Squares) estimators

Given the linear SGM and a random sample of n observations for Y and X (the sampling model), the aim of estimation is to find $b_1$ and $b_2$ as estimators of $\beta_1$ and $\beta_2$, respectively. That is, find the *"best"* line through the set of sample data. This may be done by minimising the following objective function, the sum of squared residuals. Therefore:

$$ S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 = \sum_{i=1}^{n} \left( Y_i - b_1 - b_2 X_i \right)^2 $$

That is, since the aim is to forecast Y in terms of X, for each value of X one wants to minimise the variation (errors) in Y from the straight line through the data. The points on the line are the fitted values, $\hat{Y}_i$, and therefore the errors of Y are measured in the vertical direction and the residuals (the sample equivalents of the error terms) are $e_i = Y_i - \hat{Y}_i$.

The following First Order Conditions (FOCs) apply:

$$ \frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} \left( Y_i - b_1 - b_2 X_i \right) = 0 $$

$$ \frac{\partial S}{\partial b_2} = -2 \sum_{i=1}^{n} X_i \left( Y_i - b_1 - b_2 X_i \right) = 0 $$

which, when solved simultaneously, give the following OLS estimators of the regression coefficients:

$$ b_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_1 = \bar{Y} - b_2 \bar{X} $$

These are the Ordinary Least Squares (OLS) estimators $b_1$, $b_2$ of $\beta_1$, $\beta_2$. Once they are estimated, the sample regression line, $\hat{Y}_i = b_1 + b_2 X_i$, may be fitted through the data as presented in Figure 14.4.
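The closed-form estimators can be computed directly. A minimal sketch (assuming NumPy, with made-up illustrative data) that applies the formulas and cross-checks the result against NumPy's own least-squares fit:

```python
import numpy as np

# Made-up sample data, for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.5, 3.1, 4.6, 5.0, 6.4, 6.9])

Xbar, Ybar = X.mean(), Y.mean()

# b2 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b2 = ((X - Xbar) * (Y - Ybar)).sum() / ((X - Xbar) ** 2).sum()
# b1 = Ybar - b2 * Xbar
b1 = Ybar - b2 * Xbar

Y_hat = b1 + b2 * X  # fitted sample regression line

# Cross-check against NumPy's built-in least-squares fit
slope, intercept = np.polyfit(X, Y, 1)
print(np.isclose(b2, slope), np.isclose(b1, intercept))
```

Note that the fitted line necessarily passes through the point of sample means $(\bar{X}, \bar{Y})$, which is exactly what the formula $b_1 = \bar{Y} - b_2 \bar{X}$ encodes.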