SIMPLE LINEAR REGRESSION AS A PROBABILISTIC MODEL
Regression aims to model the statistical dependence between i.i.d. observations in two (or more) dimensions, say, X = X1, X2, ..., Xn and Y = Y1, Y2, ..., Yn. Regression says that P(X, Y) is not equal to P(X)P(Y), but that we can understand the dependence of Y on X by making a model of how the expectation of Y depends on X. At its simplest, what linear regression says is E[Y|X] = b0 + b1X. Notice that this is a predictive model: for each value of X, we have a prediction about what we expect Y to be. If we assume that P(Y|X) is Gaussian, then the expectation is just equal to the mean (i.e., E[Y|X] = μ), and we can write the following likelihood function

L = P(Y|X) = ∏i N(Yi | b0 + b1Xi, σ)
where the product is over the i.i.d. observations, and I’ve written N() to represent the Gaussian distribution, where the mean for each observation of Y depends on the corresponding observation of X. As with every probabilistic model, the next step is to estimate the parameters, which here are b0, b1, and σ. Figure 7.1 illustrates the probabilistic interpretation of simple linear regression. In the case of simple linear regression, it is possible to differentiate (the log of) this objective function with respect to the parameters to obtain closed forms for the maximum likelihood estimators.
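The closed-form estimators can be computed directly. The sketch below uses simulated data (the true parameter values and sample size are arbitrary choices for illustration) and applies the standard closed forms: b1 is the sample covariance of X and Y divided by the sample variance of X, b0 follows from the sample means, and the MLE of σ is the root mean squared residual (dividing by n, not n − 2).

```python
import numpy as np

# Simulated data for illustration; true parameters b0=2.0, b1=0.5, sigma=1.0
# are arbitrary choices, not from the text.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1.0, n)

# Closed-form maximum likelihood estimators:
#   b1 = cov(X, Y) / var(X)
#   b0 = mean(Y) - b1 * mean(X)
#   sigma = sqrt(mean squared residual)  (the MLE divides by n)
b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0_hat = Y.mean() - b1_hat * X.mean()
residuals = Y - (b0_hat + b1_hat * X)
sigma_hat = np.sqrt(np.mean(residuals ** 2))
```

With enough data, the three estimates land close to the true values used in the simulation.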
FIGURE 7.1 The probabilistic interpretation of simple linear regression. Parameters of the model are b0, b1, and σ. Circles represent observations of X and Y. Only the response (Y) variable is assumed to have “noise” in this model.
Before we proceed to find formulas for the maximum likelihood estimators (MLEs), however, I think it’s useful to consider this likelihood function a bit further. First, compare it to the case where we simply model Y using a single Gaussian: that turns out to be the case of b1 = 0. This is the first attractive feature of simple linear regression: when there is no relationship between X and Y (at least no relationship that can be captured by the linear model), linear regression reduces to a simple Gaussian model for the variable whose values we are trying to predict.
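This reduction is easy to see numerically. In the sketch below (the data are simulated, with Y drawn independently of X), the fitted slope comes out near zero and the fitted intercept comes out near the sample mean of Y, so the regression collapses to the single-Gaussian model for Y:

```python
import numpy as np

# Simulated data where X carries no information about Y (illustrative only).
rng = np.random.default_rng(1)
Y = rng.normal(5.0, 2.0, 500)   # Y is just Gaussian noise around 5.0
X = rng.uniform(0, 1, 500)      # X is unrelated to Y

# Closed-form MLEs for the regression parameters.
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

# With no true relationship, b1 is near 0 and b0 is near mean(Y):
# the model is (approximately) the simple Gaussian model for Y.
print(b1, b0, Y.mean())
```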
Another important point is to consider the presence of randomness implied by the model: when we say that the expected value of Y depends on X through a simple formula, we are saying that although we expect Y to be connected to X, we accept that our observations of Y can have some randomness associated with them (which in the model is assumed to be described by the Gaussian distribution). However, there is no place in this likelihood where we include the possibility of randomness in X: we are assuming that X is, in some sense, “perfectly” measured, and all of the randomness is in our observations of Y. This implies that linear regression is not symmetric: if we write the model L = P(Y|X, θ), we will get a different answer than if we write L = P(X|Y, θ).
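The asymmetry can be demonstrated directly by fitting both directions on the same (simulated, illustrative) data. If regression were symmetric, the slope from regressing X on Y would be the reciprocal of the slope from regressing Y on X; in fact, the product of the two slopes equals the squared correlation coefficient, which is less than 1 whenever there is any noise:

```python
import numpy as np

# Simulated noisy linear data (parameter choices are illustrative).
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, 300)
Y = 1.0 + 0.8 * X + rng.normal(0, 2.0, 300)

def ols_slope(a, b):
    """MLE slope for regressing b on a."""
    return np.sum((a - a.mean()) * (b - b.mean())) / np.sum((a - a.mean()) ** 2)

slope_y_on_x = ols_slope(X, Y)   # from L = P(Y|X, theta)
slope_x_on_y = ols_slope(Y, X)   # from L = P(X|Y, theta)

# The product of the two slopes is r^2, the squared correlation,
# so the two directions give different lines whenever r^2 < 1.
print(slope_y_on_x * slope_x_on_y)
```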