LEAST SQUARES INTERPRETATION OF LINEAR REGRESSION
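Although I have presented linear regression in the context of a probabilistic model, where the expectation of the Gaussian distribution was assumed to depend on X, notice that the log-likelihood can also be written as
$$\log L = -n \log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$
Here n is the number of observations and σ is the standard deviation of the Gaussian. Maximizing the likelihood function with respect to b1 and b0 is equivalent to minimizing $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$, because the first term does not depend on b0 or b1 and the sum enters with a negative sign. The term that is being squared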
corresponds to the difference between the observed Y and the expected or predicted Y (which is just b0 + b1X). This difference is also called the “residual” for that observation: the part of Y that was not successfully explained by X. Maximizing the likelihood is therefore equivalent to minimizing the sum of squared residuals or the SSR, where
$$\mathrm{SSR} = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$
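As a minimal sketch of the idea above, the following Python snippet computes the residuals and the SSR for a candidate line, and then checks that the textbook closed-form least squares estimates give a smaller SSR than nearby parameter values. The data values and the helper name `ssr` are hypothetical, made up purely for illustration.

```python
import numpy as np

def ssr(b0, b1, x, y):
    """Sum of squared residuals for the line b0 + b1 * x (hypothetical helper)."""
    residuals = y - (b0 + b1 * x)  # observed Y minus predicted Y
    return float(np.sum(residuals ** 2))

# Hypothetical toy data, invented for this example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates for simple linear regression.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

best = ssr(b0_hat, b1_hat, x, y)

# Nudging either parameter away from the estimates increases the SSR.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
worse = [ssr(b0_hat + d0, b1_hat + d1, x, y) for d0, d1 in nudges]
print(best, worse)
```

The four nudges are only a spot check, not a proof; the SSR is a quadratic function of b0 and b1, which is why a single minimum exists.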
Therefore, even if you don’t believe that your data are Gaussian at all, you can still go ahead and fit a standard regression model to find the “ordinary least squares” (OLS) estimators, which minimize the sum of squared differences between the observations and the predictions. In the case of simple linear regression, the OLS estimators turn out to be the same as the ML estimators if the data are assumed to be Gaussian.
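To illustrate the correspondence between OLS and Gaussian maximum likelihood, here is a small sketch that fits a line by least squares to simulated data with deliberately non-Gaussian (uniform) noise, and then checks on a grid that the same parameters also maximize the Gaussian log-likelihood. The simulated data, the fixed value of sigma, the grid spacing, and the function name `gaussian_log_likelihood` are all assumptions made for this example.

```python
import numpy as np

def gaussian_log_likelihood(b0, b1, sigma, x, y):
    """Gaussian log-likelihood of a simple linear regression (illustrative helper)."""
    n = len(y)
    ssr = np.sum((y - b0 - b1 * x) ** 2)
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - ssr / (2 * sigma ** 2)

# Simulated data (an assumption for this sketch): the noise is uniform, not Gaussian.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.uniform(-3, 3, size=200)

# OLS fit; np.polyfit returns the slope first, then the intercept.
b1_ols, b0_ols = np.polyfit(x, y, 1)

# Evaluate the log-likelihood on a grid centered on the OLS estimates.
# sigma is held fixed at an arbitrary value; it does not affect where the maximum is.
offsets = np.linspace(-0.5, 0.5, 21)  # index 10 is the zero offset
ll = np.array([[gaussian_log_likelihood(b0_ols + d0, b1_ols + d1, 1.0, x, y)
                for d1 in offsets] for d0 in offsets])
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print(i, j)  # grid position of the highest log-likelihood
```

The grid search is used here only to keep the example dependency-free; a numerical optimizer would serve the same purpose.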
To evaluate the fit of the regression model in this context, we might be interested in asking: how much of the variability in Y was successfully explained by X? To make this a fair comparison, we should standardize the SSR by the total amount of variability that there was in Y to begin with: if Y never changed at all, it’s not fair to expect X to explain much. The standard way to do this is to compare the SSR to the variance of Y using the r²
$$r^2 = 1 - \frac{\mathrm{SSR}}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
where $\bar{Y}$ is the mean of Y, so that the denominator is the total sum of squared deviations of Y from its mean (proportional to the variance of Y).
If the predictions of Y based on X were perfect, the residuals would all be zero, and the r² (also known as R²) would be 1. The r² can be thought of as 1 minus the fraction of the variance of Y that is still present in the residuals. At the other extreme, under the “null hypothesis” of b1 = 0, X doesn’t predict anything about Y, and the regression is equivalent to estimating a simple Gaussian model for Y: the residuals retain all of the variance of Y, and the r² is 0. Said another way, the r² measures the fraction of the variance of Y that can actually be explained by X, or the difference between Y|X and Y on its own. Importantly, this interpretation is valid regardless of the underlying distributions of X and Y. Conveniently, for simple linear regression the r² is also very simple to calculate: just take the correlation between X and Y and square it! So even if your data are not Gaussian, you can still say something meaningful about how much of the variance of Y is explained by its relationship to X.
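The shortcut above can be checked numerically. The sketch below computes r² in two ways for simulated, clearly non-Gaussian data: once as 1 minus the ratio of the SSR to the total sum of squared deviations of Y from its mean, and once as the squared Pearson correlation between X and Y. The simulated data and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Simulated non-Gaussian data (an assumption for this sketch):
# both X and the noise are drawn from exponential distributions.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)
y = 0.5 * x + rng.exponential(scale=1.0, size=500)

# Ordinary least squares fit; np.polyfit returns the slope first, then the intercept.
b1, b0 = np.polyfit(x, y, 1)

# r-squared from the residuals: 1 - SSR / total sum of squares of Y.
ssr = np.sum((y - b0 - b1 * x) ** 2)
total = np.sum((y - y.mean()) ** 2)
r2_from_ssr = 1 - ssr / total

# r-squared as the squared correlation between X and Y.
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2_from_ssr, r2_from_corr)  # the two values agree up to floating-point error
```

The agreement holds for simple linear regression with an intercept fitted by least squares; it is not claimed here for other models.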