DERIVING THE MLEs FOR LINEAR REGRESSION
Let’s now go ahead and derive the MLEs for some of the parameters in the simple linear regression model. As usual, we first take the log of the likelihood. We have

$$\log L(b_0, b_1, \sigma^2) = \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y_i - b_0 - b_1 X_i)^2}{2\sigma^2}}\right] = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - b_0 - b_1 X_i\right)^2$$
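To make this formula concrete, here is a minimal numerical check, assuming NumPy and SciPy are available; the simulated data and parameter values are purely illustrative. The sum of the per-observation Gaussian log densities should match the closed form above.

```python
import numpy as np
from scipy import stats

# Simulated data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=20)
Y = 2.0 + 0.5 * X + rng.normal(scale=0.3, size=20)

b0, b1, sigma2 = 2.0, 0.5, 0.09  # arbitrary parameter values

# Sum of per-observation Gaussian log densities...
loglik_direct = stats.norm.logpdf(Y, loc=b0 + b1 * X, scale=np.sqrt(sigma2)).sum()

# ...should equal the closed form of the log likelihood.
loglik_closed = (-0.5 * len(Y) * np.log(2 * np.pi * sigma2)
                 - ((Y - b0 - b1 * X) ** 2).sum() / (2 * sigma2))

assert np.isclose(loglik_direct, loglik_closed)
```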
We next take the derivatives with respect to each of the parameters and set these derivatives to zero. For example,

$$\frac{\partial \log L}{\partial b_0} = 0 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} 2\left(Y_i - b_0 - b_1 X_i\right)(-1) = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(Y_i - b_0 - b_1 X_i\right) = 0$$
where the first term is zero because it does not depend on $b_0$, and the derivative of the term inside the square is just $-1$. Notice that since $\sigma^2$ doesn’t depend on $i$, we can take it out of the sum and multiply both sides of the equation by $\sigma^2$. Finally, the sum of $b_0$ from $i = 1$ to $n$ is simply $n b_0$, which leaves

$$\sum_{i=1}^{n} Y_i - n b_0 - b_1 \sum_{i=1}^{n} X_i = 0$$
We can now solve for $b_0$ to get

$$b_0 = \frac{1}{n}\sum_{i=1}^{n} Y_i - b_1\,\frac{1}{n}\sum_{i=1}^{n} X_i$$
where I’ve done a bit of rearranging for clarity. Consider this equation in the context of the “null hypothesis” for linear regression, namely that $b_1 = 0$. This equation says that under the null hypothesis, $b_0$ is just the average of $Y$, which, as we have seen in Chapter 4, turns out to be the MLE for the $\mu$ parameter of the Gaussian distribution. This makes sense based on what I already said about the regression model.
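As a quick sanity check of this special case, here is a sketch assuming SciPy; the data are simulated purely for illustration. With $b_1$ fixed at zero, the $b_0$ that maximizes the likelihood (equivalently, minimizes the squared error) is just the sample mean of $Y$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated observations, purely for illustration.
rng = np.random.default_rng(1)
Y = rng.normal(loc=3.0, scale=1.0, size=50)

# With b1 = 0, maximizing the likelihood is the same as minimizing
# the sum of squared residuals as a function of b0 alone...
result = minimize_scalar(lambda b0: ((Y - b0) ** 2).sum())

# ...which recovers the sample mean, the Gaussian MLE for mu.
assert np.isclose(result.x, Y.mean())
```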
Notice that as often happens, the MLE for $b_0$ depends on $b_1$, so that to maximize the likelihood, we will have to simultaneously solve the equation

$$\frac{\partial \log L}{\partial b_1} = 0 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} 2\left(Y_i - b_0 - b_1 X_i\right)(-X_i) = 0$$
where the first term is zero because it does not depend on $b_1$, and the derivative of the term inside the square is $-X_i$. Once again, we can take $\sigma^2$ out of the sum and multiply both sides of the equation by $\sigma^2$, leaving

$$\sum_{i=1}^{n} X_i\left(Y_i - b_0 - b_1 X_i\right) = 0$$
We can solve for $b_1$:

$$b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - b_0 \sum_{i=1}^{n} X_i}{\sum_{i=1}^{n} X_i^2}$$
Luckily, the equation for $b_1$ only depends on $b_0$, so that we have two equations and two unknowns, and we can solve for both $b_0$ and $b_1$. To do so, let’s plug the equation for $b_0$ into the equation for $b_1$.
After some algebra, I have

$$\sum_{i=1}^{n} X_i Y_i - \left(\frac{1}{n}\sum_{i=1}^{n} Y_i - b_1\,\frac{1}{n}\sum_{i=1}^{n} X_i\right)\sum_{i=1}^{n} X_i - b_1 \sum_{i=1}^{n} X_i^2 = 0$$
which can be solved to give

$$b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{1}{n}\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}$$

where, to avoid writing out all the sums, it’s customary to divide the top and bottom by $n$, and then put in $m$’s to represent the various types of averages ($m_{XY} = \frac{1}{n}\sum_{i=1}^{n} X_i Y_i$, $m_{X^2} = \frac{1}{n}\sum_{i=1}^{n} X_i^2$, with $m_X$ and $m_Y$ the ordinary sample means):

$$b_1 = \frac{m_{XY} - m_X m_Y}{m_{X^2} - m_X^2}$$

This also works out to

$$b_1 = r(X, Y)\,\frac{s_Y}{s_X}$$
where
- $s_X$ and $s_Y$ are the standard deviations of each list of observations
- $r$ is the “Pearson’s correlation,” $r(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - m_X\right)\left(Y_i - m_Y\right)\big/\left(s_X s_Y\right)$
This equation for $b_1$ is interesting because it shows that if there is no correlation between $X$ and $Y$, the slope ($b_1$) must be zero. It also shows that if the standard deviations of $X$ and $Y$ are the same ($s_Y/s_X = 1$), then the slope is simply equal to the correlation.
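Here is a small numerical check of this identity, assuming NumPy; the simulated data are purely illustrative. The slope computed as $r\,s_Y/s_X$ agrees with the least-squares slope fit directly to the data.

```python
import numpy as np

# Simulated data, purely for illustration.
rng = np.random.default_rng(2)
X = rng.normal(size=100)
Y = 1.0 - 2.0 * X + rng.normal(scale=0.5, size=100)

# Slope via the correlation identity b1 = r * sY / sX
# (np.std defaults to the 1/n "population" normalization used by the MLE).
r = np.corrcoef(X, Y)[0, 1]
b1_via_r = r * Y.std() / X.std()

# Slope from a direct least-squares fit...
b1_ls, b0_ls = np.polyfit(X, Y, 1)

# ...matches the correlation form.
assert np.isclose(b1_via_r, b1_ls)
```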
We can plug this back into the equation for $b_0$ to get the second MLE:

$$b_0 = m_Y - r(X, Y)\,\frac{s_Y}{s_X}\,m_X$$
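To close the loop, here is a symbolic sketch of the whole derivation, assuming SymPy and a tiny made-up data set: setting both partial derivatives to zero and solving the two score equations simultaneously recovers exactly the relationships derived above.

```python
import sympy as sp

b0, b1 = sp.symbols('b0 b1')

# A tiny made-up data set, purely for illustration.
X = [1, 2, 3, 4]
Y = [2, 3, 5, 4]
n = len(X)

# Minimizing the sum of squared residuals is equivalent to maximizing
# the likelihood (sigma^2 has already been multiplied through).
sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(X, Y))
sol = sp.solve([sp.diff(sse, b0), sp.diff(sse, b1)], [b0, b1])

mX = sp.Rational(sum(X), n)
mY = sp.Rational(sum(Y), n)
cov = sp.Rational(sum(x * y for x, y in zip(X, Y)), n) - mX * mY
varX = sp.Rational(sum(x ** 2 for x in X), n) - mX ** 2

# b0 = mY - b1 * mX, and b1 = cov(X, Y) / var(X), i.e. r * sY / sX.
assert sol[b0] == mY - sol[b1] * mX
assert sol[b1] == cov / varX
```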