DERIVING THE DISTRIBUTION OF THE MLE FOR b1
It's a useful exercise to do the calculations to obtain the variance of the MLE for b1. As I mentioned in Chapter 4, we need the matrix of the second derivatives of the likelihood, evaluated at the maximum of the likelihood. The matrix of second derivatives turns out to be
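\[
\begin{pmatrix}
-\frac{n}{\sigma^2} & -\frac{\sum_i X_i}{\sigma^2} & 0 \\
-\frac{\sum_i X_i}{\sigma^2} & -\frac{\sum_i X_i^2}{\sigma^2} & 0 \\
0 & 0 & -\frac{2n}{\sigma^2}
\end{pmatrix}
\]

Here I'm writing the log-likelihood of the regression model as logL = −n log(σ√2π) − Σ(Yi − b0 − b1Xi)²/(2σ²), ordering the parameters as b0, b1, σ, and evaluating everything at the maximum of the likelihood; each entry is derived below.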
To show you how I got all those second derivatives, we will need the derivatives of the likelihood with respect to σ and the MLE for this parameter.
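Differentiating the log-likelihood above with respect to σ gives

\[
\frac{\partial \log L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\sum_i (Y_i - b_0 - b_1 X_i)^2}{\sigma^3}
\]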
Setting this to zero and solving will give us the MLE for the "noise" parameter in the linear regression:
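\[
\hat{\sigma}^2 = \frac{1}{n}\sum_i (Y_i - b_0 - b_1 X_i)^2
\]

In other words, at the maximum the sum of squared residuals divided by σ² is exactly n, a fact we will use again below.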
To get the mixed partial derivatives, the order doesn't matter, so we can do either derivative first (the matrix will be symmetric). Using the formula for the derivative of the likelihood with respect to b0 from above, we can take the second derivative with respect to σ as follows:
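\[
\frac{\partial^2 \log L}{\partial \sigma\,\partial b_0}
= \frac{\partial}{\partial \sigma}\left(\frac{\sum_i (Y_i - b_0 - b_1 X_i)}{\sigma^2}\right)
= -\frac{2\sum_i (Y_i - b_0 - b_1 X_i)}{\sigma^3}
= -\frac{2}{\sigma}\,\frac{\partial \log L}{\partial b_0}
= 0
\]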
In the last two steps, I did something very clever: I wrote the second derivative in terms of the first derivative that we calculated. Because the distribution of the MLEs is related to the matrix of second derivatives evaluated at the maximum of the likelihood, we know that the first partial derivative must be zero (that's how we know we're at the maximum, and that's how we derived the MLEs). You can see that the same story will be true for the mixed partial derivative with respect to σ and b1 as well.
Not all the second derivatives turn out to be zero. For example, here's the second derivative of the likelihood with respect to b1:
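\[
\frac{\partial^2 \log L}{\partial b_1^2}
= \frac{\partial}{\partial b_1}\left(\frac{\sum_i X_i (Y_i - b_0 - b_1 X_i)}{\sigma^2}\right)
= -\frac{\sum_i X_i^2}{\sigma^2}
\]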
where again I used the formula we already had for the first derivative. A little trickier is the second derivative with respect to the noise parameter:
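\[
\frac{\partial^2 \log L}{\partial \sigma^2}
= \frac{\partial}{\partial \sigma}\left(-\frac{n}{\sigma} + \frac{\sum_i (Y_i - b_0 - b_1 X_i)^2}{\sigma^3}\right)
= \frac{n}{\sigma^2} - \frac{3\sum_i (Y_i - b_0 - b_1 X_i)^2}{\sigma^4}
\]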
Although this looks bad, remember that at the maximum, ∂logL/∂σ = 0, and we already found that this means that
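\[
\hat{\sigma}^2 = \frac{1}{n}\sum_i (Y_i - b_0 - b_1 X_i)^2
\]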
Therefore, Σ(Yi − b0 − b1Xi)²/σ² must actually be just n. We have
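\[
\frac{\partial^2 \log L}{\partial \sigma^2} = \frac{n}{\sigma^2} - \frac{3n}{\sigma^2} = -\frac{2n}{\sigma^2}
\]

evaluated at the maximum of the likelihood.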
Putting all second derivatives together, we get the matrix shown earlier. But we still need the inverse of this matrix, which depends on the determinant. The determinant of a 3 × 3 matrix is a messy thing, but because of all the zeros, it's reasonable to calculate it in this case.
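Expanding along the last row, the two zeros kill all but one term:

\[
\det = -\frac{2n}{\sigma^2}\left(\frac{n\sum_i X_i^2 - \left(\sum_i X_i\right)^2}{\sigma^4}\right)
= -\frac{2n\left(n\sum_i X_i^2 - \left(\sum_i X_i\right)^2\right)}{\sigma^6}
\]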
Dividing the cofactors by this determinant shows where the various terms in the inverse come from. The inverse can be simplified by factoring out 2n and rewriting in terms of sX, the variance of the data X.
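Using nΣXi² − (ΣXi)² = n²sX, and writing X̄ for the mean of the X data, the inverse works out to

\[
-\frac{\sigma^2}{n\,s_X}
\begin{pmatrix}
\frac{1}{n}\sum_i X_i^2 & -\bar{X} & 0 \\
-\bar{X} & 1 & 0 \\
0 & 0 & \frac{s_X}{2}
\end{pmatrix}
\]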
Finally, the variance of the MLE for b1 is the expectation of the negative of the middle entry of this inverse matrix, which works out to σ²/(nsX). Taking the expectation has no effect in this case because there are no random variables left in our formula (the expectation of the variance is just the variance, and the expectation of our estimator is just the estimator). So the distribution of the MLE is a Gaussian with the mean parameter equal to b1 and the standard deviation equal to σ/√(nsX). Notice that the standard deviation of the MLE decreases proportional to the square root of the number of datapoints n, and that the variance depends on the MLE for the noise parameter: the further the data are from the regression line, the larger the variance of our estimate of the slope. Another interesting point here is that the off-diagonal terms of the matrix in the rows for b0 and b1 are not zero. This means that the joint distribution of b0 and b1 is a multivariate Gaussian with nondiagonal covariance. Although in our model the parameters are independent, our estimates of the parameters are not.
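To see these formulas in action, here is a small simulation sketch (not from the text; it uses numpy and made-up parameter values) that compares the observed spread of the least-squares slope across simulated datasets to σ/√(nsX):

import numpy as np

rng = np.random.default_rng(0)
n, b0, b1, sigma = 100, 1.0, 2.0, 0.5

# Fixed X data; sX is the variance of the data X, as in the text
X = rng.normal(0.0, 1.5, size=n)
sX = X.var()

# Simulate many datasets from the regression model and refit the slope each time
slopes = []
for _ in range(5000):
    Y = b0 + b1 * X + rng.normal(0.0, sigma, size=n)
    slopes.append(np.cov(X, Y, bias=True)[0, 1] / sX)  # least-squares (MLE) slope

print("empirical SD of the slope estimates:", np.std(slopes))
print("theoretical sigma / sqrt(n * sX):  ", sigma / np.sqrt(n * sX))

The two printed numbers should agree closely, and rerunning with larger n shows the square-root shrinkage described above.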
Even more powerful than using the null hypothesis b1 = 0 is using the related null hypothesis r = 0. Although the formula for the correlation is a bit complicated, one important thing to notice is that it is symmetrical in X and Y. So although regression in general is not symmetric in X and Y, the symmetrical part of the relationship between X and Y is captured by the correlation. This is one of the reasons that the correlation is a very useful distance measure between two vectors (as discussed in Chapter 5). Perhaps even better is that the distribution of the correlation is known (approximately) under the null hypothesis of no association between X and Y. Assuming that X and Y are truly Gaussian but independent, the statistic
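\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
\]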
where n is the number of observations, has a t-distribution with n - 2 degrees of freedom. This means that, given two lists of numbers, X and Y, you can go ahead and test whether there is an association between them without having to assume anything about which one causes the other.
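As an illustration (a sketch, not from the text, using numpy and scipy), the whole test takes only a few lines:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=n)
Y = 0.3 * X + rng.normal(size=n)            # a modest association between X and Y

r = np.corrcoef(X, Y)[0, 1]                 # Pearson correlation
t = r * np.sqrt((n - 2) / (1 - r ** 2))     # the statistic above
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-sided P-value, t-distribution with n - 2 df

print(r, t, p)
# scipy.stats.pearsonr(X, Y) gives the correlation and an equivalent P-value in one call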
It’s important to remember that the P-value approximation for the correlation and the asymptotic normality of the estimate of b do assume that the data are at least approximately Gaussian. However, in the context of hypothesis testing, there is a beautiful way around this: replace the actual values of X and Y with their ranks and compute the Pearson correlation on the ranks. It turns out that the distribution of this correlation (known as the “Spearman correlation”) under the null hypothesis is also known! The Spearman correlation is widely used as a nonparametric test for association without any assumptions about the distribution of the underlying data.
As with the rank-based tests we discussed in Chapter 2, dealing with tied ranks turns out to make the formulas for the Spearman correlation a bit complicated in practice. Nevertheless, this test is implemented correctly in any respectable statistics software.
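For instance, scipy's spearmanr handles the tie corrections; here is a short sketch (not from the text) comparing it with computing the Pearson correlation of the ranks directly:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
X = rng.exponential(size=40)            # decidedly non-Gaussian data
Y = X + rng.exponential(size=40)

rho, p = stats.spearmanr(X, Y)          # Spearman correlation and its P-value

# The same correlation, computed "by hand" as the Pearson correlation of the ranks
# (stats.rankdata gives tied observations their average rank)
rho_by_hand = np.corrcoef(stats.rankdata(X), stats.rankdata(Y))[0, 1]

print(rho, rho_by_hand, p)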