THE DISTRIBUTION OF PARAMETER ESTIMATES FOR MLEs

Assuming that you have managed to maximize the likelihood of your model using either analytic or numerical approaches, it is sometimes possible to take advantage of the very well-developed statistical theory in this area to do hypothesis testing on the parameters. The maximum likelihood estimator is a function of the data, so it will not give the same answer if another random sample is taken from the distribution. However, it is known that (under certain assumptions) the MLEs will be Gaussian distributed, with means equal to the true values of the parameters, and variances related to the second derivatives of the likelihood at the maximum, which are summarized in the so-called Fisher Information matrix (which I abbreviate as FI).

This formula says that the variance of the parameter estimates is the (1) expectation of the negative of (2) inverse of the (3) Fisher information matrix evaluated at the maximum of the likelihood (so that all parameters have been set equal to their MLEs). I've written the numbers to indicate that getting the variance of the parameter estimates is actually a tedious three-step process, and it's rarely used in practice for that reason. However, if you have a simple model, and don't mind a little math, it can be incredibly useful to have these variances. For example, in the case of the Gaussian distribution, there are two parameters (μ and σ), so the Fisher information matrix is a 2 × 2 matrix.
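Written out in symbols, the recipe described above amounts to the following (using FI for the matrix of second derivatives, as in the text):

```latex
\operatorname{Var}(\hat\theta_j) \;=\; E\!\left[-\left(FI^{-1}\right)_{jj}\right],
\qquad
FI_{jk} \;=\; \left.\frac{\partial^2 \log L}{\partial\theta_j\,\partial\theta_k}\right|_{\theta=\hat\theta_{\mathrm{MLE}}}
```

Here j indexes the parameter of interest, so steps (3), (2), and (1) are: evaluate the second derivatives at the MLEs, invert the matrix, and take the expectation of the negative of the relevant diagonal entry.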

The first step in getting the variance of your estimator is evaluating these derivatives. In most cases, this must be done numerically, but in textbook examples they can be evaluated analytically. For the Gaussian model, at the maximum of the likelihood they have the simple formulas that I've given here.
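To make step one concrete, here is a small sketch (the sample size, true parameter values, and random seed are illustrative choices, not from the text) that evaluates the second derivatives of the Gaussian log-likelihood numerically by finite differences and compares them to the standard analytic values at the maximum, −n/σ̂² and −2n/σ̂²:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

# MLEs for the Gaussian: the sample mean and the (1/n-normalized) standard deviation
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))

def log_lik(mu, sigma):
    """Gaussian log-likelihood of the sample x at parameters (mu, sigma)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Central finite differences approximate the second derivatives at the maximum
h = 1e-4
d2_mu = (log_lik(mu_hat + h, sigma_hat) - 2 * log_lik(mu_hat, sigma_hat)
         + log_lik(mu_hat - h, sigma_hat)) / h**2
d2_sigma = (log_lik(mu_hat, sigma_hat + h) - 2 * log_lik(mu_hat, sigma_hat)
            + log_lik(mu_hat, sigma_hat - h)) / h**2

n = len(x)
# Analytic values at the MLE: -n / sigma^2 and -2n / sigma^2
print(d2_mu, -n / sigma_hat**2)
print(d2_sigma, -2 * n / sigma_hat**2)
```

The numerical and analytic values should agree closely; in real applications the finite-difference (or automatic-differentiation) route is what optimization libraries use when no analytic formulas are available.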

The second derivatives measure the change in the slope of the likelihood function, and it makes sense that they come up here because the variance of the maximum likelihood estimator is related intuitively to the shape of the likelihood function near the maximum. If the likelihood surface is very flat around the estimate, there is less certainty, whereas if the MLE is at a very sharp peak in the likelihood surface, there is a lot of certainty—another sample from the same distribution is likely to give nearly the same maximum. The second derivatives measure the local curvature of the likelihood surface near the maximum.

Once you have the derivatives (using the values of the parameters at the maximum), the next step is to invert this matrix. In practice, this cannot be done analytically for all but the simplest statistical models. For the Gaussian case, the matrix is diagonal, so the inverse is just the diagonal matrix whose entries are the reciprocals of the second derivatives.
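Using the standard second derivatives of the Gaussian log-likelihood evaluated at the MLEs (−n/σ̂² for the mean, −2n/σ̂² for the standard deviation, with the off-diagonal terms vanishing at the maximum), the matrix and its inverse look like this:

```latex
FI \;=\;
\begin{pmatrix}
-\dfrac{n}{\hat\sigma^2} & 0 \\[6pt]
0 & -\dfrac{2n}{\hat\sigma^2}
\end{pmatrix}
\qquad\Longrightarrow\qquad
FI^{-1} \;=\;
\begin{pmatrix}
-\dfrac{\hat\sigma^2}{n} & 0 \\[6pt]
0 & -\dfrac{\hat\sigma^2}{2n}
\end{pmatrix}
```

Taking the negative of the diagonal entries then gives positive variances, σ²/n for the mean and σ²/(2n) for the standard deviation.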

Finally, once you have the inverse, you simply take the negative of the diagonal entry in the matrix that corresponds to the parameter you're interested in, and then take the expectation. So the variance for the mean would be σ²/n. This means that the distribution of μ̂_MLE is Gaussian, with mean equal to the true mean, and standard deviation equal to the true standard deviation divided by the square root of n.
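This σ/√n result is easy to check by simulation. In this sketch (the true parameter values, sample size, and number of replicates are arbitrary illustrative choices), we draw many independent samples, compute the MLE of the mean for each, and look at the spread of those estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_sigma, n = 5.0, 2.0, 50

# Draw many independent samples of size n and record the MLE of the mean for each
mu_hats = np.array([rng.normal(true_mu, true_sigma, n).mean()
                    for _ in range(20000)])

print(mu_hats.mean())   # close to the true mean, 5.0
print(mu_hats.std())    # close to sigma / sqrt(n) = 2 / sqrt(50) ≈ 0.283
```

A histogram of `mu_hats` would also look convincingly Gaussian, as the theory predicts.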

You probably noticed that this calculation also tells us the distribution for the MLE of the variance. Since the estimate of the variance can only be positive, it's in some sense surprising that statistical theory says that it should have a Gaussian distribution (which we know gives probabilities to negative numbers). The resolution of this contradiction is that one of the assumptions under which the MLEs approach Gaussian distributions is that the sample size is very large, which limits the applicability of the theory in many practical situations. For small sample sizes, the distribution of the variance estimate is not very Gaussian at all. In fact, for Gaussian data, n times the MLE of the variance divided by the true variance follows a chi-square distribution with n − 1 degrees of freedom, which is noticeably right-skewed when n is small.
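The small-sample behavior is also easy to see by simulation. This sketch (sample size, true σ, and replicate count are illustrative choices) computes the MLE of the variance for many tiny samples and checks that the resulting distribution is right-skewed and never negative—quite unlike a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(2)
true_sigma, n = 2.0, 5            # deliberately tiny sample size

# MLE of the variance (np.var uses the 1/n normalization by default)
var_hats = np.array([np.var(rng.normal(0.0, true_sigma, n))
                     for _ in range(20000)])

# A Gaussian is symmetric (skewness 0); these estimates are clearly right-skewed
standardized = (var_hats - var_hats.mean()) / var_hats.std()
skewness = np.mean(standardized ** 3)
print(skewness)                    # well above 0 for n = 5

# None of the estimates are negative, unlike draws from a Gaussian would be
print((var_hats < 0).any())
```

As n grows, the skewness shrinks toward zero and the Gaussian approximation from the theory becomes increasingly accurate.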