HOW TO MAXIMIZE THE LIKELIHOOD ANALYTICALLY
Although numerical approaches are always possible nowadays, it’s still faster (and more fun!) to find the exact mathematical maximum of the likelihood function, if you can. We’ll derive the MLEs for the univariate Gaussian likelihood introduced earlier. The problem is complicated enough to illustrate the major concepts of likelihood, as well as some important mathematical notations and tricks that are widely used to solve statistical modeling and machine learning problems.
Once we have written down the likelihood function, the next step is to find the maximum of this function by taking the derivatives with respect to the parameters, setting them equal to zero, and solving for the maximum likelihood estimators. Needless to say, this probably seems very daunting at this point. But if you make it through this book, you’ll look back at this problem with fondness because it was so simple to find the analytic solutions.
The mathematical trick that makes this problem go from looking very hard to being relatively easy to solve is the following: take the logarithm. Instead of working with likelihoods, in practice, we’ll almost always use log-likelihoods because of their mathematical convenience. (Log-likelihoods are also easier to work with numerically because instead of very small positive numbers near zero, we can work with big negative numbers.) Because the logarithm is monotonically increasing (it doesn’t change the ranks of numbers), the parameter values that maximize the log-likelihood also maximize the likelihood. So here’s the mathematical magic, writing the n observations as X_1, ..., X_n:
\[ \log L = \log \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} = \sum_{i=1}^{n} \log\left[ \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \right] = \sum_{i=1}^{n} \left[ -\log\sigma - \frac{1}{2}\log(2\pi) - \frac{(X_i-\mu)^2}{2\sigma^2} \right] \]
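In the equations above, I have used several properties of the logarithm: log(1/x) = -log(x), log(xy) = log(x) + log(y), and log(e^x) = x. This formula for the log-likelihood might not look much better, but remember that we are trying to find the parameters that maximize this function. To do so, we want to take its derivative with respect to the parameters and set it equal to zero. To find the MLE of the mean, μ, we will take derivatives with respect to μ. Using the linearity of the derivative operator, we have
\[ \frac{\partial}{\partial\mu}\log L = \frac{\partial}{\partial\mu}\sum_{i=1}^{n}\left[ -\log\sigma - \frac{1}{2}\log(2\pi) - \frac{(X_i-\mu)^2}{2\sigma^2} \right] = \sum_{i=1}^{n}\left[ -\frac{\partial}{\partial\mu}\log\sigma - \frac{1}{2}\frac{\partial}{\partial\mu}\log(2\pi) - \frac{\partial}{\partial\mu}\frac{(X_i-\mu)^2}{2\sigma^2} \right] = 0 \]
Since two of the terms have no dependence on μ, their derivatives are simply zero. Taking the derivatives, we get
\[ \frac{\partial}{\partial\mu}\log L = \sum_{i=1}^{n}\left[ -0 - 0 + \frac{2(X_i-\mu)}{2\sigma^2} \right] = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\mu) = 0 \]
where in the last step I took out of the sum the σ² that didn’t depend on i. Since we can multiply both sides of this equation by σ², we are left with
\[ \sum_{i=1}^{n}(X_i-\mu) = \sum_{i=1}^{n}X_i - n\mu = 0 \]
which we can solve for μ:
\[ \mu = \mu_{MLE} = \frac{1}{n}\sum_{i=1}^{n}X_i \]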
This equation tells us the value of μ that we should choose if we want to maximize the likelihood. I hope that it is clear that the suggestion is simply to choose the sum of the observations divided by the total number of observations, in other words, the average. I have written μ_MLE to remind us that this is the maximum likelihood estimator for the parameter, rather than the parameter itself.
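The result can be checked numerically. The following is only a minimal sketch on simulated data, not part of the derivation: the helper names (`gaussian_density`, `log_likelihood`, `dlogl_dmu`), the seed, the sample size, and the true parameters 5.0 and 2.0 are all arbitrary choices made for illustration. It checks that the derivative with respect to the mean vanishes at the average, that the log-likelihood drops when the mean is moved away from the average, and that the raw likelihood underflows to zero while the log-likelihood stays an ordinary number.

```python
import math
import random

def gaussian_density(x, mu, sigma):
    """Gaussian density for a single observation x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(xs, mu, sigma):
    """Sum of log densities, written in the expanded form of the logarithm."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - squared_error / (2 * sigma ** 2)

def dlogl_dmu(xs, mu, sigma):
    """Derivative of the log-likelihood with respect to mu."""
    return sum(x - mu for x in xs) / sigma ** 2

random.seed(0)                                      # arbitrary seed, for reproducibility
xs = [random.gauss(5.0, 2.0) for _ in range(2000)]  # simulated data (illustrative values)
sigma = 2.0                                         # held fixed; mu_mle does not depend on it

mu_mle = sum(xs) / len(xs)                          # the average

# The derivative vanishes at the average (up to floating-point rounding) ...
gradient_at_mle = dlogl_dmu(xs, mu_mle, sigma)

# ... and shifting mu in either direction lowers the log-likelihood.
ll_at_mle = log_likelihood(xs, mu_mle, sigma)
ll_nearby = [log_likelihood(xs, mu_mle + shift, sigma) for shift in (-0.5, -0.1, 0.1, 0.5)]

# The raw likelihood is a product of 2000 numbers below 1, so it underflows to 0.0,
# whereas the log-likelihood is a finite (large, negative) float.
likelihood = math.prod(gaussian_density(x, mu_mle, sigma) for x in xs)

print("mu_mle =", mu_mle)
print("d logL / d mu at mu_mle =", gradient_at_mle)
print("log-likelihood drops away from mu_mle:", all(ll_at_mle > ll for ll in ll_nearby))
print("likelihood =", likelihood, " log-likelihood =", ll_at_mle)
```

Changing `sigma` to any other positive value leaves `mu_mle` unchanged, because the average is computed without reference to it.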
Notice that although the likelihood function (illustrated in Figure 4.1) depends on both parameters, the formula we obtained for μ_MLE doesn’t depend on σ. A similar (slightly more complicated) derivation is also possible for the standard deviation:
\[ \sigma_{MLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2} \]
In the MLE for the standard deviation, there is an explicit dependence on the mean. To maximize the likelihood, the derivatives with respect to all the parameters must be zero at the same time. So to get the MLE for the standard deviation, you first need to calculate the MLE for the mean and plug it into the formula for the MLE of the standard deviation.
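The plug-in procedure can be sketched in a few lines. This is an illustrative check on simulated data, and it assumes the usual Gaussian result that the MLE of the standard deviation is the square root of the average squared deviation from the mean (dividing by n, not n - 1). The helper `dlogl_dsigma`, the seed, and the simulation parameters are hypothetical choices, not taken from the text.

```python
import math
import random
import statistics

def dlogl_dsigma(xs, mu, sigma):
    """Derivative of the Gaussian log-likelihood with respect to sigma."""
    return -len(xs) / sigma + sum((x - mu) ** 2 for x in xs) / sigma ** 3

random.seed(1)                                      # arbitrary seed, for reproducibility
xs = [random.gauss(-1.0, 3.0) for _ in range(500)]  # simulated data (illustrative values)
n = len(xs)

# Step 1: the MLE for the mean is the average.
mu_mle = sum(xs) / n

# Step 2: plug it into the formula for the standard deviation (note the division by n).
sigma_mle = math.sqrt(sum((x - mu_mle) ** 2 for x in xs) / n)

# With both estimates plugged in, the derivative with respect to sigma
# is zero up to floating-point rounding.
gradient_at_mle = dlogl_dsigma(xs, mu_mle, sigma_mle)

# Plugging in some other value for the mean (here 0.0) gives a different, larger answer.
sigma_with_wrong_mean = math.sqrt(sum((x - 0.0) ** 2 for x in xs) / n)

print("mu_mle =", mu_mle, " sigma_mle =", sigma_mle)
print("d logL / d sigma at the MLEs =", gradient_at_mle)
print("population std dev from the statistics module =", statistics.pstdev(xs))
print("sigma using mean 0.0 instead =", sigma_with_wrong_mean)
```

The standard library's `statistics.pstdev` also divides by n, so it should agree with `sigma_mle`. `statistics.stdev` divides by n - 1 and gives a slightly larger value.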
In general, setting the derivatives of the likelihood with respect to all the parameters to zero leads to a set of equations with as many equations and unknowns as the number of parameters. In practice, there are few problems of this kind that can be solved analytically.
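For the univariate Gaussian, this set of equations can be written out explicitly. The display below is a sketch that assumes the notation X_i for the i-th observation and n for the number of observations. It works with the log-likelihood, which has its maximum at the same parameter values as the likelihood:

\[ \frac{\partial}{\partial\mu}\log L = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\mu) = 0, \qquad \frac{\partial}{\partial\sigma}\log L = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(X_i-\mu)^2 = 0 \]

There are two parameters, so there are two equations in two unknowns. The first can be solved for μ without knowing σ, and substituting that solution into the second gives σ² as the average of the squared deviations from the mean. This is one of the few cases where the system can be solved analytically.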