DERIVING THE E-M ALGORITHM FOR THE MIXTURE OF GAUSSIANS
The E-M algorithm is a very general strategy for deriving iterative optimization procedures that maximize the likelihood. The key idea of the E-M algorithm is that we will not even try to find analytical formulas for the parameters by solving ∂/∂θ log L = 0. Instead, we will replace the likelihood with a function that is easier to optimize, in which some of the unknown variables have been fixed. We will then alternate between maximizing this function (the M-step) and filling in the unknown variables with our best guesses, or "expected" values (the E-step). The amazing result (the proof of which is beyond the scope of this book) is that if the objective function we maximize is chosen cleverly, we will be guaranteed to also increase the likelihood at each iteration.
The alternative function that we will optimize for the purposes of this book is the so-called "expected complete log-likelihood," following Michael Jordan's interpretation of E-M (Jordan and Jacobs 1994). To write down this function, we first define the "complete likelihood," namely, the likelihood of the data if we knew which cluster each datapoint belonged to. This complete likelihood, CL, for the mixture model is

CL = \prod_{i=1}^{n} \prod_{c=1}^{K} [\pi_c P(X_i | \theta_c)]^{Z_{ic}} = \prod_{i=1}^{n} \prod_{c=1}^{K} [\pi_c N(X_i | \mu_c, \Sigma_c)]^{Z_{ic}}
where once again the first formula is the general case for all mixture models, and the second formula is for the mixture of Gaussians. To write this formula, I used the indicator variable Z_{ic}, which has the value 1 if the ith datapoint belongs to the cth cluster and 0 otherwise. Notice the effect this indicator variable has: all the terms in the product that correspond to clusters to which the ith datapoint does not belong will be raised to the power of 0 (they will be set to 1). The terms that correspond to the cluster to which the ith datapoint does belong will be raised to the power of 1 and will be included in the product. Again, the difference between this "complete" likelihood and the actual likelihood is that to write this formula, we need to know the Z indicator variables that tell us which cluster every datapoint belongs to.
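To make the role of the indicator variables concrete, here is a minimal numerical sketch in Python (the data, assignments, and parameter values are all made up for illustration) that evaluates the complete log-likelihood when the Z's are known:

```python
import numpy as np

# Hypothetical toy data: 4 points in 2-D, K = 2 clusters, with known assignments Z.
X = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.2], [4.9, 5.1]])
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # Z[i, c] = 1 iff point i is in cluster c
pi = np.array([0.5, 0.5])                        # mixing parameters
mu = np.array([[0.0, 0.0], [5.0, 5.0]])          # cluster means
cov = [np.eye(2), np.eye(2)]                     # cluster covariance matrices

def log_gauss(x, m, S):
    """Log density of a multivariate Gaussian N(x | m, S)."""
    d = len(m)
    diff = x - m
    return -0.5 * (diff @ np.linalg.solve(S, diff)
                   + np.log(np.linalg.det(S)) + d * np.log(2 * np.pi))

# Complete log-likelihood: the Z[i, c] exponents mean that only the term for
# the cluster each datapoint belongs to survives in the product.
log_CL = sum(Z[i, c] * (np.log(pi[c]) + log_gauss(X[i], mu[c], cov[c]))
             for i in range(len(X)) for c in range(len(pi)))
```

Because each Z[i, c] is either 0 or 1, only one term per datapoint survives, exactly as in the product formula above.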
Although this new objective function might not look much simpler, we can now proceed to take its logarithm and its expectation over the hidden variables:

\langle \log CL \rangle = \left\langle \sum_{i=1}^{n} \sum_{c=1}^{K} Z_{ic} \log[\pi_c N(X_i | \mu_c, \Sigma_c)] \right\rangle = \sum_{i=1}^{n} \sum_{c=1}^{K} \langle Z_{ic} \rangle \log[\pi_c N(X_i | \mu_c, \Sigma_c)]
where in the second step we use the linearity of the expectation operator to move it inside the sum. Given what you know from the previous sections about derivatives and logarithms of Gaussian distributions, you probably can already guess that it's now possible to solve ∂/∂θ ⟨log CL⟩ = 0 and optimize the expected complete log-likelihood. For example, we will start with the mean parameters:

\frac{\partial}{\partial \mu_k} \langle \log CL \rangle = \sum_{i=1}^{n} \langle Z_{ik} \rangle \frac{\partial}{\partial \mu_k} \left[ -\frac{1}{2} (X_i - \mu_k)^T \Sigma_k^{-1} (X_i - \mu_k) \right] = 0
First, notice that I didn't write the sum over all the clusters in this formula: the derivative with respect to the mean of the kth cluster implies that all the terms that don't depend on μ_k will have zero derivatives and therefore not matter for the optimization. So the terms related to all the other clusters won't matter for the kth mean. Similarly, in the second step, I haven't bothered to write the rest of the terms of the (log) Gaussian distribution because they don't depend on μ_k and therefore will also disappear. To simplify the derivation, we will use the matrix calculus trick that ∂/∂x [x^T A x] = x^T (A + A^T), where A is a matrix, and the derivative is with respect to a whole vector, x. Because the covariance matrix is symmetric, it equals its transpose, so we have

\sum_{i=1}^{n} \langle Z_{ik} \rangle (X_i - \mu_k)^T \Sigma_k^{-1} = 0
As before, we can multiply each term of the sum (and the zero vector on the other side of the equation) by the covariance matrix, which gives

\sum_{i=1}^{n} \langle Z_{ik} \rangle (X_i - \mu_k) = 0
This can be solved to get the update formula for the mean I gave above,

\mu_k = \frac{\sum_i \langle Z_{ik} \rangle X_i}{\sum_i \langle Z_{ik} \rangle}

To derive the formula for the mixing parameters, we have to take into account the constraint that they have to sum to 1. We add this constraint to the objective function using a Lagrange multiplier, λ. The constraint can be written
1 - \sum_c \pi_c = 0, so the expected complete log-likelihood with the constraint added is

\langle \log CL \rangle + \lambda \left( 1 - \sum_c \pi_c \right)
Since the Gaussian portion of the objective function doesn't depend on the mixing parameters, its derivatives will be zero, as will those of the terms in the sums that don't depend on the kth mixing parameter. We get

\frac{\partial}{\partial \pi_k} \left[ \langle \log CL \rangle + \lambda \left( 1 - \sum_c \pi_c \right) \right] = \sum_{i=1}^{n} \frac{\langle Z_{ik} \rangle}{\pi_k} - \lambda = 0
which we can solve to give

\pi_k = \frac{1}{\lambda} \sum_{i=1}^{n} \langle Z_{ik} \rangle
Remember that we will get a similar formula for each of the K mixing parameters. To figure out what λ is, we substitute the update equation into the constraint:

\sum_c \pi_c = \sum_c \frac{1}{\lambda} \sum_i \langle Z_{ic} \rangle = \frac{1}{\lambda} \sum_i \sum_c \langle Z_{ic} \rangle = \frac{1}{\lambda} \sum_i 1 = \frac{n}{\lambda} = 1
which means that λ = n. Notice that in going from the third to the fourth formula (starting from the left) I used the fact that the ⟨Z_{ic}⟩ for each datapoint must also sum to 1 over the clusters. This is because each datapoint must be fully assigned; it's just that we don't know which cluster it belongs to. Another way to look at it is that we imagine that each datapoint must have belonged to one of the Gaussians.
Deriving the updates for the covariance involves the same sort of matrix calculus tricks we saw in Chapter 4. Overall, although the derivation is very similar to what we have seen for MLE, remember that the "M" or maximization step does not actually maximize the likelihood function. Instead, we are maximizing a simpler objective function, the expected complete log-likelihood. The magic is that by repeatedly maximizing the expected complete log-likelihood and filling in the hidden variables with their expectations (the "E" step), we are always increasing the likelihood. By running this process until convergence, we are guaranteed to find a local maximum of the likelihood.
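To illustrate this guarantee, here is a sketch of the full E-M loop for a one-dimensional mixture of two Gaussians (the simulated data and starting values are arbitrary), tracking the actual log-likelihood to check that it never decreases:

```python
import numpy as np

# Simulated 1-D data from two well-separated Gaussians, with arbitrary starts.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(6, 1, 50)])
K = 2
pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
var = np.ones(K)

def loglik(X, pi, mu, var):
    """Actual (incomplete-data) log-likelihood of the mixture."""
    dens = pi * np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

lls = [loglik(X, pi, mu, var)]
for _ in range(30):
    # E-step: responsibilities <Z_ik> via Bayes' theorem.
    dens = pi * np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    R = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete log-likelihood.
    Nk = R.sum(axis=0)
    mu = (R * X[:, None]).sum(axis=0) / Nk
    var = (R * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(X)
    lls.append(loglik(X, pi, mu, var))
```

Each pass increases (or leaves unchanged) the log-likelihood, even though the M-step only maximizes the expected complete log-likelihood.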
Another important comment here is that although for the mixture of Gaussians we were not able to derive closed-form solutions to maximize the likelihood, we were able to derive analytic solutions for the M-step. For more complicated models, even this might not be possible. We might only find a formula that increases (rather than maximizes) the expected complete log-likelihood at each iteration, or we might give up on formulas altogether and only numerically maximize (or increase) the expected complete log-likelihood. The E-M algorithm also permits us to include analytic updates for some of the parameters and mix these with numerical updates for others where we can't solve the equations. Sometimes these kinds of algorithms are called "generalized" E-M, or sometimes just E-M (even though they might not actually be doing "M" at the M-step). Amazingly, they will all work: we will be guaranteed to be increasing the likelihood at each step.
To find the update for the E-step, we use Bayes' theorem to fill in the expectations of the hidden variables. We have

\langle Z_{ik} \rangle = P(Z_{ik} = 1 | X, \theta) = \frac{P(X | Z_{ik} = 1, \theta) P(Z_{ik} = 1 | \theta)}{P(X | \theta)}
The first equality uses the fact that the expectation of a Bernoulli variable is just the probability of the positive outcome, and the second is Bayes' theorem. To simplify this formula, we note that for all the terms relating to datapoints other than i, P(X_j | Z_{ik} = 1, θ) = P(X_j | θ), so those terms will cancel out:

\langle Z_{ik} \rangle = \frac{P(X_i | Z_{ik} = 1, \theta) \, \pi_k}{P(X_i | \theta)} = \frac{\pi_k N(X_i | \mu_k, \Sigma_k)}{\sum_c \pi_c N(X_i | \mu_c, \Sigma_c)}
where the last step is the formula given above for the E-step.
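Numerically, the E-step is just this ratio. Here is a one-dimensional sketch (the datapoint and parameter values are made up for illustration):

```python
import numpy as np

# E-step for a single datapoint x under a 1-D mixture of two Gaussians.
x = 0.9
pi = np.array([0.7, 0.3])    # mixing parameters
mu = np.array([0.0, 5.0])    # means
var = np.array([1.0, 1.0])   # variances

# Numerator: pi_k N(x | mu_k, var_k) for each cluster k.
num = pi * np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
# Responsibilities <Z_ik>: normalize over clusters (Bayes' theorem).
resp = num / num.sum()
```

The responsibilities sum to 1 across the clusters, which is the "fully assigned" property used in the λ = n derivation above; here the point at 0.9 is assigned almost entirely to the cluster centered at 0.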