E-M UPDATES FOR THE MIXTURE OF GAUSSIANS
As with the K-means clustering, it should be clear that the objective function for the Gaussian mixture model is very high-dimensional (has many parameters) and therefore will in general be difficult to optimize. For example, in the bivariate case (typically used for illustration purposes, see Figure 6.4 for an example) even with only two clusters there are already 11 parameters, so brute-force exploration of the likelihood surface is already impractical. However, as with K-means, it is possible to derive an iterative procedure (algorithm) that is guaranteed to increase the objective function at each step. In this case, because the objective function is the likelihood, if the algorithm reaches the maximum of the objective function, we will have maximum likelihood estimates (ML estimates) of the parameters. It’s an example of the E-M algorithm, a very general learning
FIGURE 6.4 Clustering using Gaussian mixture models (Fraley et al. 2012) on the CD4 and CD8 antigen expression levels from ImmGen. Black circles identify the means identified by the mixture of Gaussians and ellipses illustrate the covariance matrices. Each gray symbol indicates one cell type, and the shape of the symbols indicates the cluster to which it is most likely to belong to. Notice that because the covariances of the clusters are different, cell types are not necessarily most likely to be assigned to the closest mean (as with fc-means). (a, b) show the convergence of E-M assuming 4 clusters, and (c, d) show the convergence with 5 clusters. (a, c) Show a Gaussian mixture model with full covariance, while (b, d) show the Gaussian Mixture model with diagonal covariance. As with the К-means example (in Chapter 5) the Gaussian Mixture model has a difficult time identifying the small CD4+CD8+ cluster with К = 4, preferring instead to break up the large, dispersed CD4-CD8- into two clusters. With К = 5 it does seem to correctly identify the small cluster.
(or inference) algorithm for probabilistic models with hidden variables. The general procedure outlined for deriving the E-M algorithm can be applied to a large class of probabilistic models with hidden variables.
Because the derivation of the E-M algorithm is a bit long, I will start by giving the answers. We start by “assigning” each datapoint to a cluster.
Since the variable Z represents the assignment of each datapoint to a cluster, our best guess to which cluster the datapoint belongs is the expectation (or average) of the hidden variable, Z. This is the so-called “E-step” or “expectation-step” of the E-M algorithm. Because we have interpreted the indicator variable as a random (albeit hidden) variable, we assign a probability distribution to this variable and compute its expectation, which we will indicate as (Z). The angle brackets are a notation for averages or expectations widely used in physics.
Again, the first formula is for the general case of the mixture model, and the second is for the mixture of Gaussians. Another way of looking at the E-step is that we “fill in” or infer the values of the hidden variables by assigning them to their expected values. Because in this case, the hidden variables are 0 or 1 indicator variables, computing their expectations corresponds to inferring the (posterior) probability that the observation i belongs to the cluster k (the expectation of a 0 or 1 variable is just the probability of observing 1).
After assigning datapoints to clusters by filling in the hidden variables, we re-estimate the parameters to maximize the objective function. This is the so-called maximization or “M-step.” For the mixture of Gaussians, we have the following formulas:
This is the “update equation” for the mean of the Gaussians. Notice the similarity of this equation to the formula for the means of К-means. We have a weighted average, where the weights are the expected assignments of each datapoint to each cluster. We also need a formula to calculate the mixing parameters, or the relative fractions of each Gaussian.
The interpretation of this formula is simply that it’s the expected number of datapoints assigned to the cluster k, divided by the total number of data- points. This is just the “expected” fraction of the datapoints that are assigned to each cluster. Finally, we need a formula for the covariance matrices:
which again can be interpreted as a weighted version of the standard formula for the maximum likelihood estimations (MLEs) of the covariance. Once we’ve re-estimated the parameters at the M-step, we just go back to the E-step and fill in the hidden variables again.
Thus, the implementation of E-M for the mixture of Gaussians is just a matter of calculating these four formulas, which in many cases is quite straightforward using a package like MATLAB® or R. The algorithm alternates between filling in the expected assignments for each datapoint and then optimizing the parameters of the clusters and the mixing parameters. Eventually, the assignments and parameters stop changing and the algorithm has “converged” to a local maximum in the likelihood.
Although the E-M algorithm is guaranteed to converge to a local maximum in the likelihood, there is no guarantee that a mixture model will have only one local optimum. Typically, several initializations of the parameters are chosen and the algorithm is run to convergence. Sometimes, the initialization will be done based on another clustering algorithm, like К-means to make sure that the initial guesses are not too far off. After a few different initializations are tried, the likelihoods under the final parameters for each initialization are compared, and the best parameters are chosen as the final answer.
In general, the E-M algorithm for any mixture of distributions is relatively simple to implement. The formulas for the hidden variable and mixing parameter are the same as mentioned earlier, and the updates for the parameters are typically weighted versions of the standard MLEs for the parameters. This is one of the reasons that it is a favorite optimization algorithm for mixture models. Note that this does not imply that E-M is the “best” or “fastest” way to optimize the likelihood—in most cases it’s neither. But in many cases, it is the simplest optimization strategy to implement, so if you have a probabilistic model that you want to fit, and you can derive E-M updates, this is probably where you want to start.