# THE GAUSSIAN MIXTURE MODEL

K-means (which we considered in Chapter 5) is a very widely used clustering method because of its conceptual and algorithmic simplicity. However, it can’t be considered a statistical modeling technique because its models are not probability distributions. Fortunately, it’s quite straightforward to generalize K-means to a probabilistic setting, and in doing so, we will finally have our first probabilistic machine learning method. We simply interpret the *K* cluster means of K-means as the means of multivariate Gaussian distributions (described in Chapter 4). Formally, this implies that each observation is now drawn i.i.d. from one of several Gaussian component distributions. The interesting part is that we don’t know which datapoints were drawn from which of these Gaussians, nor do we know the mean and variance parameters of these Gaussians. Amazingly, if we define the update rule (or algorithm) cleverly, even if we start with a random guess, the computer will be able to automatically learn (or infer) all of this from the data.

One major difference between the mixture of Gaussians and the K-means algorithm is that the datapoints will not be assigned absolutely to one cluster or another. Although in the probabilistic model we assume each datapoint is drawn from one of the component Gaussians that represent each cluster, we consider the “true” assignment to the cluster to be unobserved. Instead of making a so-called “hard assignment” to a cluster, we will assign each datapoint fractionally according to its probability of being part of each cluster. This type of “fuzzy” assignment allows datapoints that are really in between two clusters to be partly assigned to both, reflecting the algorithm’s uncertainty about which cluster each one really belongs in.
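To make the contrast between hard and fractional assignment concrete, here is a minimal sketch in one dimension. It assumes two clusters whose means and standard deviations are already known (the values are made up for illustration) and that both clusters are equally likely a priori; the function names `fractional_assignment` and `hard_assignment` are hypothetical and are not part of the algorithm as developed in this chapter.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fractional_assignment(x, means, sds):
    """Probability that x belongs to each cluster.

    Assumption: every cluster is equally likely before seeing x,
    so the probabilities are just the normalized Gaussian densities.
    """
    densities = [gaussian_pdf(x, m, s) for m, s in zip(means, sds)]
    total = sum(densities)
    return [d / total for d in densities]

def hard_assignment(x, means):
    """K-means style assignment: index of the nearest cluster mean."""
    return min(range(len(means)), key=lambda c: abs(x - means[c]))

# Illustrative (made-up) cluster parameters.
means = [0.0, 4.0]
sds = [1.0, 1.0]

for x in (0.5, 2.0, 3.5):
    print(x, hard_assignment(x, means), fractional_assignment(x, means, sds))
```

A point at 2.0 sits exactly halfway between the two means, so the fractional assignment gives it one-half to each cluster, whereas the hard assignment is forced to pick one of them.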

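As promised, the mixture of Gaussians is a probabilistic model, and its objective function is a likelihood function. Let’s start by writing it down:

$$L = P(X|\theta) = \prod_{i} \sum_{c=1}^{K} P(Z_{ic} = 1|\theta)\,P(X_i|Z_{ic} = 1, \theta) = \prod_{i} \sum_{c=1}^{K} \pi_c\,N(X_i|\mu_c, \Sigma_c)$$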

In this formula, we start on the left at the general formula for the likelihood, define the K-component mixture model in general, and finally specify the Gaussian mixture in the last formula on the right. In a Gaussian mixture, each (possibly high-dimensional) datapoint $X_i$ is drawn from one of $K$ Gaussian distributions, $N(X|\mu_c, \Sigma_c)$. I will use an indicator variable, $Z$, to represent the assignment to each component of the mixture: $Z_{ic} = 1$ if the $i$th observation is in the $c$th cluster, and 0 if it is not. Notice that I have introduced a so-called “prior” probability $P(Z_{ic} = 1|\theta) = \pi_c$ of drawing an observation from each class of the mixture. The mixture model says, first pick a cluster (or class) according to the prior probability, and then draw an observation $X_i$ from a Gaussian distribution *given* that class. Because in the mixture model the class for each datapoint is unknown, we sum (or “marginalize”) over the possible classes for each datapoint (hence the sum over $c$). In the standard formulation of the mixture of Gaussians, this prior is simply a multivariate Bernoulli distribution (also known as a discrete distribution or sometimes a multinomial) whose parameters are the “mixing parameters,” $\pi$ (not to be confused with 3.1415... that occurs in the formula for the Gaussian probability). In order to make the prior a proper probability distribution, these mixing parameters must sum to 1, $\sum_{c=1}^{K} P(Z_{ic} = 1|\theta) = \sum_{c=1}^{K} \pi_c = 1$. You can think of these mixing parameters as our expectation about the fraction of datapoints from each cluster.
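The “pick a cluster, then draw an observation” description and the likelihood it implies can be sketched in a few lines. This is a one-dimensional illustration under stated assumptions, not the fitting algorithm itself: the mixing parameters, means, and standard deviations are made-up values, the Gaussians are univariate rather than multivariate, and the function names `sample_mixture` and `log_likelihood` are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mean, sd):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def sample_mixture(n, weights, means, sds, rng):
    """Draw n (z, x) pairs from the mixture.

    First pick a cluster z according to the mixing parameters,
    then draw x from the Gaussian that belongs to that cluster.
    """
    draws = []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights)[0]
        x = rng.gauss(means[z], sds[z])
        draws.append((z, x))
    return draws

def log_likelihood(xs, weights, means, sds):
    """log L = sum over i of log( sum over c of pi_c * N(x_i | mu_c, sd_c) )."""
    total = 0.0
    for x in xs:
        p_x = sum(w * gaussian_pdf(x, m, s)
                  for w, m, s in zip(weights, means, sds))
        total += math.log(p_x)
    return total

# Illustrative (made-up) parameters; the weights must sum to 1.
weights = [0.7, 0.3]
means = [0.0, 4.0]
sds = [1.0, 1.0]

rng = random.Random(0)
draws = sample_mixture(1000, weights, means, sds, rng)
xs = [x for _, x in draws]  # in a real clustering problem, only xs is observed
print(log_likelihood(xs, weights, means, sds))
```

Note that `log_likelihood` never looks at the sampled cluster labels: the sum over clusters inside the logarithm is the marginalization over the unknown class. For datapoints extremely far from every component the inner sum can underflow to zero, so a more careful implementation would work with log densities throughout.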

In the context of probabilistic models, we will interpret the indicator variable as a random variable (just like our observations) but whose values we have not measured: We imagine that some truth exists about whether the $i$th observation from the pool belongs to the $c$th cluster, but we simply have not observed it (Figure 6.1). This type of variable is referred to as a “hidden” variable. Note that we have now increased the dimensionality of our data: instead of each datapoint simply being some observations $X_i$, each datapoint can now be thought of as having an observed component $X_i$ and a “hidden” component $Z_i$ that specifies which cluster it was drawn from.

The interpretation of the cluster assignment as a hidden variable allows us to motivate the specific form that we chose for the mixture model using the rules of probability. The mixture model can be thought of as writing the probability of the data $X$ as

$$P(X) = \sum_{Z} P(X, Z) = \sum_{Z} P(Z)\,P(X|Z)$$

where I have left out the parameter vectors and the product over observations for clarity. In some sense, by adding the hidden variables we are making our problem more complicated than it really is: we have a dataset $X$, and rather than just analyzing it, we are artificially supplementing it with hidden variables and using mathematical tricks to keep it all ok. But in fact we are doing this for good reason: Our data, $X$, has a very complicated distribution that we cannot model using a standard probability distribution. However, we believe that if we divide up our data into “clusters” (as determined by the hidden variable, $Z$), each cluster will have a much simpler distribution that we can model (in our case using Gaussians). Said another way, $P(X)$ was too complicated to model directly, but the “class-conditional density” $P(X|Z)$ has a simple form. Therefore, we have traded a difficult distribution of our data for a joint distribution of our data and some hidden variables whose conditional distribution we can characterize.

FIGURE 6.1 (a) The mixture model adds a hidden indicator variable, Z, to each of the observed variables, X. (b) A mixture model can be thought of as drawing observations from a series of pools. First, an indicator variable, Z, is drawn from a pool that tells us which cluster (or class) each datapoint, X, belongs to. Depending on Z, X is assumed to have different parameters (i.e., depending on which cluster X belongs to). In a clustering situation, the Z are not observed; we have to “fill in” or infer their values.

Figure 6.2a illustrates the idea of trading a single complicated distribution for two Gaussians in one dimension. Figure 6.2b shows an example of how a two-component mixture model can be used to fit single-cell RNA-seq data that, as we saw in Chapter 2, do not fit standard probability distributions very well. Mixture models of this type are used in practice to analyze single-cell sequencing data (e.g., Kharchenko et al. 2014). As you can probably imagine, if the number of components in the mixture gets large enough, it’s possible to represent arbitrarily complicated distributions using mixture models. In fact, in the computer vision world, Gaussian mixture models with large numbers of components can be used to model everything from hand-written digits on bank cheques to the shapes of bacilli in micrographs of sputum.

FIGURE 6.2 Trading a complicated distribution for a mixture of simple conditional distributions. In (a), the graph on the left shows an asymmetric probability distribution that doesn’t fit well to any standard model. The graph on the right shows that the distribution on the left is actually an equal mixture of two very simple Gaussian distributions. In (b), the distribution of single-cell RNA-seq data (discussed in Chapter 2) is shown as gray bars. The dotted trace shows a two-component mixture model (a mixture of two-thirds Poisson and one-third Gaussian) that does a reasonably good job of fitting the data. The parameters for each distribution are shown in parentheses.
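The idea of building a complicated distribution out of simple class-conditional pieces can be checked numerically. The sketch below assumes an equal mixture of two one-dimensional Gaussians with made-up parameters (they are not the ones plotted in the figure) and uses hypothetical helper names; it evaluates the mixture density as a sum over the hidden class, confirms that it still integrates to approximately 1, and shows that it is not symmetric about its mean.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sds):
    """P(x) = sum over classes c of P(Z = c) * P(x | Z = c)."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, sds))

def integrate(f, lo, hi, steps=20000):
    """Trapezoid-rule approximation of the integral of f from lo to hi."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

# Illustrative (made-up) equal mixture of a narrow and a wide Gaussian.
weights = [0.5, 0.5]
means = [0.0, 2.0]
sds = [0.5, 1.5]

def density(x):
    return mixture_pdf(x, weights, means, sds)

area = integrate(density, -15.0, 17.0)
mix_mean = sum(w * m for w, m in zip(weights, means))

print("area under the mixture density:", area)
print("density one unit below the mean:", density(mix_mean - 1.0))
print("density one unit above the mean:", density(mix_mean + 1.0))
```

Each component on its own is a symmetric bell curve, but the two printed densities on either side of the mixture mean differ, which is the kind of asymmetry that no single Gaussian could capture.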