PRIOR DISTRIBUTIONS FOR CLUSTERING AND INFINITE MIXTURE MODELS
In Chapter 6, I introduced the Gaussian mixture model and suggested that choosing the model that minimizes the AIC is a good way to choose the optimal number of clusters. However, if you've managed to follow this discussion of regularization and prior distributions, you are now probably wondering if there is an elegant formulation of model-based clustering that (analogous to regularization in linear regression) will choose the number of clusters automatically.
One situation where prior distributions on the number of clusters, K, have been explored is in the context of Bayesian approaches to population structure. You probably remember from my discussion of the problem in Chapter 4 that assigning individuals to subpopulations is a beautiful example where the Bayesian idea of estimating posterior distributions matches our intuition about admixed individuals, and has led to the STRUCTURE program. STRUCTURE models genotype data using a finite mixture model (one component for each subpopulation) and treats the subpopulation assignment for each individual as a hidden variable. By fitting the mixture model, STRUCTURE can infer both the population structure and the ancestry of each individual. However, a major issue with inferring population structure from genotype data is that we don't usually know the number of subpopulations for the model: We have to choose K. In the STRUCTURE manual (Hubisz et al. 2009), the authors suggest using the MAP value of K. They suggest computing the probability of the data P(data|K) for several different values (say K, L, M), and then forming the ratio

P(data|K) / [P(data|K) + P(data|L) + P(data|M)]
which corresponds to a uniform prior over K. You're probably worried about this because the different models have different numbers of clusters, and therefore different numbers of parameters, so the probability of the data will always increase as K increases. And this would be true if STRUCTURE were computing the likelihood, but, because STRUCTURE is a bona fide Bayesian method (as discussed in Chapter 4), it (hopefully) has managed to integrate out the parameters, so that the probability of the data for these different models can be fairly compared. Although this might work okay in practice for small K, it's akin to using the AIC to regularize linear regression: To use this formula, we have to actually compute the probability of the data for many different choices of K. It would be more elegant if we could formulate the model to learn the number of clusters along with the rest of the model, just as the regularized regression learned the number of b parameters that we should allow to be nonzero.
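The ratio above can be evaluated directly once the (log) probability of the data has been computed for each candidate value of K. Here is a minimal sketch in Python (the function name and the example log probabilities are hypothetical, not STRUCTURE output); it works in log space with the log-sum-exp trick, because marginal likelihoods for real datasets are far too small to represent as ordinary probabilities.

```python
import numpy as np

def posterior_over_K(log_prob_data):
    """Given log P(data | K) for several candidate values of K,
    return the posterior probability of each K under a uniform prior.

    log_prob_data: dict mapping K -> log marginal likelihood
    (hypothetical numbers, e.g. from separate runs of the model)."""
    Ks = sorted(log_prob_data)
    logs = np.array([log_prob_data[K] for K in Ks], dtype=float)
    # Subtract the max before exponentiating (log-sum-exp trick)
    # so that tiny probabilities don't underflow to zero.
    logs -= logs.max()
    probs = np.exp(logs)
    probs /= probs.sum()
    return dict(zip(Ks, probs))

# Hypothetical log marginal likelihoods for K = 2, 3, 4:
print(posterior_over_K({2: -1204.5, 3: -1190.2, 4: -1192.8}))
```

Note that with a uniform prior, the MAP value of K is simply the one with the largest marginal likelihood; the ratio just rescales the comparison onto a 0-to-1 scale.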
Indeed, there is just such a formulation, and it has been applied to the problem of choosing the number of hidden subpopulations in the Bayesian formulation of population structure inference (Pella and Masuda 2006, Huelsenbeck and Andolfatto 2007). However, before going further, I think it's important to consider that in linear regression, the idea of the prior distribution was to bias the parameter estimates to stay near 0. The number of parameters doesn't change; it's simply a matter of setting some of them to zero. In the case of clustering, the problem is much more complicated: If we do not prestate the number of clusters, there will be parameters in some models that do not exist in others (because those models have fewer clusters). Further, since we don't know the number of clusters, we have to consider the possibility that there might be infinitely many: The number of clusters is a natural number (1, 2, 3, ...). Since in principle the number of clusters (and the parameters) in the model could be infinite (although this turns out to be impossible for any finite dataset), these models are sometimes called "infinite mixture models." The other name for these models is nonparametric Bayesian mixture models, which I believe is even more confusing because of the way nonparametric is used in other areas of statistics (as we used it in Chapter 2). So we will stick with the name infinite mixture models.
To understand the infinite mixture model, let's start with the likelihood for the (finite) mixture model:

P(X | θ, π) = Π_{i=1..n} Σ_{c=1..K} π_c P(X_i | θ_c)
where c indexes the cluster number, from 1 to K, and I've separated out the parameters describing the distribution of the data in each cluster (θ_c) from the mixing parameters (π), or prior probabilities. Naively, we might try to write the MAP objective function for the mixture of Gaussians as

argmax over K, θ, π of { Σ_{i=1..n} log[ Σ_{c=1..K} π_c P(X_i | θ_c) ] + log P(K | α) + log P(θ | K, ν) }
where P(K|α) and P(θ|K, ν) are the prior distributions (or penalties) associated with the number of clusters and the other parameters. In these prior distributions, I've written additional hyperparameters α and ν to emphasize that these are not constant, but have shapes to encourage the structure we would like to find. In principle, we could then try to optimize both the parameters and K directly, and choose the hyperparameters by cross-validation. In practice, this turns out not to be feasible (as far as I know).
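To make the first term of that objective concrete, here is a minimal sketch of the mixture log-likelihood that would have to be evaluated for each candidate K. It uses one-dimensional Gaussian clusters for simplicity; the function name and the example numbers are my own illustration, not from any package.

```python
import numpy as np

def mixture_log_likelihood(X, pi, mu, sigma):
    """log P(X | theta, pi) = sum_i log sum_c pi_c N(X_i; mu_c, sigma_c)
    for a one-dimensional Gaussian mixture (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)[:, None]   # shape (n, 1)
    # Gaussian density of every datapoint under every cluster, shape (n, K)
    dens = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    # Mixture density for each point is the pi-weighted sum over clusters
    return np.sum(np.log(dens @ pi))

# Two well-separated groups of points and a matching two-cluster model:
X = [0.1, -0.3, 5.2, 4.8]
print(mixture_log_likelihood(X, pi=np.array([0.5, 0.5]),
                             mu=np.array([0.0, 5.0]),
                             sigma=np.array([1.0, 1.0])))
```

For a fixed K, this is exactly the quantity the EM algorithm from Chapter 6 climbs; the naive MAP objective would add the two log-prior penalty terms on top of it.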
However, it turns out that if the problem is written in terms of the (unobserved) cluster assignments for each datapoint (the Z_ic in our notation from Chapter 6), it is possible to derive a sampling strategy to obtain samples from the posterior distribution (Rasmussen 1999). Perhaps the most important ingredient is the prior probability that an observation belongs to one of the clusters already in the model versus starting a new cluster. If there are already a few clusters in the model, and each has a lot of datapoints assigned, we would like the prior distribution (or regularization) to tend to assign the new data to one of the clusters we already have. On the other hand, if this is one of the first data points we've observed, we'd like to give it a good chance of starting its own cluster. Thus, the prior distribution we'd like is a "rich-get-richer" process where clusters that already have a lot of data tend to get more—only if the data is very different should we allow the model to create a new cluster (and associated parameters).
This model structure can be encouraged using a so-called Dirichlet process prior. The simplest way to generate data from this process is the so-called Chinese restaurant process, where we imagine a restaurant with an infinite number of tables (the clusters), each with an infinite number of seats. As each new customer (datapoint) enters the restaurant, that person chooses a table (cluster) based on the number of other people (datapoints) already seated there or starts a new table (cluster). This process can be controlled by a single parameter, α, which is the hyperparameter that controls the growth in the number of clusters as the amount of data increases. As the number of tables in the restaurant approaches infinity, the probabilities converge to

P(X_i joins existing cluster c) = (Σ_{j<i} Z_jc) / (i − 1 + α)

P(X_i starts a new cluster) = α / (i − 1 + α)
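The Chinese restaurant process is easy to simulate directly. The sketch below (the function name and defaults are my own choices) seats n customers one at a time, each choosing an occupied table with probability proportional to its current count or a new table with probability proportional to α:

```python
import random

def chinese_restaurant_process(n, alpha, seed=0):
    """Simulate table (cluster) assignments for n customers (datapoints)
    with concentration parameter alpha. Returns one cluster label per point."""
    rng = random.Random(seed)
    counts = []          # counts[c] = number of customers seated at table c
    labels = []
    for i in range(n):   # customer i has i earlier customers already seated
        # P(existing table c) = counts[c] / (i + alpha)
        # P(new table)        = alpha     / (i + alpha)
        r = rng.uniform(0, i + alpha)
        cum = 0.0
        for c, m in enumerate(counts):
            cum += m
            if r < cum:
                counts[c] += 1       # rich get richer: join a crowded table
                labels.append(c)
                break
        else:
            counts.append(1)         # open a new table
            labels.append(len(counts) - 1)
    return labels

print(chinese_restaurant_process(10, alpha=1.0))
```

Running this with larger α produces more tables on average, which is exactly how the hyperparameter controls the growth in the number of clusters with the amount of data.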
Note that in the top formula, Σ_{j<i} Z_jc is just the number of datapoints already assigned to cluster c. You can see that these rules have the desired behavior: When there are few points assigned to a cluster, X_i is almost as likely to start a new cluster as to join an existing one. On the other hand, if there are many points already in a cluster, X_i is encouraged to join it. Although the mathematics of the infinite mixture model is beyond the scope of this book, the sampling strategy that results is relatively simple.
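To give a flavor of that sampling strategy, the sketch below combines the rich-get-richer prior with per-cluster likelihoods to form the probabilities used when resampling a single datapoint's cluster assignment. The helper and its arguments are hypothetical, assuming the predictive density of the point under each existing cluster (and under the prior alone, for a brand-new cluster) has already been computed:

```python
import numpy as np

def assignment_probabilities(counts, lik_existing, lik_new, alpha):
    """Probabilities for reassigning one datapoint X_i during sampling
    (X_i's own contribution already removed from `counts`):
      existing cluster c : counts[c] * P(X_i | data in cluster c)
      new cluster        : alpha     * P(X_i | prior alone)
    Returns the normalized probability vector, new-cluster option last."""
    counts = np.asarray(counts, dtype=float)
    lik_existing = np.asarray(lik_existing, dtype=float)
    # Prior (cluster size) times likelihood (fit to the cluster), plus
    # the alpha-weighted option of opening a new cluster.
    w = np.append(counts * lik_existing, alpha * lik_new)
    return w / w.sum()

# Two crowded clusters that both fit X_i poorly: the new-cluster
# option dominates despite the rich-get-richer prior.
print(assignment_probabilities([20, 15], [1e-6, 2e-6], 1e-3, alpha=1.0))
```

The sampler simply draws a new assignment from this vector for each datapoint in turn; a point joins a big cluster unless its data are so different that the new-cluster term wins, which is precisely the behavior the prior was designed to encourage.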