CHOOSING THE NUMBER OF CLUSTERS FOR K-MEANS
One important issue with K-means is that, so far, we have assumed that K was chosen beforehand. In practice, this is usually not the case: with thousands of datapoints, we might have some idea of the number of clusters (perhaps between 5 and 50), but we don't know the number exactly. This is a general problem with clustering: the more clusters there are, the easier it is to find a cluster that lies very close to each datapoint. The objective function for K-means will therefore always decrease as K increases.
Although there is no totally satisfying solution to this problem, the measures for quantitatively summarizing clusters that I introduced earlier can be used when trying to choose the number of clusters. In practice, however, because K-means gives a different result for each initialization, the average silhouette will differ from run to run of the clustering algorithm. Furthermore, the clustering that produces the maximum average silhouette might not correspond to the biologically intuitive clustering result. For example, for the CD4 and CD8 expression data, the silhouette as a function of cluster number is shown in Figure 5.8. The average silhouette reaches a maximum at K = 3, which misses the small CD4+ CD8+ cluster. Since the missing cluster is small, missing it has little effect on the silhouette of the whole dataset; this is the same reason that it's hard for K-means to find this cluster in the first place.
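To make the procedure concrete, here is a minimal sketch (not the book's code) of choosing K by averaging the silhouette over several random initializations. The data, the seeding strategy, and all names here are illustrative stand-ins, not the CD4/CD8 measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with k-means++-style seeding."""
    # First center: a random datapoint; later centers: datapoints chosen
    # with probability proportional to squared distance from the nearest
    # center already picked.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None] - np.array(centers)) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def average_silhouette(X, labels):
    """Mean of s(i) = (b_i - a_i) / max(a_i, b_i) over all points."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # pairwise distances
    n = len(X)
    s = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        if not own.any():          # singleton cluster: s(i) = 0 by convention
            s.append(0.0)
            continue
        a = D[i, own].mean()                      # mean within-cluster distance
        b = min(D[i, labels == c].mean()          # nearest other cluster
                for c in set(labels) - {labels[i]})
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

# Three well-separated synthetic "populations" (stand-ins for real data).
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0.0, 4.0, 8.0)])

scores = {}
for k in range(2, 6):
    # Average the silhouette over several initializations, since each
    # run of K-means can converge to a different clustering.
    scores[k] = np.mean([average_silhouette(X, kmeans(X, k)) for _ in range(5)])

best_k = max(scores, key=scores.get)
print(best_k)  # for blobs this well separated, typically 3
```

Note that, as the text warns, a small cluster contributes few terms to the average, so this criterion can still miss it even when the large clusters are recovered perfectly.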
FIGURE 5.8 Choosing the number of clusters in the CD4 CD8 example using the average silhouette. For K-means (gray symbols), 100 initializations were averaged for each cluster number, and error bars represent the standard deviation of the average silhouette across initializations. The optimal number of clusters is 3. For PAM (K-medoids, unfilled symbols), each initialization converged to the same exemplars, so there was no variation in the average silhouette. The optimal number of clusters is 4.

Unfortunately (as far as I know), there is no simple statistical test that will tell you that you have the correct number of clusters. In Chapter 6, we'll see how to do clustering using probabilistic methods where the objective function is a likelihood. In that case, there are more principled ways of trading off model fit against the number of parameters. In particular, we'll see that the AIC is an effective way to choose the number of clusters. Choosing the "right" number of clusters is such a difficult and pervasive issue that this alone is sometimes enough to persuade researchers to stick with hierarchical clustering, where the choice is, in effect, hidden. Another approach is to give up on the idea that there is a single true number of clusters and use Bayesian statistics to estimate the distribution of cluster numbers (we will return to this idea in Chapter 9). However, whichever solution you choose, this issue will have to be addressed in any cluster analysis.
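As a preview of the AIC idea (using the standard definition of the criterion, not any notation specific to Chapter 6), one fits the probabilistic clustering model for each candidate number of clusters K and chooses the K that minimizes

```latex
\mathrm{AIC}(K) = -2 \ln \hat{L}_K + 2\, p_K
```

where $\hat{L}_K$ is the maximized likelihood of the K-cluster model and $p_K$ is its number of free parameters. The $2 p_K$ penalty counteracts the automatic improvement in likelihood that comes simply from adding more clusters, which is the probabilistic analogue of the ever-decreasing K-means objective described above.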