K-means clustering is really the first technique that we’ll see in this book where the computer will actually “learn” something. In some sense, it is the most basic “machine learning method.” Note that K-means is sometimes called “c-means” in the statistical literature, but in the machine learning world, it’s definitely K-means. The idea of K-means is to take seriously the idea that there are underlying groups in the data, and that each of these clusters (exactly K of these clusters) can be summarized by its mean. So K-means aims to assign each datapoint to one of K clusters, based on the similarity of the data to the mean of the cluster.
In particular, K-means says: Assign each datapoint to the cluster to which it is most similar. Formally, if X1, X2, ..., Xn is a series of data vectors, we can write this as the K-means cluster assignment rule:

$$Z_{ik} = \begin{cases} 1 & \text{if } d(m_k, X_i) < d(m_j, X_i) \text{ for all } j \neq k \\ 0 & \text{otherwise} \end{cases}$$
Here, I have used some special notation. First, I have used d(m, X) to represent the distance between the mean for the cluster and the observation vector, X, but in traditional K-means clustering, the distance is always taken to be the Euclidean distance. K-means with other distances has other names (e.g., spherical K-means refers to K-means with the cosine distance). Most important, however, is the indicator variable Z_ik, which indicates which cluster observation i is assigned to. We define this variable to take on the values 1 or 0 (think of these as true or false). The rule says that Z_ik is 1 if the mean, m_k, for cluster k, is closer to X_i than any of the other means. Make sure you understand what this "indicator variable" Z means: it is a key notational trick that will allow us to write some very fancy formulas.
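The assignment rule is easy to express in code. Here is a minimal NumPy sketch (the book itself contains no code, and the function name `assign_clusters` is my own); it builds the indicator matrix Z by finding, for each datapoint, the nearest mean under the Euclidean distance:

```python
import numpy as np

def assign_clusters(X, means):
    """K-means cluster assignment rule: Z[i, k] = 1 if mean k is the
    closest mean (Euclidean distance) to datapoint X[i], else 0."""
    # distances[i, k] = Euclidean distance between X[i] and means[k]
    distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    Z = np.zeros(distances.shape, dtype=int)
    # set the indicator to 1 only in the column of the nearest mean
    Z[np.arange(len(X)), np.argmin(distances, axis=1)] = 1
    return Z
```

Each row of Z contains exactly one 1, reflecting the fact that every datapoint is assigned to exactly one of the K clusters.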
But what are these "mean" patterns of the clusters? They are nothing more than the average patterns of all the data points assigned to them. We can write this as the K-means mean estimate:

$$m_k = \frac{\sum_{i=1}^{n} Z_{ik} X_i}{\sum_{i=1}^{n} Z_{ik}}$$
Notice how cleverly I used the indicator variable Z in this formula. Although I said that this is nothing more than the average, you can see that the formula is more complicated than a normal average. First, I used the indicator variable to include (or exclude) the observations X_i that are (or are not) assigned to cluster k, by multiplying them by the indicator. Because the indicator is 1 when the observation is assigned to the cluster, those values are included in the sum, and because the indicator is 0 for the observations that are not assigned to the cluster, they are multiplied by 0 and therefore do not contribute to the sum. Second, instead of dividing by the total number of observations, as in a normal average, I divided by the sum of the indicator variable: this adds 1 for each of the datapoints in the cluster, and 0 for any data that are not assigned to the cluster. Using this indicator variable, I can write a single formula to take the averages of all the clusters without having to define variables for the numbers of genes in each cluster, etc. It might all seem a bit strange, but I hope it will become clear that indicator variables are actually incredibly powerful mathematical tricks.
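The mean estimate translates directly into code as well. A short NumPy sketch (my own illustration, not from the book), where the matrix product Z.T @ X performs exactly the indicator-weighted sum described above:

```python
import numpy as np

def update_means(X, Z):
    """K-means mean estimate: m_k = sum_i Z[i,k] * X[i] / sum_i Z[i,k].
    The indicator matrix Z does all the bookkeeping: multiplying by
    Z[i, k] keeps only the points assigned to cluster k.
    (This sketch ignores the empty-cluster case, where the count is 0.)"""
    counts = Z.sum(axis=0)              # number of datapoints in each cluster
    return (Z.T @ X) / counts[:, None]  # indicator-weighted sums / counts
```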
OK, enough about the indicator variable already! We're supposed to be doing clustering. But how can this K-means idea work? If we don't know the means, m, to start with, how do we ever assign datapoints to clusters (according to the cluster assignment rule)? And if we can't assign datapoints to clusters, then how do we ever calculate the means (using the means estimate)?
In fact, there are several answers to these questions, but the most common answer is: Start with random assignments of datapoints to clusters. Although this will produce a very bad set of means, we can then alternate between reassigning all of the datapoints to clusters and then recalculating the means for each cluster (Figure 5.6). This type of procedure is known as an “iterative” procedure, and nearly all of the machine learning methods
FIGURE 5.6 How the K-means algorithm "learns." (a) Illustrates the initial (wrong) assignment (indicated by dashed gray lines) of observations to randomly chosen cluster means, m. (b) Shows that when the means are recalculated based on the observations, they will be pulled towards the majority of their datapoints. This way, datapoints that are far away from the cluster will be "left behind." (c) Shows the reassignment (dashed gray lines) of observations to the means in the next iteration. (d) Finally, once the "right" datapoints are assigned, the means end up very close to the centers of the clusters.
we’ll consider will lead to these types of iterative procedures. The main idea is that the computer starts off knowing nothing: it takes a random guess at the parameters (in this case, the means of the clusters). But then, if the rules for updating the parameters are sensible, the estimates of the parameters will get better and better with each iteration; the machine will learn.
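The alternation between the assignment rule and the mean estimate can be sketched as a single loop. Again, this NumPy implementation is my own illustration (the function name `kmeans` and the choice to initialize the means at randomly chosen datapoints are assumptions, not the book's):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """A minimal sketch of the iterative K-means procedure: start from
    a random guess and alternate assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # random initial guess: K datapoints chosen as the starting means
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # update step: each mean becomes the average of its points
        new_means = np.array([X[labels == k].mean(axis=0)
                              if np.any(labels == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):  # converged: assignments stable
            break
        means = new_means
    return means, labels
```

On two well-separated groups of points, a few iterations are enough for the means to settle at the centers of the groups.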
If we consider the objective function of K-means to be the sum of the distances between the datapoints and the mean of the cluster they’ve been assigned to, we can formalize the optimization problem as trying to minimize the following function of the means, m:

$$f(m) = \sum_{c=1}^{K} \sum_{i=1}^{n} Z_{ic}\, d(m_c, X_i)$$
In this formula, we have to sum over all K clusters, so I’ve introduced c to index them. Notice again the clever use of the indicator variable Z to multiply all of the distances between the data and the clusters that they are not assigned to by 0, so they don’t contribute to the objective function.
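The objective function is a one-liner given the indicator matrix. A NumPy sketch (my own; note that many presentations use the squared Euclidean distance here, but this follows the text's "sum of the distances"):

```python
import numpy as np

def kmeans_objective(X, means, Z):
    """Objective: sum over clusters c and datapoints i of
    Z[i, c] * d(m_c, X_i). Multiplying by the indicator zeros out
    the distances to clusters a point is not assigned to."""
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return np.sum(Z * dists)
```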
It’s important to think about how complicated this function really is: each of the K means is actually a vector of length equal to the dimensionality of the data. In the case of the ImmGen data, this is a vector of length 200. Since we have K of these means, we have K × 200 parameters to find. Because the parameters are actually determined by the assignments of each datapoint to the clusters, there are really on the order of K^n possible combinations, where n is the number of datapoints. This is an incredibly large number even for a small number of clusters, because there are typically thousands of datapoints.
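To get a feel for how hopeless exhaustive search would be, consider a modest example (the numbers K = 4 and n = 1000 are my own illustration):

```python
# With K = 4 clusters and n = 1000 datapoints, the number of possible
# assignments is on the order of K**n.
K, n = 4, 1000
n_assignments = K ** n
print(len(str(n_assignments)))  # a number with 603 digits
```

No computer can enumerate a search space of that size, which is why we need an iterative procedure that improves a single guess instead.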
The good news is that the iterative procedure (algorithm) described is guaranteed to decrease the objective function in each step: decreasing the distance between the cluster means and the datapoints is a good thing, because it means the clusters are getting tighter. However, the algorithm is not guaranteed to find the global minimum of the objective function. All that’s guaranteed is that in each step, the clusters will get better. We have no way to know that we are getting the best possible clusters. In practice, the way we get around this is by trying many random starting guesses and hoping that we’ve covered the whole space of the objective function.
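The random-restart strategy is simple to sketch: run the whole iterative procedure from several random initializations and keep the run with the lowest objective. The following NumPy code is my own illustration of that idea (function names and the number of restarts are assumptions):

```python
import numpy as np

def kmeans_restarts(X, K, n_restarts=10, n_iter=100):
    """Run K-means from many random initializations and keep the run
    with the lowest objective, since any single run may get stuck in
    a local minimum of the objective function."""
    best_obj, best = np.inf, None
    for seed in range(n_restarts):
        rng = np.random.default_rng(seed)
        # random initial guess: K datapoints chosen as starting means
        means = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            labels = np.argmin(d, axis=1)
            new = np.array([X[labels == k].mean(axis=0)
                            if np.any(labels == k) else means[k]
                            for k in range(K)])
            if np.allclose(new, means):
                break
            means = new
        # objective: summed distance of each point to its assigned mean
        obj = np.linalg.norm(X - means[labels], axis=1).sum()
        if obj < best_obj:
            best_obj, best = obj, (means, labels)
    return best
```

This is exactly what standard implementations do; for example, scikit-learn's KMeans exposes the number of restarts as its `n_init` parameter.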
To illustrate the use of the K-means algorithm on a real (although not typical) dataset, consider the expression levels of CD8 and CD4 in the immune cells in the ImmGen data. As is clear from Figure 5.7, there are four types of cells, CD8-CD4-, CD8-CD4+, CD8+CD4- and CD8+CD4+. However, this is a difficult clustering problem, first, because
FIGURE 5.7 K-means clustering of cell-types in the ImmGen data. Each black “+” indicates one cell type, and gray squares identify the means identified by the K-means algorithm. (a) Shows convergence to the “right” clusters with K = 4. (b) Shows the tendency of the K-means algorithm to split up the large cluster with low expression for both genes. (c) Shows that adding an extra cluster can help, but that the algorithm still splits up the big cluster for some initializations (d).
the sizes of the clusters are not equal. This means that the contribution of the large (CD8-CD4-) cluster to the objective function is much larger than that of the very small cluster (CD8+CD4+). In addition to the unequal sizes of the clusters, there are some cells that seem to be caught in between the two CD4- clusters. Because of this, for some random choices of starting cluster means, K-means chooses to split up the large cluster. This has the effect of incurring a large penalty for those datapoints (notice that they are consistently added to the CD8+CD4- cluster, leading to a slight shift in the mean), but the algorithm is willing to do this because it is “distracted” by reducing the small penalties for the large number of datapoints in the large cluster. One possible solution is to simply add more clusters: if we allow K-means to have an extra mean or two, then the algorithm can still split up the large (--) cluster and find the smaller (++) cluster as well. This is a reasonable strategy if you really want to use K-means to find the small clusters. However, it is still not guaranteed to work, as there are still many more datapoints in the big (--) cluster, and its shape is not well-matched to the symmetrical clusters assumed by K-means.