Once the distance metric is chosen, there are still many different possible clustering strategies that can be used. I’ll first describe the one most commonly used in bioinformatics and genomics applications—agglomerative hierarchical clustering. The clustering procedure starts by assigning each datapoint to its own cluster. Then, the clusters are searched for the pair with the smallest distance. These are then removed from the list of clusters and merged into a new cluster containing two datapoints, and the total number of clusters decreases by 1. The process is repeated until the entire dataset has been merged into one giant cluster. Figure 5.1c illustrates the first three steps of hierarchical clustering in the simple example. Figure 5.1d illustrates a completed hierarchical tree from this data.
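The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming SciPy and NumPy are available; the six datapoints are invented for the example and are not from Figure 5.1.

```python
# A minimal sketch of agglomerative hierarchical clustering with SciPy.
# (Assumes scipy and numpy are installed; the datapoints are invented.)
import numpy as np
from scipy.cluster.hierarchy import linkage

# Six datapoints in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])

# Each row of Z records one merge: the indices of the two clusters
# joined, the distance between them, and the size of the new cluster.
# With n points there are exactly n - 1 merges before everything has
# been joined into one giant cluster.
Z = linkage(X, method="average", metric="euclidean")
print(Z.shape)  # (5, 4)
```

The last row of `Z` is the final merge, whose cluster size equals the full dataset.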
There is one technical subtlety here: how do we define the distance between clusters that contain more than one datapoint? When each datapoint is assigned to its own cluster, the distance between two clusters is simply the distance between the two datapoints (using one of the distances defined earlier). However, once we have begun merging clusters, in order to decide which pair is closest in the next round, we have to be able to calculate the distance between clusters containing arbitrary, different numbers of observations. In practice, there are a few different ways to do this, and they differ both in how the resulting clusters are interpreted and in how quickly they can be computed. Perhaps the simplest is so-called “single linkage,” where the distance between two clusters is defined as the smallest distance between any pair of datapoints, one taken from each cluster. However, you might argue that the distance between clusters shouldn’t reflect only the two closest observations: those might not represent the overall pattern in the cluster very well. A more popular (but somewhat harder to calculate) alternative is so-called “average linkage,” where the distance between two clusters is defined as the average distance over all pairs of datapoints, one taken from each cluster; a third option, “complete linkage,” uses the largest pairwise distance instead. These distances are illustrated in Figure 5.2. When average linkage is used for agglomerative hierarchical clustering, the procedure is referred to as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), and this is by far the most common way clustering is done. As long as the clustering proceeds by joining individual observations into groups and then merging those groups, the process is referred to as “agglomerative.” If, on the other hand, the clustering starts with the entire pool of data and tries to cut it into successively smaller groups, it is known as “divisive.”
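The linkage definitions above are easy to compute directly from the matrix of pairwise distances. Here is a small sketch, assuming SciPy is available; the two clusters are invented two-point examples chosen so that the three linkages give visibly different answers.

```python
# Single, complete, and average linkage distances between two small
# clusters, computed from the pairwise distance matrix.
# (Assumes scipy and numpy are installed; the clusters are invented.)
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances, one point from each cluster.
D = cdist(cluster_a, cluster_b)

single = D.min()     # closest pair of points, one from each cluster
complete = D.max()   # furthest pair of points, one from each cluster
average = D.mean()   # mean over all pairs (the UPGMA distance)
print(single, complete, average)  # 3.0 6.0 4.5
```

Note how single linkage sees only the two nearest points, while average linkage summarizes every pairwise distance between the clusters.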
At first, the whole agglomerative clustering procedure might seem a bit strange: the goal was to group the data into clusters, but we ended up merging all the data into one giant cluster. However, as we do the merging, we keep track of the order in which the points are merged and the distance separating the two clusters or datapoints that were merged. This ordering defines a “hierarchy” that relates every observation in the sample to every other. We can then define “groups” by choosing a cutoff on how large a distance points within the same cluster may have. In complex datasets, we often don’t know how many clusters we’re looking for, and we don’t know what this distance cutoff should be (or even whether a single cutoff applies to the whole hierarchy). But hierarchical clustering can still be used to define clusters in the data, even in an ad hoc way, by searching the hierarchy for groups that “look good” according to some user-defined criteria.
FIGURE 5.2 Distances between a cluster and a datapoint. Three observations (X2, X3, X6) have been merged into a cluster (iii), and we now calculate the distance of a datapoint (X5) from the cluster to decide whether it should be added or not. In single linkage (a), we use the distance to the closest point; in complete linkage (c), we use the distance to the furthest point; while for average linkage (b), we calculate the average of the pairwise distances between the points in the two clusters. I have indicated the average using m(), and used i to index the points in cluster iii. Gray arrows indicate the vectors between which we are calculating distances. The Euclidean distance is indicated as a black line, while the angle between vectors is indicated by φ.
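Cutting the hierarchy at a distance cutoff can be sketched as follows; this is a hedged example assuming SciPy, with a tiny invented dataset in which three groups are obvious by eye.

```python
# Turning a hierarchy into flat clusters with a distance cutoff.
# (Assumes scipy and numpy are installed; the datapoints are invented.)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1],    # one tight pair
              [5.0, 5.0], [5.1, 5.1],    # another tight pair
              [9.0, 9.0]])               # a lone point
Z = linkage(X, method="average")

# Points end up in the same flat cluster only if they were merged at
# a distance below the cutoff t; here t = 2.0 yields three clusters.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```

Varying `t` slides the cutoff up and down the dendrogram: a very large `t` returns one giant cluster, while a very small `t` leaves every point in its own cluster.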
Hierarchies are very natural structures by which to group biological data because they capture the tree-like relationships that are very common. For example, the ribosome might appear as a cluster of genes, but this cluster might have two clusters within it that represent the 60S particle and the 40S particle. Similarly, T cells might appear as a cluster of cells, but this cluster could be made up of several subtypes of T cells, such as the cytotoxic T cells, natural killer (NK) T cells, and helper T cells.
In the following example (Figure 5.3), I have hierarchically clustered both the genes and the cell types: for the genes, this means finding groups in a 214-dimensional space (each gene has a measurement in each cell type), while for the cell types, it means finding groups in an 8697-dimensional space (each cell type has a measurement for each gene). The trees (or dendrograms) created for the genes and cell types are represented beside the data. In general, if possible, it’s always preferable to represent a clustering result so that the primary data are still visible. Often, visualizing the primary data makes it immediately apparent when clusters (or other structures in the data) are driven by technical problems: missing data, outliers, etc.
FIGURE 5.3 Hierarchical clustering of the ImmGen data using GeneCluster 3.0. Hierarchical clustering with average linkage for 8697 genes (that had the biggest relative expression differences) in 214 cell types. Both genes and cell types were weighted using default parameters. The data matrix is displayed using a heatmap where white corresponds to the highest relative expression level, gray corresponds to average expression, and black corresponds to the lowest relative expression (gene expression was log-transformed and mean log-expression level was subtracted from each gene). Left shows a representation of the entire dataset. Right top panel shows a clear “cluster” where the distances (indicated by the depth and height of the dendrogram) are small between genes and cells. Right bottom panel shows that this corresponds to a group of immunoglobulin genes that are highly expressed in B cells. The clustering results are visualized using Java TreeView.
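Two-way clustering of an expression matrix, with the rows and columns reordered by the dendrogram leaf order as heatmap viewers such as Java TreeView display them, can be sketched as follows. This assumes SciPy and NumPy; the tiny random matrix stands in for the real ImmGen data, and the mean-centering mirrors the preprocessing described in the figure legend.

```python
# Clustering both rows ("genes") and columns ("cell types") of a
# matrix, then reordering it by the dendrogram leaf orders.
# (Assumes scipy and numpy are installed; the matrix is invented.)
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 5))             # 8 "genes" x 5 "cell types"
M = M - M.mean(axis=1, keepdims=True)   # mean-center each gene

# Cluster rows in column-space and columns in row-space, as in the
# ImmGen example (genes in 214 dimensions, cell types in 8697).
row_order = leaves_list(linkage(M, method="average"))
col_order = leaves_list(linkage(M.T, method="average"))

# Reorder the primary data so similar rows/columns sit side by side.
clustered = M[row_order][:, col_order]
print(clustered.shape)  # (8, 5)
```

Keeping `clustered` as the displayed matrix, rather than only the trees, is what lets problems like outliers or missing data show up directly in the heatmap.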