IS THE CLUSTERING RIGHT?
UPGMA and other hierarchical clustering methods are ubiquitously used in molecular biology for clustering genome-scale datasets of many kinds: gene expression, genetic interactions, gene presence and absence, etc.
Perhaps the biggest conceptual problem molecular biologists encounter with agglomerative clustering is that it doesn’t tell you whether you have the “right” or “best” or even “good” clusters. Let’s start with the question of how to decide if you have good clusters. For example, let’s say you don’t know whether to choose average linkage or single linkage clustering—so you try both. Which one worked better?
Ideally, you would be able to decide this based on some external information. For example, if you knew for a few of the datapoints that they belonged in the same or different groups, you could then ask, for each set of clustering parameters, whether these datapoints were clustered correctly. In molecular biology, this is very often possible: if one is clustering genes based on some genomic measurements, a sensible way to evaluate the quality of the clustering is to perform gene set enrichment analysis and choose the clusters that give (on average, say) the most significant association with other data.
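To make the idea of scoring a clustering against external annotations concrete, here is a minimal sketch. It assumes a one-sided hypergeometric test as the enrichment statistic and the mean of -log10(p) across clusters as the "on average" summary; both are illustrative choices, not the only reasonable ones. The function names and the toy gene indices are hypothetical, and no correction for multiple testing is attempted.

```python
from math import comb, log10

def hypergeom_enrichment_p(n_genes, n_annotated, cluster_size, n_overlap):
    """P(overlap >= n_overlap) when cluster_size genes are drawn without
    replacement from n_genes, of which n_annotated carry the annotation."""
    upper = min(n_annotated, cluster_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, cluster_size - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_genes, cluster_size)

def mean_enrichment_score(clusters, annotated, n_genes):
    """Average -log10(p) over clusters; larger means stronger association."""
    scores = []
    for members in clusters:
        overlap = len(set(members) & annotated)
        p = hypergeom_enrichment_p(n_genes, len(annotated), len(members), overlap)
        scores.append(-log10(p))
    return sum(scores) / len(scores)

# Toy example (made-up data): 12 genes, of which genes 0-3 share an annotation.
annotated = {0, 1, 2, 3}
clustering_a = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # annotation kept together
clustering_b = [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 10, 11]]  # annotation split up
score_a = mean_enrichment_score(clustering_a, annotated, n_genes=12)
score_b = mean_enrichment_score(clustering_b, annotated, n_genes=12)
print(score_a > score_b)
```

In this toy case, the clustering that keeps the annotated genes together gets the higher score, which is the sense in which one set of clustering parameters "worked better" than another.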
However, what about a situation where you don’t know anything about your data in advance? In fact, there are sensible metrics that have been proposed over the years for summarizing how well a dataset is organized into clusters, even without any external information. I will discuss just one example here, known as the silhouette (Rousseeuw 1987). The silhouette compares how close a datapoint is to the cluster that it has been assigned to, relative to how close it is to the closest other cluster (that it has not been assigned to). Figure 5.5 illustrates the idea. The average silhouette over a whole dataset for a given clustering method measures how good the clustering is overall.
FIGURE 5.5 The silhouette for a single data point. In the left panel, two clusters are indicated (i and iii). The distances from a single data point (X3) to the datapoints in the cluster it is assigned to are indicated by black lines, while the distances to the datapoints in the nearest cluster are indicated by gray lines. The averages of these distances (a3 and b3, respectively) are shown on the right. The silhouette is the ratio of the difference of these averages to their maximum, that is, (b3 - a3)/max(a3, b3).
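The silhouette calculation can be sketched in a few lines. The version below assumes Euclidean distance and sets the silhouette of a datapoint that is alone in its cluster to 0, a common convention; the function name and the four toy points are hypothetical.

```python
from math import dist

def silhouette_values(points, labels):
    """Return one silhouette value per datapoint (needs at least two clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_distance(i, members):
        return sum(dist(points[i], points[j]) for j in members) / len(members)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # convention for a cluster with a single member
            continue
        a = mean_distance(i, own)  # average distance to the assigned cluster
        b = min(mean_distance(i, members)  # average distance to the nearest other cluster
                for other, members in clusters.items() if other != label)
        largest = max(a, b)
        values.append((b - a) / largest if largest > 0 else 0.0)
    return values

# Toy example (made-up data): two tight pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
sensible = silhouette_values(points, [0, 0, 1, 1])
scrambled = silhouette_values(points, [0, 1, 0, 1])
mean_sensible = sum(sensible) / len(sensible)
mean_scrambled = sum(scrambled) / len(scrambled)
print(mean_sensible, mean_scrambled)
```

The grouping that respects the two pairs gives a mean silhouette near 1, while the scrambled grouping gives a negative mean, because each point is then closer on average to the other cluster than to its own.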
Possibly more important (and more difficult) than asking whether the clustering is good is asking whether a datapoint belongs to one cluster (as opposed to another), or whether there are two groups in the data or just one. Clustering is meant to be exploratory data analysis and therefore doesn’t really have a strong framework for hypothesis testing. However, these questions can become very important if you are using clustering to discover biological structure in the data. For example, if you are clustering tumor samples based on gene expression, you might have grouped the samples into two clusters, and then you want to know whether there really is statistical evidence for the two subtypes. The distribution of the silhouette for the datapoints in each cluster can help decide if the cluster is really justified. For example, if the average silhouette of the points in a cluster is close to zero, this means that the data in the cluster are just as close to the neighboring cluster as to their own. This indicates that there wasn’t really a need for two separate clusters.
It’s also possible to test whether the pattern you found in the data using clustering is robustly supported by bootstrapping. The idea of bootstrapping is to randomly resample from the dataset (as if it were the pool that observations were originally drawn from) and then rerun the clustering. Typically, bootstrapping is done by leaving out some of the dimensions or some of the datapoints. After this has been done thousands of times, we can summarize the confidence in specific aspects of the clustering results by calculating the fraction of bootstrapped samples that supported the clustering we observed. For example, if 99% of the samples showed the same two clusters in the data as the original analysis, we can conclude that our clustering result is unlikely to depend on random properties of the input dataset, but probably reflects real structure in the data.
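The bootstrap procedure might be sketched as follows. This is only an illustration under stated assumptions: the dimensions (for example, genes) are resampled with replacement, so that some are left out of each replicate, while the datapoints (for example, tumor samples) stay fixed so that partitions can be compared directly. The clustering is a deliberately naive average linkage implementation, and the function names, the toy data, and the choice of 200 replicates are all hypothetical.

```python
import random
from math import dist

def average_linkage(points, n_clusters):
    """Naive agglomerative clustering with average linkage.
    Returns the partition as a set of frozensets of datapoint indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                total = sum(dist(points[i], points[j])
                            for i in clusters[x] for j in clusters[y])
                d = total / (len(clusters[x]) * len(clusters[y]))
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return {frozenset(c) for c in clusters}

def bootstrap_support(points, n_clusters=2, n_boot=200, seed=0):
    """Fraction of bootstrap replicates that reproduce the observed partition."""
    rng = random.Random(seed)
    n_dims = len(points[0])
    observed = average_linkage(points, n_clusters)
    hits = 0
    for _ in range(n_boot):
        # Resample dimensions with replacement; datapoints are unchanged.
        dims = [rng.randrange(n_dims) for _ in range(n_dims)]
        resampled = [[p[d] for d in dims] for p in points]
        if average_linkage(resampled, n_clusters) == observed:
            hits += 1
    return hits / n_boot

# Toy example (made-up data): 6 samples x 8 dimensions, two well-separated groups.
data_rng = random.Random(1)
group_a = [[data_rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
group_b = [[10 + data_rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
samples = group_a + group_b
support = bootstrap_support(samples)
print(support)
```

Because every dimension separates the two toy groups, each replicate recovers the same two clusters and the support is 1.0. With real data, the interesting cases are those where the support falls well below 1.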
TECHNIQUES TO EVALUATE CLUSTERING RESULTS
- Comparison with external data: If you know the true clusters for some of your data, compare your clustering result with that.
- Even if you don’t know the true clusters, you can try to maximize the statistical association of your clusters with some external data (such as GO annotations if you are clustering genes).
- Statistical measures of purity or homogeneity of the cluster such as silhouette: Use a metric that lets you quantitatively compare the clustering results under different parameter settings.
- Bootstrap your results: Bootstrapping gives you confidence that the data uniformly support the clustering conclusions.