GRAPH-BASED CLUSTERING: "DISTANCES" VERSUS "INTERACTIONS" OR "CONNECTIONS"
The clustering methods and examples that we’ve discussed so far all assume that we can define a distance measure in a high-dimensional space that is quantitatively meaningful (e.g., 5.1 is less than 5.3, which is less than 5.5). However, often we’re in a situation where the quantitative value of distance we calculate doesn’t really reflect anything interesting. In that case, we might simply want to know if two datapoints are “close” or “far.” In this case, pairwise distances are reduced to simply 1 (for close) and 0 (for far). We can think of these as “interactions” or “connections” between datapoints rather than “distances,” but any distance can be turned into an interaction or connection simply by placing a threshold on the distance. Once we begin to think of datapoints as connected, it’s natural to think of the data forming a network or “graph” (the mathematical term for what biologists call a network). Figure 5.9 shows how pairwise distances can be turned into interactions, and then into a network or graph. Special clustering and data visualization methods (that I won’t cover in this book, but are implemented in popular bioinformatics softwares such as Cytoscape,
FIGURE 5.9 Distances can be converted into interactions or connections. (a) Data in a two-dimensional space. (b) The matrix of pairwise distances between the datapoints. By applying a threshold (in this case, 0.3) to define “close” or interacting datapoints, which convert the distance matrix to a matrix of 0 and 1, shown in (c). Distances along the diagonal are omitted by convention, but they would all be 1 because each datapoint is always similar to itself. (d) Data can then be visualized by drawing lines connecting interacting datapoints. Graph clustering can then be applied to find the densely connected subgraphs within the interaction network, which is indicated as a dashed circle.
Shannon et al. 2003) can be applied to these graphs to find the highly connected regions. An advantage of the representation as binary interactions is that only the connections matter (the 1s in the matrix), and we don’t need to bother keeping track of the datapoints that are not connected (the 0s in the matrix). This has an intuitive elegance: Clustering is about finding groups that are close together, so we shouldn’t waste time analyzing datapoints that are far apart—we know these won’t be in the same cluster. Indeed, because of this simplification, graph-based clustering algorithms can be much faster than UPGMA (which considers all possible datapoints no matter how far they are from those already in the cluster). Graph-based clustering algorithms can find thousands of clusters in datasets of millions of points.
A major bioinformatics motivation for this scenario comes from efforts to develop automated methods to detect orthologs (homolo- gus proteins related only through speciation events) based on protein sequence (such as orthoMCL, Li et al. 2003). Although distances between biological sequences can be calculated (as described), these distances depend strongly on, for example, the length of the protein (see exercises) or protein-specific rate of evolution. These are not necessarily informative about the orthology relationships. For homology detection, we simply want to know whether two proteins are more similar to each other than two random sequences (with the same residue composition). Thus, a cutoff is applied, and proteins are scored as similar or not. (In orthoMCL, the cutoff is actually on the P-value against the null hypothesis of random sequences, not the sequence score.) This leads to a pairwise distance that’s either 1 or 0, which means we can think of two proteins as connected if they are more similar than expected from random sequences. This defines a “homology graph” where proteins are connected in the graph if they share statistically significant sequence similarity. Orthologous groups of proteins tend to fall into clusters in this representation, where they are connected to most of the other orthologous proteins, and have few connections to other sequences. OrthoMCL uses a graph-clustering approach (called MCL, Enright et al. 2002) to identify more than 100,000 clusters in a graph containing more than 1 million protein sequences and 500 million pairwise connections: This is a 1 million x 1 million matrix with 500 million entries that are 1s.
Another application of graph-based clustering is to find clusters in protein interaction networks (Bader and Hogue 2003). In this case, usually the data lends itself to representation as binary interactions either for experimental reasons (the experiment was a yeast-two-hybrid) or conceptual simplicity (we want to think of proteins as simply interacting or not, and not consider multiple cell types, conditions, etc.). For example, I clustered and visualized some protein interaction data from UniProt (UniProt Consortium 2011) using Cytoscape (Figure 5.10). An example of a cluster that I identified is shown (Figure 5.10). This cluster contains seven highly connected proteins—they share many more interactions than a typical pair of proteins. A quick gene set enrichment analysis (see Chapter 2) shows that six of these proteins are members of the HOPS complex, which is involved in vacuolar vesicle fusion. (In this case, six of seven is highly significant even after Bonferroni correction, see Chapter 3.) Thus, this known protein complex could have been discovered automatically from the pair wise interaction data. As you can imagine, given that it’s possible to turn any high-dimensional data into pairwise interaction data, graph- based clustering is a popular and intuitive clustering approach that scales well to very large data.