CLUSTERING AS DIMENSIONALITY REDUCTION
Confronted with very high-dimensional data, such as gene expression measurements or whole-genome genotypes, one often wonders whether the data can somehow be simplified or projected into a simpler space. In machine learning, this problem is referred to as dimensionality reduction. Clustering can be thought of as a simple form of dimensionality reduction. For example, in the case of K-means, each observation can be an arbitrarily high-dimensional vector; the clustering procedure reduces it to membership in one of the K clusters. If arbitrarily high-dimensional data can be replaced by simply recording which cluster each observation belongs to, this is a great reduction in the complexity of the data. Of course, the issue is whether the K clusters really capture all of the interesting information in the data. In general, one would be very happy if they captured most of it.
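The compression described above can be made concrete with a minimal sketch. The function below is an illustrative implementation of standard (Lloyd's) K-means written for this example, not taken from any particular library; the simulated "expression profiles" are arbitrary random data.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, rng=None):
    """Minimal Lloyd's K-means: reduce each row of X to a single cluster label."""
    rng = np.random.default_rng(rng)
    # Initialize centroids as k randomly chosen observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every observation to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned observations
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# 100 simulated "expression profiles", each a 1,000-dimensional vector,
# reduced to a single label in {0, 1, 2}: a 1,000-fold reduction per observation
X = np.random.default_rng(0).standard_normal((100, 1000))
labels = kmeans_labels(X, k=3)
```

Each observation is thereby replaced by one integer, which is exactly the sense in which clustering acts as dimensionality reduction.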
1. The human genome has about 20,000 genes. Initial estimates of the number of genes in the human genome (before it was actually sequenced) were as high as 100,000. How many more pairwise distances would have been needed to do hierarchical clustering of human gene expression data if the initial estimates had turned out to be right?
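The counting behind this exercise involves only the number of unordered pairs among n items. A quick sketch of that combinatorics (the helper name is illustrative):

```python
def n_pairs(n):
    """Number of pairwise distances among n items: n choose 2."""
    return n * (n - 1) // 2

# Additional pairwise distances under the early 100,000-gene estimate
# compared with the roughly 20,000 genes actually found
extra = n_pairs(100_000) - n_pairs(20_000)
```

Because the pair count grows quadratically in n, a fivefold increase in the number of genes increases the number of distances by roughly a factor of twenty-five.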
FIGURE 5.10 Graph-based clustering and data visualization with Cytoscape. (a) Representation of protein interaction data as a network or “graph.” Each node (gray ovals) in the network represents a protein, and each edge (thin lines) represents a reported interaction, so that multiple connections are possible between any pair of nodes. Clustering of this network using a graph-based clustering method (MCODE, Bader and Hogue 2003) identified 40 clusters. The dashed circle indicates one of the clusters identified. (b) Zoomed-in view of one identified cluster, showing the large number of reported interactions among these genes.
2. How many distance calculations (as a function of the number of datapoints and clusters) are needed in each iteration of the K-means clustering procedure?
3. Online algorithms are iterative procedures that process one observation (or datapoint) at a time. These can be very useful when the number of observations is so large that it cannot be stored in the computer’s memory (yes, this does happen!). Give an example of an online K-means algorithm. (Hint: Consider the “learning signal” provided by a single observation.)
4. Assuming two proteins are 50% identical, what is the expected dependence of the BLOSUM-based protein sequence distance on sequence length?
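For exercise 3, one common formulation of the per-observation update is sequential K-means, sketched below. Each incoming point moves only its nearest centroid, with a step size of 1/n_i so that each centroid remains the running mean of the points assigned to it so far. This is one possible answer among several; the function and variable names are illustrative.

```python
import numpy as np

def online_kmeans(stream, k, dim, rng=None):
    """Sequential (online) K-means sketch: one observation at a time.

    The "learning signal" from a single observation x is the offset
    (x - centroid) of x from its nearest centroid.
    """
    rng = np.random.default_rng(rng)
    centroids = rng.standard_normal((k, dim))  # arbitrary initialization
    counts = np.zeros(k, dtype=int)            # points assigned to each centroid so far
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Find the nearest centroid to this single observation
        j = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
        counts[j] += 1
        # Step toward x with decaying rate 1/counts[j] (running mean of assigned points)
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids, counts

# Stream of 200 one-dimensional observations, processed without ever
# holding more than one observation "in memory" at a time
cents, counts = online_kmeans([[0.0], [0.1], [10.0], [10.1]] * 50, k=2, dim=1, rng=0)
```

Because only the current observation and the k centroids are held in memory, the dataset itself never needs to fit in memory at once.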