# CLUSTERING DNA AND PROTEIN SEQUENCES

To illustrate the power of thinking about molecular biology observations in high-dimensional spaces, I want to now show that a nearly identical clustering procedure is used in an even more “classical” problem in evolutionary genetics and molecular evolution. Here, the goal is to hierarchically organize sequences to form a tree. In this case, the hierarchical structure reflects the shared ancestry between the sequences. Although the task of organizing sequences hierarchically is widely referred to as “phylogenetic reconstruction,” in a sense, reconstructing a phylogenetic tree is a way of grouping similar sequences together—the more similar the sequences in the group, the more recently they shared a common ancestor. To see the mathematical relationship between this problem and clustering gene expression data, we first need to think about the sequences as multivariate observations that have been drawn from a pool. In a sequence of length, L, each position, j, can be thought of as a vector Xj, where the *bth* component of *Xj* is 1 if the sequence has the DNA base *b* at that position. For example, the DNA sequence CACGTG would be

where the bases have been ordered arbitrarily from top to bottom as, A, C, G, T. A protein sequence might be represented as a 20 x *L* matrix. To compute the distance between two (aligned) gene sequences, *X _{g}* and

*X*we could use one of the distances defined earlier. For example, we could sum up the cosine distance at each position

_{h},where to derive the simple formula on the right, I have used the fact that the norm of the vector at each position in each gene sequence is 1 (each position can only have one 1, all the others must be 0s), and the inner product between two matrices is the sum of the dot products over each of the component vectors. In sequence space, this distance metric is nothing but *the number of different nucleotide (or protein) residues* between the two vectors and is a perfectly sensible way to define distances between closely related sequences. Notice that just like with gene expression distances, this simple metric treats all of the dimensions (types of DNA or protein residues) identically. In practice, certain types of DNA (or protein) changes might be more likely than others, and ideally these should be downweighted when computing a distance.

In practice, for comparison of sequences, evolutionary geneticists and bioinformaticists have developed so-called “substitution” or “similarity” matrices that are used to define the distance between sequences. Indeed, perhaps one of the lasting contributions of the earliest bioinformatics projects were Marget Dayhoff’s so-called PAM (Point Accepted Mutation) matrices (Strasser 2010). Changes between DNA (or protein) residues are not equally likely, and therefore bioinformaticists defined quantitative values related to the likelihood of each type of change to capture the idea that not all residues are equally far apart biologically. Figure 5.4 shows the BLOSUM62 substitution matrix, one of the matrices currently in wide use for protein sequence analysis. Taking into account these weights, the distance between two (aligned) sequences is then

where *M* is a “similarity” matrix defining a score for each pair of DNA (in this case) residues, *a* and b. Notice that to make the similarity into a distance, I simply put a negative sign in front. In the case where *M=I *(the identity matrix), the distance between two sequences is simply proportional to the number of different positions. This type of distance is directly analogous to the weighted distances discussed above. Once these distances have been defined, it is straightforward to find the “closest” or most similar pairs of sequences in the sample. The closest sequences are then iteratively merged as described earlier to build the bifurcating tree, and clustering of sequences can proceed just as any other data.

FIGURE 5.4 An example of similarity matrix used in bioinformatics, BLOSUM62 (Henikoff and Henikoff 1992). The 20 naturally occurring amino acids (and *X* and *, which represent ambiguous amino acids and stop codons, respectively) are represented. Higher scores mean greater similarity—entries along the diagonal of the matrix tend to be relatively large positive numbers.

Note that the agglomerative hierarchical clustering method described does not actually estimate any parameters or tell you which datapoints are in which clusters. Nor does it have an explicit objective function. Really it is just organizing the data according to the internal structure of similarity relationships. Another important issue with hierarchical clustering is that it is necessary to compute the distances between all pairs of datapoints. This means that if a dataset has thousands of observations, milllions of distances must be calculated, which, on current computers, is typically possible. However, if the dataset approaches millions of datapoints, the number of pairwise distance calculations needed becomes too large even for very powerful computers.