Although mixtures of Gaussians aren’t usually used in practice to cluster genome-scale expression datasets, mixture models have proven very effective at modeling another type of important molecular biology data: sequences. Because sequence data is not continuous, usually Gaussian models are not appropriate. We’ll consider two popular bioinformatics applications that are based on mixture models, however, in both cases they will turn out to be mixtures of other distributions.

The MEME software (Bailey and Elkan 1995), for identifying “motifs” or patterns in DNA and protein sequences, is based on a two-component mixture model. In this case, the two “clusters” are (1) a family of short sequences that share a similar pattern, and (2) all the rest of the “background” sequences. Amazingly, MEME can effectively separate out the short sequences that share a pattern without needing to know what the pattern is in advance: The pattern is the parameters of the cluster that are estimated by E-M.

MEME starts by choosing a “motif width” w, which will be the length of the pattern it will discover. The input DNA or protein sequence (say of length n) is the decomposed into subsequences of length w, “w-mers.” These w-mers can then be treated as i.i.d. observations drawn from a pool where some of the w-mers are members of the motif cluster, and the others are members of the background cluster. As in any clustering problem, we don’t know which of the w-mers belong in which cluster ahead of time. Once again, we can write down the likelihood of the data:

In writing the objective function for this mixture model, I didn’t include a sum over the mixture components, because in this case there are only two: either a w-mer is in the motif cluster, or in the background class. Instead of writing the sum, I explicitly wrote out the two components, and gave the motif cluster the parameters f and mixing parameter ф, and the background cluster parameters g and mixing parameter l-ф. Once again, we will think of each position, j, in the w-mer as a vector Xj, where the bth component of Xj is 1 if the sequence has the DNA base (or protein residue) b at that position. For example, the w-mer TCCACC would be

where the bases have been ordered arbitrarily from top to bottom as, A, C, G, T. A protein sequence would be represented as a 20 x w matrix.

To write the probability of a w-mer under the motif or background model, MEME assumes that

where f is a matrix of parameters for the motif cluster specifying the probability of observing base b at position j. The matrix of probabilities, f, defines that pattern of sequences that would be expected if we were to draw w-mers that were all from the motif family. Similarly, the matrix g defines the w-mers that we would expect from the background pool. For example, a parameter matrix for the motif might look like

Thus, MEME is based on a mixture model, but it is not a mixture of Gaussians. Instead, it is a mixture of categorical (or sometimes called multinomial) probability models that were introduced in Chapter 4. Let’s take a minute to consider what the formula for the categorical probability model says. For example, it gives

Since any number raised to the power of 0 is 1, we get

So the formula says that the probability of a sequence is simply the product of the probabilities of the bases we observed, which I hope you’ll agree makes sense.

The clever representation of sequences as matrices of 0s and 1s will also make the objective function look relatively simple for this model. Starting with the complete log-likelihood

where I have introduced Q to represent the hidden variable determining which of the two clusters (motif, m or background, g) to which the w-mer, X, belongs (see Figure 6.6). Substituting the discrete probability model mentioned earlier and taking the log and the expectation we get

We can now go ahead and differentiate to obtain the E-M updates for this model.

In this model, the positions, i, where Q{ = 1 are the unobserved “true” positions of the motifs. Although we don’t observe these, we fill them in

Representation of a DNA sequence, X, using a mixture model

FIGURE 6.6 Representation of a DNA sequence, X, using a mixture model. Hidden variables Q indicate whether each position is drawn from the motif (1) or background (0). Here, the DNA sequence is decomposed into w-mers, which we treat as multivariate observations.

with their “expectations” (Q). If we want to know the likely positions of the motif instances, we should look for the position in the sequence where (Q) approaches 1. However, the model parameters, f that we estimate are perhaps of even more interest. These represent the sequence preferences of the transcription factor at each position: this is the pattern or “motif.”

Here, I have described the simplest version of the MEME model, where all of the data is treated as a simple two-component mixture. However, in practice sequence data might have more structure. For example, we might have 28 bound genomic regions identified in a ChIP-exo experiment: in this case, we might want to find a pattern that occurs once in each sequence fragment (one occurrence per sequence model, oops). In the MEME software, several such model structures have been implemented (Bailey 2002).

A second example of popular bioinformatics software that is based on a mixture model is Cufflinks (Trapnell et al. 2010). RNA-seq experiments aim to detect the mRNA abundance based on the number of short reads mapping to that gene relative to all others in an RNA sequence library. However, in practice, in complex eukaryotic genomes, the reads obtained from each gene might result from several different alternative splice forms. The expression of a gene should be related to the abundance of all of the transcripts that the gene can produce, rather than the average number of reads overlapping all the exons. Usually, however, we aren’t sure which splice forms are expressed in a given experiment, and many reads could represent multiple transcripts from the same gene. Cufflinks deals with this uncertainty by treating the transcript as a hidden variable, such that the alignment of reads that we observe is explained by a mixture, where each component (or class) corresponds to one of the known transcripts for that gene. Specifically, we can write a probabilistic model for the alignment of reads as

As usual, I have used Z to represent the hidden variable that tells us which transcript, t, is responsible for the read, R. Here, we assume that we know the number of transcripts, K, in advance. Notice that this model corresponds exactly to the mixture model, where the number of reads in alignment i is a mixture of reads from all the transcripts.

Similar to the equation for the mixture of Gaussians, the first part of the likelihood represents the mixing parameter, in this case, the probability that it was transcript t that generated the read R. If we assume that reads are generated simply proportional to the amount of RNA present, the probability of observing a read from this transcript is just the fraction of the RNA in this transcript. The amount of RNA in each transcript depends both on the abundance of that transcript, pt, and its length lt.

In this model, we assume that we know the lengths of each transcript (based on the genome annotation) but that the abundances are unknown— in fact, these abundances p are exactly the parameters that we would like to use the model to estimate.

The second part of the likelihood is the probability of observing a read given that we know which transcript it came from. In practice, there can be several attributes of the read that we consider when we compute this probability (Figure 6.7). Most obvious is whether the reads map only to exons that are actually contained in this transcript: the probability of generating a read mapping to exons that are not found in this splice form must be zero. Similarly, if a portion of the read hangs over the edge, it is unlikely that the read was generated from this transcript.

Factors influencing the probability that a read was part of a transcript. (a and b) Represent the exons (gray bars) of two alternative transcripts at

FIGURE 6.7 Factors influencing the probability that a read was part of a transcript. (a and b) Represent the exons (gray bars) of two alternative transcripts at

a locus. (c) Shows an aligned read (black line) that could have been generated by either transcript. (d) An aligned read that almost certainly came from transcript (a). (e) A paired-end read that could have come from either transcript, but is more likely to come from transcript (a) because the distance between the paired reads implied by transcript (a) is more consistent with the typical space between paired-end reads (dashed line).

Given that a read falls within a transcript, the simplest model is to assume a uniform probability of observing a read at each possible position in the transcript t:

where Wt represents the length of the read. Thus, in this simplest form, cufflinks assumes that the reads are drawn from a mixture of uniform distributions, with each transcript producing a slightly different uniform distribution.

In practice, cufflinks considers several other more subtle features: for example, for reads where we have sequenced two ends (paired-end reads), we can consider the implied length of the DNA fragment that produced the pair, based on the mapping of the two paired reads in this transcript (Figure 6.7). If this transcript implies a DNA fragment length that is much larger or smaller than typical for our library, we should consider these read alignment less likely given this transcript. In this case, we can write

where I have now also indexed W by the transcript, t, and included some parameters 0 to describe the distribution of fragment lengths (we might know this in advance or want to estimate it). In practice, the probability that this transcript generated this aligned read might also depend on the transcript abundances, p, the mapping confidence of the reads, and other various biases in the sequencing experiment.

Given the specification of the model, it remains to estimate the transcript abundances, p, based on this mixture model. In practice, structure is imposed so that each locus is treated independently and the optimization of the model is performed using numerical methods. Cufflinks is unusual for a bioinformatics software in that it doesn’t compute a single estimate of p, but rather reports a Bayesian description of the distribution (obtained from a sampling procedure—a type of inference algorithm that we won’t cover in this book).

< Prev   CONTENTS   Source   Next >