# IDENTIFYING MOTIF MATCHES IN DNA SEQUENCES

A classic bioinformatics problem is to find examples (sometimes called instances) of weak patterns (called motifs) in DNA (or protein) sequences. For example, based on known positive examples (training set), we estimate a model for the specificity of a transcription factor. This model gives the probability of observing DNA bases at each position in the known positives. So if we have a pattern of width w, we will have a matrix whose entries give the probability of observing each of the four bases (A, C, G, and T) at each of the w positions in the motif. Training examples of known DNA binding sites are shown in Table 10.1.

We then go ahead and estimate the parameters for the motif probability matrix:

Given the motif model, we want to find new examples of the binding site in a long DNA sequence. To think of this as a classification problem (illustrated

TABLE 10.1 Binding Sites from Promoters of GATA Regulated Genes in the SCPD Database

Gene |
Binding Site |

YIR032C |
GATAAG |

YIR032C |
GGTAAG |

YIR032C |
GATAAG |

YJL110C |
GATAAT |

YKR034W |
GATAGA |

YKR034W |
GATAAC |

YKR039W |
GATAAG |

YKR039W |
GATAAC |

*Source:* Zhu, J. and Zhang, M.Q., *Bioinformatics,* 15(7-8), 607, 1999.

FIGURE 10.6 Motif matching as a classification problem. At each position in the DNA sequence, we have a w-dimensional datapoint (w = 6 in this example). Our goal is to classify whether each position is an example of the motif or not. In the naive Bayes framework, this is done by computing ? (У|Х), assuming that the dimensions (the six positions in the motif) are independent.

in Figure 10.6), our goal is to classify each subsequence of length *w* in a long sequence as either an example or instance of this motif (positive) or a background sequence (negative).

Like many classification problems in bioinformatics, this training set consists only of positive examples of known binding sites. We are not given any sequences where we know there is no binding by the transcription factor. However, we can use the biological intuition that the vast majority of genomic sequences will not be bound by the transcription factor and assign the negative class a "background model," say, g, which we assume has the genome average probability of A, C, G, and T at each position.

To classify a new position *i* as a motif instance or not, we calculate the likelihood of the sequence *X _{i}* under the motif model (as in Chapter 6) for the positive class and under the background model for the negative class

To derive the naive Bayes classification boundary, we need to take the log ratio of these and compare to the ratio of the priors:

In practice, the log ratio of the priors is often referred to as a "cutoff" or "threshold," say, t, and typically a different cutoff is chosen for each motif model, *f.* We then identify motif matches at each sequence in the genome where

For ease of interpretation, the cutoff, t, can always be converted back into the prior using this formula.

For example, a cutoff of 4 corresponds to a prior of (1/e^{4} + 1) = 0.018. If we search DNA with a cutoff of 4, this means that (before seeing the data) we believe that less than 2% of the positions that we test correspond to examples of the motif.

Because this model is typically used to scan very long DNA sequences, the log ratio of parameters for each base *b* at each position *j* is often precomputed and stored as a "position specific scoring matrix" or PSSM. This matrix has the form

When written this way, the naive Bayes motif search procedure can be written as the (Frobenius) inner product of the PSSM with the datapoint. We should classify a position as a motif when