# INTEGRATING DATA SOURCES WITH SEQUENCE AND EXPRESSION CLUSTERS

A good illustration of the conceptual power of probabilistic modeling is that models for seemingly disparate data, like gene expression (high-dimensional, positive, real numbers) and sequences (discrete strings of letters), can be combined. For example, a classic genomics problem is to find genes that have similar expression patterns and that also share sequence features (such as transcription factor binding sites) in the noncoding DNA flanking the gene. Using the clustering methods described in Chapter 4 (hierarchical or *k*-means), you first cluster the expression data, and then in a second step try to find sequence patterns in the noncoding DNA (perhaps using MEME or another motif-finding method). However, with a probabilistic model, it's possible to combine the two types of data into a single model (Holmes and Bruno 2000). We might imagine grouping genes into clusters based both on their expression patterns *and* on the presence of similar sequence motifs flanking the genes. Let's say the expression data is *X* and the sequence data is *Y*; a model for both is

$$P(X, Y) = \sum_{c} \pi_c \, N(X \mid \mu_c, \Sigma) \, M(Y \mid f_c, \phi_c)$$

where

*N*(*X*) is a Gaussian model for the gene expression measurements

*M*(*Y*) is a MEME-like mixture model for all the *m* *w*-mers found in the flanking noncoding DNA sequences, where each cluster is associated with its own sequence model, *f_{c}*, and mixing parameter *φ_{c}*

In this model, we are saying that the expression and sequence data are dependent, that is, *P(X, Y)* is not equal to *P(X)P(Y),* but that *given* that we know what cluster they are truly from, then they are independent. In other words, the expression data and the sequence data depend on each other *only* through the hidden variable that assigns them to a cluster. Slightly more formally, we are saying
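This dependence structure is easy to check numerically. The sketch below (my own toy illustration, not the authors' code) uses two clusters, a one-dimensional Gaussian for "expression," and a single categorical symbol standing in for the MEME-like motif model; all parameter values are made up. Given the cluster, the joint probability factorizes, but marginally *P*(*X*, *Y*) differs from *P*(*X*)*P*(*Y*):

```python
# Toy two-cluster version of the joint model: X and Y are conditionally
# independent given the cluster Z, but marginally dependent.
import math

pi = [0.5, 0.5]                  # cluster priors P(Z = c)
mu, sigma = [-2.0, 2.0], 1.0     # per-cluster Gaussian means for expression X

# Per-cluster categorical distributions over a tiny alphabet, standing in
# for the MEME-like motif model M(Y).
f = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
     {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]

def norm_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def p_joint(x, y):
    # P(X, Y) = sum_c P(Z=c) P(X|Z=c) P(Y|Z=c): the factors inside the sum
    # are independent only *given* Z.
    return sum(pi[c] * norm_pdf(x, mu[c], sigma) * f[c][y] for c in range(2))

def p_x(x):
    return sum(pi[c] * norm_pdf(x, mu[c], sigma) for c in range(2))

def p_y(y):
    return sum(pi[c] * f[c][y] for c in range(2))

x, y = 2.0, "T"
print(p_joint(x, y), p_x(x) * p_y(y))  # marginally, P(X,Y) != P(X)P(Y)
```

Seeing an expression value near +2 makes cluster 2 likely, which in turn makes the symbol "T" likely; that shared route through *Z* is the only source of the dependence.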

$$P(X, Y) = \sum_{c} P(Z = c) \, P(X \mid Z = c) \, P(Y \mid Z = c)$$

where I have omitted the products over i.i.d. datapoints to highlight the dependence structure of the model. Notice the effect that this will have when we try to differentiate the expected complete log-likelihood:

$$\left\langle \log P(X, Y, Z) \right\rangle = \sum_{c} \left\langle Z_c \right\rangle \left[ \log \pi_c + \log N(X \mid \mu_c, \Sigma) + \log M(Y \mid f_c, \phi_c) \right]$$

FIGURE 6.8 The Holmes and Bruno joint model for expression data and sequence motifs represented as a graphical model. (a) The structure of the model where all variables are explicitly represented. *Z* represents the hidden “cluster assignment” and *Q* represents the hidden “motif” variable. *X* represents the observed expression levels, and *Y* represents the observed DNA residues. (b) The collapsed representation where the *m* DNA residues for each gene are shown in a box and the *n* genes are shown in another box. (c) The structure of the model with the biological interpretation of each variable.

The terms involving the sequence data *Y* will be separate from the terms involving the expression data *X,* so it will be straightforward to derive the E-M updates for this problem as well.
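To make that separation concrete, here is a sketch of one E-M iteration for the same simplified toy model as above (one expression value and one sequence symbol per gene, two clusters; my own illustration, not the published algorithm). The E-step responsibilities use both data types, while the M-step updates for the Gaussian and sequence parameters decouple exactly as the text describes:

```python
# One EM iteration for a simplified joint expression + sequence mixture.
import math

def norm_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

X = [-1.9, -2.2, 2.1, 1.8]     # toy expression measurements, one per gene
Y = ["A", "A", "T", "T"]       # toy flanking-sequence symbols, one per gene
pi = [0.5, 0.5]                # cluster priors
mu, sigma = [-1.0, 1.0], 1.0   # initial Gaussian means, fixed variance
f = [{"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
     {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4}]

# E-step: responsibilities combine evidence from *both* data types,
# because X and Y share the hidden cluster variable Z.
resp = []
for x, y in zip(X, Y):
    w = [pi[c] * norm_pdf(x, mu[c], sigma) * f[c][y] for c in range(2)]
    total = sum(w)
    resp.append([wc / total for wc in w])

# M-step: since the expected complete log-likelihood is a sum of X-terms
# and Y-terms, the Gaussian and sequence parameters update independently.
for c in range(2):
    n_c = sum(r[c] for r in resp)
    pi[c] = n_c / len(X)
    mu[c] = sum(r[c] * x for r, x in zip(resp, X)) / n_c
    for base in "ACGT":
        f[c][base] = sum(r[c] for r, y in zip(resp, Y) if y == base) / n_c

print(mu)  # the cluster means move toward the two expression groups
```

Note that the responsibility calculation is the only place the two data types interact; each M-step update touches only its own parameter block.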

It is instructive to consider the implied causality relationships in this model. To do so, let's use the graphical models representation introduced here. In this representation (Figure 6.8), the hidden variables *Z_{i}* specify which cluster gene *i* belongs to. Then, given that cluster, the model generates the expression data *X* and some hidden variables *Q* that represent the presence of motifs in the DNA sequence. These hidden *Q* variables specify whether a *w*-mer *Y* at position *j* was drawn from the *k*th motif class or the background class. Remember that both the expression data *X* and the *w*-mers *Y* are actually multivariate observations. In the model, we didn't specify what the "cluster" or *Z* represents, but as we look at the structure, it is tempting to interpret the "clusters" as "regulators" that lead both to patterns of gene expression and to the appearance of binding sites in DNA. I hope this example illustrates the power of the graphical models representation: the structure of the model is much easier to see here than in the formula for the joint distribution.