# EXAMPLE OF A HIGH-DIMENSIONAL MULTIPLE REGRESSION: REGRESSING GENE EXPRESSION LEVELS ON TRANSCRIPTION FACTOR BINDING SITES

To illustrate an elegant application of multiple regression in genomics, let’s try to build a statistical model to predict genome-wide expression levels from DNA sequences. A key idea in molecular biology is that gene expression levels are due in part to the presence of consensus binding sites for sequence-specific DNA binding proteins in the noncoding DNA adjacent to genes. But how much of genome-wide expression patterns do these binding sites really explain?

We can use multiple regression to try to answer this question starting with the simplest possible assumptions. Let’s let the gene expression level for a gene, g, in a given condition be *Y_{g}*. If we know the binding specificities (or motifs) of transcription factors (e.g., from in vitro binding experiments), we can count up the number of times each of these sequence patterns occurs nearby each gene. In the regression model, the numbers of matches to each binding motif nearby the gene are the *X*s that we will use to try to explain Y, the expression level. The regression model is

*Y_{g} = b_{0} + b_{TBP}X_{TBP} + b_{Pho4}X_{Pho4} + b_{Msn2}X_{Msn2} + ε*

where for each transcription factor we have a parameter *b*, and the *X*s are the numbers of matches to the binding motif for each transcription factor (TBP, Pho4, Msn2, etc.). Notice that, as with the eQTLs, the Xs in this model are not well modeled by Gaussian distributions: The number of matches nearby each gene can be 0, 1, 2, .... Again, we won’t worry too much about this, because the regression model assumes that there is Gaussian noise only in the Y, and that the Xs are measured perfectly. We’ll use the log ratios of gene expression changes as the Y in this model, which are real numbers (like -0.24, 1.4345, etc.), so we expect the Gaussian noise assumption to work reasonably well.
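To make the setup concrete, here is a minimal sketch of the whole pipeline: counting motif matches near each gene and fitting the multiple regression by least squares. The motifs, promoter sequences, and expression values below are made up for illustration; they are not the data analyzed in this chapter.

```python
import numpy as np

# Consensus binding sites (illustrative, not exact biological motifs).
motifs = {"TBP": "TATAAA", "Pho4": "CACGTG", "Msn2": "AGGGG"}

def count_matches(promoter, motif):
    """Number of (possibly overlapping) occurrences of motif in promoter."""
    return sum(promoter[i:i + len(motif)] == motif
               for i in range(len(promoter) - len(motif) + 1))

# Toy promoter sequences and hypothetical log expression ratios for four genes.
promoters = {
    "PHO5":  "CACGTGTTTATAAACACGTG",
    "HSP12": "AGGGGAGGGGTATAAA",
    "ACT1":  "TATAAAGCGCGC",
    "GENE4": "GCGCGCGCGCGC",
}
y = np.array([2.1, 1.3, 0.2, -0.1])

# Design matrix: a column of ones (intercept b0) plus one column of
# motif-match counts per transcription factor.
X = np.column_stack(
    [np.ones(len(promoters))] +
    [[count_matches(p, m) for p in promoters.values()] for m in motifs.values()]
)
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates of the b's
print(dict(zip(["b0"] + list(motifs), np.round(b, 3))))
```

Each fitted component of *b* plays exactly the role described above: a per-motif effect on expression, assumed proportional to the number of matches.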

It is interesting to think about the biological interpretation of the vector, b, in this model. If the component of *b* corresponding to Msn2 is 0, the presence of that binding site does not predict anything about expression levels in the condition in which the experiment was conducted, perhaps suggesting that the transcription factor is inactive. On the other hand, a positive *b* means that genes with that sequence pattern nearby tend to be relatively overexpressed, perhaps suggesting that the transcription factor tends to activate gene expression. Similarly, a negative *b* might indicate that a transcription factor is a repressor, leading to lower levels of gene expression. Since the model is *linear,* we are assuming that all transcription factors that happen to bind near a gene combine independently to determine its gene expression, and that the impact that a transcription factor has on that gene’s expression is proportional to the number of binding sites for that transcription factor nearby. Of course, these assumptions are very naive, but the point here is that if we make these assumptions it becomes straightforward to actually test the model *globally and quantitatively.*

If we go ahead and do this multiple regression, we can compute the R^{2}, which tells us how much of the total variation in gene expression measurements can be explained under the simple additive assumption.
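Concretely, R^{2} compares the residual variation around the fitted values to the total variation around the mean (this is the standard definition, not anything specific to this example):

```latex
R^{2} \;=\; 1 \;-\; \frac{\sum_{g}\left(Y_{g} - \hat{Y}_{g}\right)^{2}}{\sum_{g}\left(Y_{g} - \bar{Y}\right)^{2}}
```

where \(\hat{Y}_{g}\) is the fitted value for gene g under the regression model and \(\bar{Y}\) is the mean expression measurement. An R^{2} of 1 means the additive model predicts every measurement perfectly; an R^{2} of 0 means it does no better than the mean.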

For example, if we try regressing the gene expression changes due to growth in low-phosphate conditions on the transcription factors mentioned earlier, we find that all three have b’s that are significantly greater than 0 (activate transcription?), and that together they can explain about 3.7% of the total variation in gene expression due to low-phosphate conditions (Figure 8.4). Now, 3.7% might not sound like a lot, but when interpreting this we have to consider a few things. First of all, how much of the variation could *possibly* have been explained? The data I used were from early microarray experiments (Ogawa et al. 2000) that undoubtedly contain a lot of noise: The R^{2} between biological replicates is only about 13%, so the model might account for more than 1/4 of the variance that could have been explained. Whenever you are building a statistical model, it’s always good to keep in mind that the explainable variance might be much less than the total variance in the data: If your model starts fitting the data better than the agreement between replicates, this probably means that you’re overfitting. The second important thing to remember is that we used only three transcription factors to explain *all* the gene expression changes. In fact, the cell has hundreds of transcription factors, and many of them are active at any given time.
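The "more than 1/4" claim is just the ratio of the two R^{2} values quoted above; the one-liner below makes the arithmetic explicit.

```python
# Numbers quoted in the text: the three-motif model explains ~3.7% of total
# variance, while replicate agreement caps explainable variance at ~13%.
model_r2 = 0.037
replicate_r2 = 0.13

fraction_of_explainable = model_r2 / replicate_r2
print(f"{fraction_of_explainable:.0%} of the explainable variance")  # more than 1/4
```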

It’s well known that transcription factors can work together to encode so-called “combinatorial” logic. For example, it might be important to have both the TBP and Pho4 binding sites in the noncoding DNA in order

FIGURE 8.4 Predicted and observed gene expression data. Predictions or fitted values of the regression model, E[Y|X] (vertical axis), compared to the observed values, *Y* (horizontal axis), for three multivariate regression models.

to achieve full repression of gene expression. In regression, this type of logic can be included as an “interaction” between two covariates (or features) by defining a new covariate that is the product of the two covariates. For example, in this case the regression might be

*Y_{g} = b_{0} + b_{TBP}X_{TBP} + b_{Pho4}X_{Pho4} + b_{interaction}X_{TBP}X_{Pho4} + ε*

where I have included the coefficient *b_{interaction}* for the interaction term. Conveniently, we can test whether there is statistical evidence for the interaction by testing whether *b_{interaction}* is different from zero. It’s important to notice that including these kinds of interactions increases the number of parameters in the model, so that measures like R^{2} will always increase as more such interactions are included in the model. Indeed, including interaction terms in the phosphate gene expression model does increase the R^{2} by about 1%, but the interaction terms are not significant. With a large number of covariates, the number of possible interactions becomes very large, and trying to include them all creates a very bad multiple-testing problem. Finally, you should notice that including interactions in this way assumes that the interaction effect is proportional to the product of the covariates. In practice, this might not be true.
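The test for an interaction can be sketched as follows: add the product covariate to the design matrix and compute a t-statistic for its coefficient from the least-squares fit. The data here are simulated (with purely additive effects and no true interaction), so this illustrates the mechanics rather than reproducing the phosphate analysis.

```python
import numpy as np

# Simulated motif-match counts for two transcription factors (hypothetical
# effect sizes, not estimates from real data).
rng = np.random.default_rng(0)
n = 200
x_tbp = rng.poisson(1.0, n)
x_pho4 = rng.poisson(1.0, n)
# Expression generated with additive effects only: no true interaction.
y = 0.3 * x_tbp + 0.9 * x_pho4 + rng.normal(0.0, 1.0, n)

# Design matrix with intercept, main effects, and the product (interaction) term.
X = np.column_stack([np.ones(n), x_tbp, x_pho4, x_tbp * x_pho4])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ b
df = n - X.shape[1]                         # residual degrees of freedom
sigma2 = resid @ resid / df                 # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the coefficient estimates
t_interaction = b[3] / np.sqrt(cov[3, 3])   # t-statistic for b_interaction
print(f"b_interaction = {b[3]:.3f}, t = {t_interaction:.2f}")
```

Because the simulated data contain no true interaction, the t-statistic will usually be small; comparing it to a t distribution with n − 4 degrees of freedom gives the p-value for the test described above.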