IDENTIFYING TRANSCRIPTION FACTOR BINDING MOTIFS USING THE REDUCE ALGORITHM
Interestingly, the original application of this model (now a classic bioinformatics paper) was to the case where the consensus transcription factor binding sites are unknown. In this case, it's possible to do the regression using all possible short DNA sequence patterns (say all DNA 5-mers, 6-mers, and 7-mers) as "potential" consensus binding patterns, and test them all. Because the linear regression model has analytic solutions, running this many regressions on a computer is no problem. However, if you test enough possible DNA sequence patterns, eventually you will start to find some that appear to explain some of the gene expression levels purely by chance. It's therefore very important to apply multiple-testing corrections when deciding which of the DNA sequence patterns are correlated with the expression data more than you would expect by chance.
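To make the "test every pattern, then correct for multiple testing" idea concrete, here is a minimal sketch on simulated data. It is not the REDUCE implementation: the function names (kmer_counts, scan_kmers, and so on) are hypothetical, only the forward strand is counted, the P-value uses a normal approximation to the t statistic, and a simple Bonferroni correction stands in for whatever correction you prefer. The toy sequences and expression values are random, with one 4-mer deliberately made to drive expression.

```python
import itertools
import math
import random
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided P-value for a single-covariate regression slope.
    Assumption: normal approximation to the t statistic (fine for large n)."""
    if abs(r) >= 1.0:
        return 0.0
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

def scan_kmers(seqs, expr, k, alpha=0.05):
    """Regress expr on the match count of every k-mer, one k-mer at a time,
    and keep those that pass a Bonferroni-corrected threshold."""
    per_seq = [kmer_counts(s, k) for s in seqs]
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    hits = []
    for kmer in kmers:
        counts = [c[kmer] for c in per_seq]
        r = pearson_r(counts, expr)
        p_adj = min(1.0, approx_p_value(r, len(seqs)) * len(kmers))
        if p_adj < alpha:
            hits.append((kmer, r, p_adj))
    return sorted(hits, key=lambda h: h[2])

# Simulated data: 300 random "promoters"; expression depends on one 4-mer.
rng = random.Random(0)
seqs = ["".join(rng.choice("ACGT") for _ in range(100)) for _ in range(300)]
true_kmer = "CACG"  # arbitrary choice for the simulation
expr = [1.0 * kmer_counts(s, 4)[true_kmer] + rng.gauss(0.0, 0.5) for s in seqs]

hits = scan_kmers(seqs, expr, k=4)
print(hits[:3])  # most significant k-mers after correction
```

Note that k-mers overlapping the true pattern can also pass the threshold, because their match counts are correlated with those of the true pattern.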
In the original REDUCE paper (Bussemaker et al. 2001), the authors applied the linear regression model to many gene expression experiments and obtained a measure of statistical association for each short DNA sequence pattern in each experiment. They then had to choose a few of the most significant patterns to include as they built up the model. They used an iterative approach in which they asked whether each new motif explained more additional variance than would be expected for a random motif.
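The iterative idea can be sketched as a greedy loop: pick the candidate that explains the most variance in the current residuals, check that against a null, subtract its fit, and repeat. The sketch below is written in that spirit and is not the published REDUCE procedure. In particular, the null used here (the best R^2 obtained after shuffling the residuals) is an assumption chosen for simplicity, and all names (greedy_select, motif_3, and so on) and the simulated data are hypothetical.

```python
import random

def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def fit_one(x, y_centered):
    """Least-squares slope and R^2 for an already-centered response on x."""
    xc = center(x)
    sxx = sum(a * a for a in xc)
    syy = sum(b * b for b in y_centered)
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    sxy = sum(a * b for a, b in zip(xc, y_centered))
    return sxy / sxx, sxy * sxy / (sxx * syy)

def best_feature(features, y_centered):
    """Return (name, slope, R^2) of the single best-fitting feature."""
    best = None
    for name, x in features.items():
        slope, r2 = fit_one(x, y_centered)
        if best is None or r2 > best[2]:
            best = (name, slope, r2)
    return best

def greedy_select(features, y, n_perm=20, max_steps=10, rng=None):
    """Add features one at a time while the best one beats a shuffled null."""
    rng = rng or random.Random(0)
    resid = center(y)
    chosen = []
    for _ in range(max_steps):
        name, slope, r2 = best_feature(features, resid)
        # Null: best R^2 any candidate reaches when residuals are shuffled.
        null_r2 = []
        for _ in range(n_perm):
            shuffled = resid[:]
            rng.shuffle(shuffled)
            null_r2.append(best_feature(features, shuffled)[2])
        if r2 <= max(null_r2):
            break  # no better than a random motif would do
        chosen.append((name, slope))
        xc = center(features[name])
        resid = [e - slope * a for e, a in zip(resid, xc)]
    return chosen

# Simulated data: 30 candidate motifs, only two of which affect expression.
rng = random.Random(1)
n_genes = 200
features = {f"motif_{j}": [rng.randint(0, 3) for _ in range(n_genes)]
            for j in range(30)}
y = [1.0 * features["motif_3"][i] + 0.6 * features["motif_7"][i]
     + rng.gauss(0.0, 0.5) for i in range(n_genes)]

chosen = greedy_select(features, y, rng=rng)
print(chosen)  # selected motifs with their fitted slopes, in order
```

Because each step fits only the residuals left by the previous steps, a motif that has already been selected explains none of the remaining variance and is not picked again.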
What if we didn’t know which transcription factors were important for the response to low phosphate conditions in this example? We could try to include all of the transcription factors in a database in our model. For example, Figure 8.5 shows the P-values that I got from regressing gene expression on motif matches for ~200 transcription factor motifs from the YETFASCO database (De Boer and Hughes 2012).
FIGURE 8.5 Distribution of P-values for the multiple regression of gene expression levels on motif matches for ~200 transcription factors. This model explains about 14.5% of the variance of the gene expression data, which is probably more than the total amount of explainable variance in the data: The model is overfit.
Based on this distribution, I would guess that there are about 20 or 25 transcription factors that are contributing significantly to the model. At first, it seems reasonable that about this many transcription factors would be involved in the genome-wide expression changes in response to low phosphate. However, one important consideration is that not all of the transcription factor motifs are independent. For example, in the 3-transcription factor model, Msn2 had b = 0.056, meaning that for every match to the Msn2 motif, the expression of that gene was about 5.7% higher. However, if we now include Gis1 to make a 4-transcription factor model, although all 4 transcription factors are still highly significant, we find that the parameter for Msn2 is reduced to b = 0.035. It turns out that the binding motifs for these 2 transcription factors are very similar: When a DNA sequence has a match for Gis1, it very often also has a match for Msn2. Indeed, the numbers of matches for Gis1 and Msn2 are highly correlated (R² = 0.26). The regression model can use either the Gis1 matches or the Msn2 matches or both to explain the expression levels.
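The way a coefficient shrinks when a correlated covariate is added can be reproduced on simulated data. The sketch below is illustrative only: x1 and x2 are hypothetical match counts that share a common component (loosely standing in for two similar motifs), the true coefficients of 0.5 are arbitrary, and none of the numbers correspond to the yeast data discussed above. It compares the slope for x1 when fit alone with its slope when x2 is also in the model, using the closed-form solution of the two-covariate normal equations.

```python
import random

def centered(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slope_alone(x, y):
    """Least-squares slope of y on a single covariate x."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)

def slopes_joint(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 together (2x2 normal equations)."""
    a, b, yc = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, yc), dot(b, yc)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated match counts for two similar motifs: both include a shared part.
rng = random.Random(2)
n = 500
shared = [rng.randint(0, 2) for _ in range(n)]
x1 = [s + rng.randint(0, 1) for s in shared]
x2 = [s + rng.randint(0, 1) for s in shared]
y = [0.5 * a + 0.5 * b + rng.gauss(0.0, 0.5) for a, b in zip(x1, x2)]

c1, c2 = centered(x1), centered(x2)
r2 = dot(c1, c2) ** 2 / (dot(c1, c1) * dot(c2, c2))  # R^2 between covariates
b1_alone = slope_alone(x1, y)
b1_joint, b2_joint = slopes_joint(x1, x2, y)

print("R^2 between x1 and x2:", round(r2, 2))
print("slope for x1 alone:", round(b1_alone, 2))
print("slope for x1 with x2 in the model:", round(b1_joint, 2))
```

When x1 is fit alone, its slope absorbs part of the effect of x2, so it is larger than the slope obtained once x2 is included in the model.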
As we have seen in Figure 8.3, correlations between features (or covariates) are typical in biology, especially when the number of features is large. In these cases, including multiple correlated features will usually improve the fit of the model slightly: because the features are not usually perfectly correlated, the model can always fit a little bit more of the noise, or use the two variables to compensate for deviations of the data from the linear assumption. In principle, you could try to test all of the correlated features using partial correlation analysis (described earlier). However, because it works on pairs of variables, it’s not really practical to apply it to a large number of covariates. In general, when building a high-dimensional model, one has to be cautious about including correlated features, because adding more parameters to a regression model always tends to improve the fit.