APPLICATION OF LINEAR REGRESSION TO eQTLs
To illustrate the power of simple linear regression for hypothesis testing in modern molecular biology, let’s consider a very difficult statistical problem: detecting eQTLs. In Chapter 3, we saw that eQTLs can be identified using a simple t-test that compares the distribution of expression levels for one genotype to the other. The null hypothesis is that the two genotypes have the same expression, so we find eQTLs when we reject it. (We also saw that this leads to a formidable multiple testing problem, with which simple regression doesn’t help us; but see the discussion in Chapter 9.) Here, I just want to show an example of how regression can be used as a test of association in place of the t-test we did before. When we formulate the search for eQTLs as a regression problem, we’re regressing a quantitative measure of gene expression (Y) on the genotypes of each individual (X). Although the genotypes at each locus are discrete (AA, Aa, aa in diploids or A and a in haploids), this isn’t actually a technical problem for linear regression because the Gaussian model applies only to the Y variable; the Xs are assumed to be perfectly observed.
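The claim that regression can stand in for the t-test can be checked numerically. The sketch below is a minimal illustration on simulated data (the sample size, effect size, noise level, and all function names are assumptions made for this example, not the yeast data). It codes a haploid genotype as 0 or 1, regresses expression on it, and compares the result to a two-sample t-test. For a 0/1 predictor, the fitted slope is the difference between the two group means, and the t-statistic for the correlation matches the pooled-variance (equal-variance) two-sample t-statistic.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def slope_and_r(x, y):
    """Least-squares slope of y on x and the Pearson correlation r."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t-statistic (df = n - 2) for the null hypothesis that the correlation is 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def pooled_t(group0, group1):
    """Two-sample t-statistic with pooled variance (group1 minus group0)."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = mean(group0), mean(group1)
    ss = sum((v - m0) ** 2 for v in group0) + sum((v - m1) ** 2 for v in group1)
    pooled_var = ss / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1.0 / n0 + 1.0 / n1))

# Simulated haploid cross (hypothetical numbers, not the yeast data).
random.seed(0)
genotype = [random.randint(0, 1) for _ in range(106)]  # X: 0 = reference, 1 = mutant
expression = [1.0 - 1.5 * g + random.gauss(0.0, 0.5) for g in genotype]  # Y

slope, r = slope_and_r(genotype, expression)
t_regression = t_from_r(r, len(genotype))

expr0 = [e for g, e in zip(genotype, expression) if g == 0]
expr1 = [e for g, e in zip(genotype, expression) if g == 1]
t_two_sample = pooled_t(expr0, expr1)

print(f"slope = {slope:.3f}, difference in means = {mean(expr1) - mean(expr0):.3f}")
print(f"t from regression = {t_regression:.3f}, pooled t-test = {t_two_sample:.3f}")
```

The two printed pairs should agree up to floating-point error. The agreement is with the pooled-variance t-test specifically; a t-test that allows unequal variances would give a slightly different statistic.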
The main advantage of regression in this case is interpretability: when we calculate the correlation between the phenotype (the gene expression level) and the genotype at each locus, the R² summarizes how much of the variation in the expression level is actually explained by the genotype. In addition to the biological interpretation of the test based on R², regression also turns out to be more flexible: it will generalize more easily to multiple dimensions and more easily accommodate differences in ploidy in the experiment.
For example, although it might not look like the typical illustration of linear regression shown earlier, the expression level of AMN1 is strongly correlated with the genotype at a marker on chromosome 2 in yeast. The correlation, r, is -0.847, but note that the sign (positive or negative) is arbitrary: We assigned the reference genotype to be 0 and the mutant to be 1. We could have done it the other way and gotten a positive correlation. The associated t-statistic is -16.2714, with df = 104, which, as you can imagine, is astronomically unlikely to occur under the null hypothesis where the correlation is actually 0 (the P-value is less than 10⁻¹⁰). In addition, we can go ahead and square the correlation, which gives R² = 0.718, suggesting that the genotype at this locus explains almost three quarters of the variance in AMN1 expression. Figure 7.3a shows the data.
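The numbers quoted for AMN1 are tied together by the standard relationship between a correlation and its t-statistic, t = r·sqrt(df / (1 − r²)), which can be inverted to r = t / sqrt(t² + df). The short sketch below uses only the t-statistic and degrees of freedom reported above (the variable names are my own) to recover r and R², assuming the usual df = n − 2 for a test of correlation.

```python
import math

# Values reported in the text for AMN1 and the chromosome 2 marker.
t_stat = -16.2714
df = 104  # df = n - 2, which would correspond to n = 106 segregants

# Invert t = r * sqrt(df / (1 - r^2)) to recover the correlation.
r = t_stat / math.sqrt(t_stat ** 2 + df)
r_squared = r ** 2

# Sanity check: going back from r to t should return the original statistic.
t_check = r * math.sqrt(df / (1.0 - r_squared))

print(f"r = {r:.3f}")
print(f"R^2 = {r_squared:.3f}")
print(f"t recovered from r = {t_check:.4f}")
```

This should reproduce r = -0.847 and R² = 0.718. Flipping the 0/1 coding of the genotype flips the sign of both r and t but leaves R² unchanged, which is why R² is the more natural summary here.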
FIGURE 7.3 A simple regression of gene expression levels on genotype. (a) The dependence of the relative gene expression level of AMN1 in a yeast cross on genotype. Each circle represents a segregant of the cross. (b) The distribution of P-values obtained from tests of association between genotype and gene expression level. (c) R² values associated with tests that passed the Bonferroni multiple testing correction. Typical R² values are much less than what is seen for AMN1.
We can perform tests like this systematically to identify eQTLs in the data. I chose 10,000 random pairs of genes and markers and computed the correlation coefficients and associated P-values. In Figure 7.3b, you can see that the distribution of P-values has an excess of small P-values, indicating that there are probably a large number of significant associations. To see what the typical effect size is, I plotted the R² for all the tests that were significant at 0.05 after a Bonferroni correction (Figure 7.3c). Even though these associations are very strongly significant (remember that the Bonferroni correction is likely to be very conservative), the typical effect sizes are much less than what we saw for AMN1. For these examples, the genotype explains more like 20% or 25% of the variation in gene expression levels. Thus, although tests for eQTLs can be performed using a variety of statistical techniques (t-tests, WMW, etc.), summarizing the statistical association using the fraction of variance explained gives us the insight that AMN1 is probably quite unusual with respect to how much of its gene expression variation is controlled by a marker at a single locus, even among the strongest eQTLs in the experiment.
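The systematic scan described above can be sketched on simulated data. Everything in the example below is an assumption made for illustration: the number of tests (1,000 rather than 10,000, to keep it fast), the fraction of true associations, the effect size, and the function names. To stay within the standard library, the P-value uses a normal approximation to the t-distribution, which is reasonable for df = 104 but is not the exact calculation; a statistics package would normally be used instead.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided P-value for H0: correlation = 0.

    ASSUMPTION: normal approximation to the t-distribution with df = n - 2.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

random.seed(1)
n_segregants = 106   # hypothetical, chosen to give df = 104
n_tests = 1000       # hypothetical number of gene-marker pairs
n_true = 50          # hypothetical number of real associations
alpha = 0.05

results = []  # one (P-value, R^2) pair per gene-marker test
for i in range(n_tests):
    genotype = [random.randint(0, 1) for _ in range(n_segregants)]
    effect = 1.0 if i < n_true else 0.0  # only the first n_true pairs are real
    expression = [effect * g + random.gauss(0.0, 1.0) for g in genotype]
    r = pearson_r(genotype, expression)
    results.append((approx_p_value(r, n_segregants), r * r))

# Bonferroni correction: divide the significance level by the number of tests.
threshold = alpha / n_tests
significant = [(p, r2) for p, r2 in results if p < threshold]

r2_values = sorted(r2 for _, r2 in significant)
print(f"Bonferroni threshold: {threshold:.1e}")
print(f"tests passing the threshold: {len(significant)} of {n_tests}")
if r2_values:
    print(f"median R^2 among significant tests: {r2_values[len(r2_values) // 2]:.2f}")
```

One point the sketch makes concrete is that a Bonferroni threshold implies a minimum R² at a fixed sample size: a test can only pass if its correlation is large enough, so the R² values of the significant tests are bounded below. That floor depends on the sample size and the number of tests, so the simulated R² values should not be read as a prediction for the yeast data.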