eQTLs: A VERY DIFFICULT MULTIPLE-TESTING PROBLEM
As a final example of how important multiple testing can be in practice, let’s consider one of the most complicated statistical problems in modern molecular biology: the search for expression quantitative trait loci or eQTLs. The goal of an eQTL experiment is to systematically find associations between gene expression levels and genotypes. For example, in a classic eQTL experiment in yeast (Brem and Kruglyak 2005), the expression levels of thousands of genes were measured in more than 100 segregants of a cross of a BY strain with an RM strain. The segregants were also genotyped at thousands of markers covering nearly the whole genome. eQTLs are genetic loci (or markers, usually represented by single nucleotide polymorphisms [SNPs]) that are associated with gene expression levels. For example, there was a very strong eQTL for AMN1 expression just upstream of the AMN1 gene, so that strains with the RM allele had much higher gene expression levels than those with the BY allele (Figure 3.2, top left). This is exactly the sort of thing that we’re looking for in an eQTL study, as it strongly suggests that there’s an important genetic difference between the two strains that controls expression of the nearby gene.
As is obvious from the histogram, there’s a very strong statistical difference between these two expression patterns. Since the expression levels look reasonably Gaussian, we can go ahead and do a t-test, which gives a P-value around 10^-30, strong evidence that the genotype and expression level are not independent.
However, AMN1 might not be the only gene whose expression is linked to this locus. Sure enough, DSE1’s expression level is also very strongly linked to the SNP near the AMN1 locus, and as you can guess from the histogram, the association is highly statistically significant (Figure 3.2, top right). And this is the great thing about the eQTL experiment: we’ve measured expression levels for all ~5800 yeast genes, so we can go ahead and test them all against the genotype at this locus. That’s only 5800 tests, and associations as strong as these will clearly still be significant even after a stringent multiple-testing correction. However, in general, we don’t know which will be the important loci to look at: remember, we also have the genotypes at ~3000 loci. We need to try them all; now we’re up to more than 17 million tests. The P-values for 10,000 of those WMW tests (randomly chosen from the 17 million) are shown in the bottom left panel of Figure 3.2. You can see that, as with the differentially expressed genes, there are already a large number of tests piling up in the bin for P < 0.05, so some kind of FDR correction will probably be appropriate here.
FIGURE 3.2 Multiple testing in a yeast eQTL experiment. The top two panels show the distribution of gene expression levels for AMN1 (left) and DSE1 (right) colored by the genotype at Chromosome 2, position 555787. The bottom left panel shows the distribution of P-values for t-tests to identify eQTLs for 10,000 randomly chosen combinations of gene expression level and marker. The bottom right panel shows the number of tests needed to discover eQTLs for one position in the genome (1 marker), all positions in the genome (all markers), or all pairs of genotypes (all pairs of markers).
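To make "some kind of FDR correction" concrete, here is a minimal sketch of one common choice, the Benjamini-Hochberg step-up procedure. The text does not say which FDR method would be used here, so treat this as an assumption for illustration; the function name and the example P-values are made up and are not data from the yeast experiment.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the (sorted) indices of tests rejected at FDR level alpha.

    Assumption: the Benjamini-Hochberg step-up procedure is used as the
    FDR correction; other FDR methods exist.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_max = rank
    # Reject every test ranked at or below k_max.
    return sorted(order[:k_max])


# Hypothetical P-values: two very strong associations, one moderate one,
# and several tests that look like draws from the null.
example_p = [1e-30, 1e-12, 0.004, 0.03, 0.2, 0.5, 0.7, 0.9]
rejected = benjamini_hochberg(example_p, alpha=0.05)
print("Rejected test indices:", rejected)
```

Note that the test with P = 0.03 passes the naive P < 0.05 cutoff but is not rejected by this procedure, because its rank-adjusted threshold (0.05 × 4/8 = 0.025) is stricter.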
In general, however, geneticists don’t believe that genotypes affect quantitative traits independently: typically, the expression level of a given gene is expected to be affected by multiple loci in the genome (e.g., genetic differences in transcription factor binding sites in the promoter, as well as genetic differences in the transcription factors that bind to them). What we would really like is to test for associations between combinations of genotypes and the expression levels. I hope it’s clear that doing this naively, even just for pairs of loci, will yield more than 10 billion tests, not to mention the computer power needed to actually compute all the tests (Figure 3.2, bottom right). Obviously, smarter methods to search for these types of effects are needed.
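The test counts quoted in this section can be checked with a few lines of arithmetic. This sketch uses the approximate figures from the text (~5800 genes, ~3000 markers) and assumes that "pairs of loci" means unordered pairs of distinct markers. The Bonferroni threshold is included only to show how quickly the per-test cutoff shrinks; the variable names are illustrative.

```python
from math import comb

n_genes = 5800    # approximate number of yeast genes measured (from the text)
n_markers = 3000  # approximate number of genotyped loci (from the text)

# One marker tested against every gene.
tests_one_marker = n_genes
# Every marker tested against every gene.
tests_all_markers = n_genes * n_markers
# Every unordered pair of markers tested against every gene (assumption).
tests_all_pairs = n_genes * comb(n_markers, 2)

alpha = 0.05
for label, n_tests in [
    ("1 marker", tests_one_marker),
    ("all markers", tests_all_markers),
    ("all pairs of markers", tests_all_pairs),
]:
    # Bonferroni: divide the desired significance level by the number of tests.
    print(f"{label}: {n_tests:,} tests, Bonferroni threshold {alpha / n_tests:.1e}")
```

With these round numbers, testing all markers gives 17.4 million tests and testing all pairs gives about 26 billion, consistent with the "more than 17 million" and "more than 10 billion" figures above.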
Finally, to add one more layer of complexity to the eQTL picture: so far, we’ve only considered searching for differences in gene expression level that are associated with genotype. In fact, it’s quite possible that the variability in gene expression levels or covariance between expression levels of one or more genes is actually under genetic control. This means that we could consider an even larger number of phenotypes, derived from the single-gene measurements. I hope that it’s no surprise that development of statistical methods for eQTL analysis has been an area of intense research interest.