# Multiple Testing

## THE BONFERRONI CORRECTION AND GENE SET ENRICHMENT ANALYSIS

Given the ease with which data can be collected and statistical tests can be performed with computational statistics packages, the problem in contemporary molecular biology is that we often have too many statistical tests that we might want to do. For example, after a high-throughput screen we are typically in the situation where we have obtained a list of genes from the experiment, and we have no idea what those genes have to do with our experiment. We can try to come up with a hypothesis or two, but it’s often more convenient to do systematic gene set enrichment analysis. We test for all possible enrichments of anything we can get our hands on: gene ontology (GO) categories, Munich Information Center for Protein Sequences (MIPS) functions, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, ChIP-seq compendia, phosphoproteomics databases, ... you get the picture. In bioinformatics jargon, these types of functional assignments to genes are known as “annotations.” Table 3.1 shows some examples of functional annotations.

Of course, this seems like a good idea, and we can often discover interesting biology this way. However, let’s think about the statistical situation here. As discussed in Chapter 2, we calculate an exact P-value using the hypergeometric distribution. This P-value is the probability of getting a gene list as enriched or more as the one we got under the null hypothesis that our gene list was randomly drawn from the pool of genes. Let’s say we want the significance threshold to be *P* < 0.05. This threshold means

TABLE 3.1 Some Common Annotations of Gene Function

Gene ontology (GO) |
Assigns biological process, cellular compartments, and molecular functions based on a variety of evidence. Annotations or “GO terms” are organized into a hierarchy, where most general terms at the top, and most specific terms are at the bottom. Terms at the bottom can be thought to “inherit” all the terms above them. For example, “MAP kinase phosphorylation,” would be below “protein phosphorylation” in hierarchy. |

Munich Information Center for Protein Sequences (MIPS) |
Comprehensive, consistent automatic annotation based on sequences and curation of protein interactions. |

Kyoto Encyclopedia of Genes and Genomes (KEGG) |
Most widely used for assignments of genes into biological pathways. Each pathway is either metabolism, genetic information processing, environmental information processing, organismal systems, human diseases, or drug development. Within these broad categories, pathways such as “amino acid metabolism” or “replication and repair” are constructed from gene products. |

Reactome |
Human genes are assigned into curated pathways. These are organized hierarchically into large pathway groups like “cell cycle.” Specific reactions like “Orc3 associates with Orc2” falls under cell cycle. |

that we want there to be a 1 in 20 chance of seeing a gene list this enriched (or more) under the null hypothesis (where there was truly no enrichment). Consider what happens if we now have 20 different annotations to test: although each one of them *alone* has a 1/20 chance of getting a P-value of *P* < 0.05, because we did 20 tests, *overall* we expect one of them to have *P* < 0.05 under the null hypothesis (when there is no enrichment!). More generally, for any P-value threshold, a, if we do *m* tests, we expect about *m* times a tests to pass it under the null hypothesis. Another way to think about this is to remember that the distribution of P-values under the null hypothesis is uniform between 0 and 1. 0.05 covers 1/20 of this distance. This means that if we randomly draw *m* P-values from the null distribution, we expect about *m*/20 will fall below 0.05.

You can see immediately that in the context of many tests (or multiple testing) there will be a problem with our conventional use of the *P*-value. The simplest solution to this problem is the so-called Bonferroni correction to the *P*-value. The idea here is to use what we know about the distribution of *P*-values under the null hypothesis to correct for the number of tests we’ve done.

Since we know the probability distribution of the P-values under the null hypothesis is uniform, we simply *rescale* the region that corresponds to a “significant” P-value to be 0.05 *taking into account* the fact that we have done *m* tests. The new P-value threshold is simply 0.05/m and now there will only be a 5% chance of observing a P-value that small under the null hypothesis. Alternatively, we can achieve the same effect by rescaling the P-values themselves or “correct” the P-values by simply multiplying by the number of tests: *A* P-value, *P,* becomes *m* x *P.* To keep the P-values as probabilities, any *P*-value that gets bigger than one is simply set to be one. Formally, the correction can therefore be written as

where min[] indicates the minimum of whatever is in the square brackets. Now, no matter how many tests we do, we will only have a 5% chance of observing a P-value below 0.05 under the null hypothesis. The Bonferroni correction is widely used in practice because it is so simple to calculate. The only tricky part is knowing exactly what number of tests you really did (this is not always strictly equal to the number of tests you actually performed using your statistics software).

To illustrate just how tricky multiple testing can be, let’s consider a realistic example. Let’s say you were given a gene list with five mouse genes that came out of genetic screen. Not knowing anything else about the genes, you look at the GO annotations and notice that three of the genes are annotated with the GO term GO:0001525, angiogenesis. Since there are only 218 other mouse genes annotated with this term out of 24,215 mouse genes, this looks like a dramatic enrichment: 60% of your list is associated with angiogenesis, and less than 1% of the mouse genes are. Indeed, the exact test described in Chapter 2 for gene set enrichment seems to support this: P_{HyP}(>3|218,5,24,215) = 7 x 10^{-6}.

However, when we did this test we had no hypothesis, so we would have accepted enrichment of any GO annotations. And since there are *m* = 16,735 different GO annotations assigned to mouse genes, the Bonferroni correction tells us we should multiply the P-value by 16,735. I have

This *P*-value does not pass a nominal 0.05 threshold after the Bonferroni correction. This means that if we choose random lists of five genes 20 times, we’re likely to see a list of five genes that has an enrichment as strong as what we observed for the GO term GO:0001525, angiogenesis.