# EXACT TESTS AND GENE SET ENRICHMENT ANALYSIS

Another important class of hypothesis tests that can be used to handle data with nonstandard distributions is related to Fisher’s exact test. These types of tests can be applied to continuous data by first discretizing the data. For example, we can classify the CD4 data into “high” expression cells and “low” expression cells by choosing a cutoff. If we say cells with expression above 256 are “high” and cells with expression below 256 are “low” expression cells, we can form a two-by-two table (Table 2.1).

TABLE 2.1 Numbers of Cells with High or Low CD4 Expression Levels

| | “High” (>256) | “Low” (<256) |
|---|---|---|
| T cells | 24 | 27 |
| Other cells | 17 | 146 |

Note that there is a difference between T cells and the other cells: T cells are about 50% “high” expression, and the other cells are less than 20% “high” expression. We now need to calculate the probability of having observed a difference in fraction (or ratio) at least that big under the null hypothesis that the two groups really have the same fraction (or ratio). A famous test for a table like this is Pearson’s chi-squared test, which computes a test statistic based on the observed and expected numbers in each cell. In Pearson’s test, however, the numbers in each cell have to be “large enough” so that the approximation for the null distribution starts to work. Traditionally, Pearson’s test was used because it was easy to compute by hand, whereas Fisher’s exact test was difficult (or impossible) to compute by hand. However, since we now use computers to do these tests, we can always use Fisher’s exact test on this kind of data, which makes no assumptions about the numbers in each cell.
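To make the “observed versus expected” idea concrete, here is a minimal sketch of Pearson’s statistic for the counts in Table 2.1, using only the Python standard library. It assumes no continuity correction, and it uses the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal, so its tail probability can be written with `erfc`. The variable names are illustrative, not taken from any package.

```python
import math

# Counts from Table 2.1: rows are cell types, columns are "high" / "low".
observed = [[24, 27],    # T cells
            [17, 146]]   # other cells

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Expected count in each cell if "high"/"low" were independent of cell type.
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# With 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_approx = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-squared = {chi2:.2f}, approximate P-value = {p_approx:.2g}")
```

The P-value here is only an approximation that relies on the counts being large enough, which is exactly the assumption that Fisher’s exact test avoids.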

Unlike traditional hypothesis tests, exact tests don’t really have a “test statistic” function that is computed based on the data. Instead, Fisher’s test assigns a probability to every possible configuration of numbers in the table under the null hypothesis that the numbers in each row and column were randomly sampled from the total pool of observations (using a formula that I won’t reproduce here). To compute the P-value, one adds up the probabilities of all the configurations that are as unlikely as, or more unlikely than, the observed configuration. Thus, the P-value is the probability of observing a configuration as unlikely (or more unlikely) under the null hypothesis. Computing this in practice requires a reasonably clever way to sum up the very large number of configurations, and modern statistics packages might do this in various ways.
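For a two-by-two table the enumeration is small enough to write out directly. The sketch below is one simple way to do it, not how any particular statistics package implements the test. It assumes the row and column totals are held fixed, so each configuration is determined by its top-left cell, and it uses exact fractions so that comparing probabilities is not affected by rounding. The function name `fisher_exact_two_sided` is hypothetical.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def config_prob(x):
        # Probability of the table with x in the top-left cell, given the
        # fixed row and column totals (a hypergeometric probability).
        return Fraction(comb(col1, x) * comb(total - col1, row1 - x),
                        comb(total, row1))

    # Range of top-left values compatible with the fixed totals.
    lo = max(0, row1 + col1 - total)
    hi = min(row1, col1)

    p_observed = config_prob(a)
    # Add up every configuration that is as unlikely as, or more unlikely
    # than, the observed one.
    p_value = Fraction(0)
    for x in range(lo, hi + 1):
        p = config_prob(x)
        if p <= p_observed:
            p_value += p
    return float(p_value)


# Table 2.1: T cells (24 high, 27 low) versus other cells (17 high, 146 low).
p_cd4 = fisher_exact_two_sided(24, 27, 17, 146)
print(f"Fisher's exact P-value = {p_cd4:.3g}")
```

Brute-force enumeration like this is fine for small tables; packages typically use faster or more numerically careful schemes for large counts.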

A similar exact test is used very often in “Gene Set Enrichment Analysis” (Subramanian et al. 2005). In this context, one has identified a list of genes in a molecular biology experiment (a gene set) and wants to test whether the list of genes is random or whether the experiment has identified genes with specific biology associated (“enriched”). Initially, gene lists usually came from clustering of gene expression data, often from microarrays. Gene Set Enrichment Analysis was used to show enrichment of specific biological function for genes with similar expression patterns (with similarity measured in a high-dimensional space). We will return to this type of data analysis in Chapter 5.

Nowadays, Gene Set Enrichment Analysis is used on gene lists that arise from all types of data analysis. For example, we recently used the test in a bioinformatics project in my lab. We developed a new method to predict protein kinase substrates based on amino acid sequences (Lai et al. 2012) and wanted to know whether the set of predicted substrates contained more previously known substrates than expected by chance. We predicted 46 Mec1 substrates, and of these 7 were already known to be substrates. Considering there were only 30 known substrates in the databases, and we analyzed 4219 proteins in total, we thought we were doing pretty well: Our list had 15% known substrates, while fewer than 1% of proteins are known to be substrates. A little more formally:

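• We have a sample, $X_1, X_2, \ldots, X_n$, where each observation can either have a certain property or not, which we denote as “positives” ($X_i = 1$) or “negatives” ($X_i = 0$).

• The probability of observing $k = \sum_{i=1}^{n} X_i$ positives if $X$ was a random sample from some finite pool, $Y_1, Y_2, \ldots, Y_m$, with $l = \sum_{i=1}^{m} Y_i$ positives total is given by the hypergeometric distribution

$$P_{HYP}(k \mid n, l, m) = \frac{\binom{l}{k}\binom{m-l}{n-k}}{\binom{m}{n}}$$

• The P-value is

$$P(\geq k \mid H_0) = \sum_{i=k}^{\min(n,l)} P_{HYP}(i \mid n, l, m) = 1 - \sum_{i=0}^{k-1} P_{HYP}(i \mid n, l, m),$$

where the last equality, $P(\geq k \mid H_0) = 1 - P(< k \mid H_0)$, is just used to make the calculations easier.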
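As a rough illustration of the P-value calculation, the sketch below computes the hypergeometric probability for a sample of size n drawn from a pool of m observations containing l positives, and checks on a small case that summing the upper tail directly gives the same answer as one minus the lower tail. The function names and the toy numbers are hypothetical, chosen only so the result is easy to verify by hand.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(i, n, l, m):
    """Probability of i positives in a sample of n from a pool of m with l positives."""
    return Fraction(comb(l, i) * comb(m - l, n - i), comb(m, n))


def enrichment_p_value(k, n, l, m):
    """P(>= k | H0), computed as 1 - P(< k | H0)."""
    return 1 - sum((hypergeom_pmf(i, n, l, m) for i in range(k)), Fraction(0))


# Toy numbers (hypothetical): 3 positives in a sample of 5,
# drawn from a pool of 20 that contains 4 positives.
k, n, l, m = 3, 5, 4, 20

# Direct sum over the upper tail, i = k, ..., min(n, l).
direct = sum((hypergeom_pmf(i, n, l, m) for i in range(k, min(n, l) + 1)),
             Fraction(0))

print(float(direct), float(enrichment_p_value(k, n, l, m)))
```

The complement form is convenient when k is small, because it needs only k terms regardless of how large n and l are.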

In practice, this type of exact test can usually only be calculated using a reasonably clever statistics package, because to calculate binomial coefficients such as

$$\binom{m}{n} = \frac{m!}{n!\,(m-n)!}$$

in many examples requires computing very large factorials (e.g., $4219! = 1 \times 2 \times 3 \times \cdots \times 4218 \times 4219$ is too large to be stored as a standard floating-point number on a computer).
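One common workaround, sketched here as an assumption about how one might do this rather than a description of any specific package, is to work with logarithms of factorials through the log-gamma function, since $\ln(n!) = \ln\Gamma(n+1)$. The helper names below are hypothetical. The example also shows that Python can hold 4219! as an arbitrary-precision integer, but converting it to an ordinary floating-point number fails.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_hypergeom_pmf(i, n, l, m):
    """Log of the hypergeometric probability, without forming any factorial."""
    return log_binom(l, i) + log_binom(m - l, n - i) - log_binom(m, n)


# 4219! does not fit in a double-precision float ...
try:
    float(math.factorial(4219))
    overflowed = False
except OverflowError:
    overflowed = True

# ... but its logarithm is an ordinary number.
log_factorial = math.lgamma(4219 + 1)

print(overflowed, round(log_factorial, 1))
print(math.exp(log_hypergeom_pmf(7, 46, 30, 4219)))
```

Working in log space trades a small amount of rounding error for the ability to handle pools of essentially any size.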

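In the example, we want to calculate the probability of getting 7 or more known substrates in 46 predictions, when there are 30 known substrates in the database of 4219 proteins. By chance we would expect only about 46 × 30/4219 ≈ 0.33 known substrates in the list. This works out to

$$P(\geq 7 \mid H_0) = 1 - \sum_{i=0}^{6} P_{HYP}(i \mid 46, 30, 4219)$$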

Needless to say, this is a very small number, supporting the idea that the list of Mec1 predictions was very unlikely to overlap this much with the known substrates by chance. Thus, the gene set was “enriched” for previously known substrates.
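The Mec1 calculation can be reproduced with a few lines of standard-library Python. This is a sketch that uses the numbers quoted in the text and exact integer arithmetic, so the very small tail probability is not lost to rounding when subtracting from one. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n_predicted = 46     # predicted Mec1 substrates (sample size n)
k_known = 7          # predictions that were already known substrates (k)
l_known = 30         # known substrates in the database (l)
m_proteins = 4219    # proteins analyzed in total (m)

# Number of known substrates expected in the list by chance.
expected_by_chance = n_predicted * l_known / m_proteins

# P(fewer than 7 known substrates) under the hypergeometric null,
# summed exactly over i = 0, ..., 6.
denominator = comb(m_proteins, n_predicted)
p_fewer = sum(
    Fraction(comb(l_known, i) * comb(m_proteins - l_known, n_predicted - i),
             denominator)
    for i in range(k_known)
)

# P(7 or more known substrates) = 1 - P(fewer than 7).
p_value = float(1 - p_fewer)

print(f"expected by chance: {expected_by_chance:.2f}")
print(f"P(>= {k_known} known substrates) = {p_value:.2g}")
```

Doing the subtraction with exact fractions before converting to a float matters here: in floating point, one minus a number extremely close to one can lose most of its significant digits.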