So what about P-values? What I actually took away from my introductory statistics courses was that somehow if an experiment was done properly, I could plug the numbers into a formula and it would spit out a very small P-value, which meant the experiment “worked.” If the P-value wasn’t that small, then the experiment didn’t work—that was bad, because it meant that you probably had to do everything again. Although it’s easy to understand why students get this impression, this is completely wrong.

P-values are the result of a so-called “statistical hypothesis test.” For example, the t-test is specifically a test of whether the means of two lists of numbers are different. In “statistical hypothesis testing” (the technical term for performing t-tests and other tests), the hypothesis that you are always testing is the so-called “null hypothesis.” It’s very important to realize that the “null hypothesis” that you are testing statistically is not usually the same as your actual (scientific) hypothesis. In fact, it’s often exactly the opposite of the scientific hypothesis. In the case of the t-test, the null hypothesis is that the two lists of numbers truly have the same mean. Usually, the scientific hypothesis is that the two lists of numbers (e.g., describing mutant and wildtype) are different. Comparing lists of numbers is by far the most common use of statistical tests in molecular biology, so we’ll focus on those kinds of tests for the rest of this chapter. We’ll discuss other hypothesis tests as they arise throughout the book.

Traditionally, hypothesis tests (such as the t-test) are set up in the following way. Define a “test statistic” (some function of the data) whose distribution is known under the null hypothesis, and look up the observed value of this test statistic in this so-called “null distribution” to find the probability of observing a value at least that extreme if the null hypothesis were true. Since finding a function of the data whose distribution was known (and possible to calculate by hand) was very unusual, inventing these tests was a great achievement. Statistical tests were therefore usually named after the distribution of the test statistic under the null hypothesis (the null distribution) and/or after the statistician who discovered the test statistic (and sometimes a combination of the two). So the null distribution for the (Student’s) t-test is the Student’s t-distribution because the test was proposed by a guy who called himself “Student” as a pseudonym.

Using the “rules of probability,” we can write all this a bit more formally:

• H0 is the null hypothesis, which can be true or not.
• The observations (or data) are X1, X2, ..., XN, which we will write as a vector X.
• t is a test statistic, t = f(X), where f represents some defined function.
• The null distribution is therefore P(t|H0 is true).
• t* is the observed value of the test statistic (for the set of data we have).
• The P-value is P(t is “as or more extreme” than t*|H0 is true), or P(t > t*|H0) in short.
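The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression values are invented, the test statistic f(X) is the pooled-variance two-sample t-statistic, and the null distribution is estimated by shuffling the group labels (a permutation test) rather than looked up in the Student’s t-distribution. “As or more extreme” is read here as two-sided, |t| ≥ |t*|.

```python
import random
import statistics


def t_statistic(a, b):
    """The test statistic t = f(X): a pooled-variance two-sample t-statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Estimate P(|t| >= |t*| | H0) by shuffling the group labels."""
    rng = random.Random(seed)
    t_star = t_statistic(a, b)          # observed value of the test statistic
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)             # under H0 the labels are exchangeable
        t_null = t_statistic(pooled[:len(a)], pooled[len(a):])
        if abs(t_null) >= abs(t_star):
            extreme += 1
    # Adding one to both counts keeps the estimate from being exactly zero.
    return t_star, (extreme + 1) / (n_shuffles + 1)


# Hypothetical expression values, invented for illustration only.
mutant = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
wildtype = [3.9, 4.2, 4.0, 4.4, 3.7, 4.1]

t_star, p_value = permutation_p_value(mutant, wildtype)
print(f"t* = {t_star:.3f}, estimated P-value = {p_value:.4f}")
```

The shuffled t-statistics play the role of P(t|H0 is true), and the P-value is simply the fraction of them that are at least as extreme as t*.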

Given these definitions, we note that the distribution of P-values is also known under the null hypothesis:

• P(P-value < p|H0 is true) = p
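This property is easy to check by simulation. The sketch below makes some assumptions for simplicity: the data are drawn from a Gaussian with mean 0 and known standard deviation 1, so the null hypothesis really is true, and the test is a two-sided z-test (swapped in for the t-test because its P-value can be computed with the standard library alone). The function name `z_test_p_value` is ours.

```python
import math
import random


def z_test_p_value(sample, sigma=1.0):
    """Two-sided P-value for H0: mean = 0, assuming a known sd of sigma."""
    mean = math.fsum(sample) / len(sample)
    z = mean / (sigma / math.sqrt(len(sample)))
    # For a standard normal z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(1)
# 10,000 simulated datasets of 10 observations each, all generated under H0.
p_values = [z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(10)])
            for _ in range(10000)]

for p in (0.01, 0.05, 0.5):
    fraction = sum(pv < p for pv in p_values) / len(p_values)
    print(f"fraction of P-values below {p}: {fraction:.3f}")
```

Each printed fraction should be close to the threshold p itself, up to sampling noise.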

In other words, under the null hypothesis, the P-value is a random variable that is uniformly distributed between 0 and 1. This very useful property of P-values will come up in a variety of settings. One example is “Fisher’s method” for combining P-values. Given several different tests with P-values p1, p2, ..., pn, you can combine them into a single P-value.

The test statistic

$$-2 \sum_{i=1}^{n} \ln(p_i)$$

has a known distribution if p1, p2, ..., pn are uniformly distributed on [0, 1] and i.i.d. (It turns out to be chi-squared distributed with df = 2n.) This type of test (a test on the P-values of other tests) is also called “meta-analysis” because you can combine the results of many analyses this way.
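As a rough sketch, Fisher’s method takes only a few lines. The code assumes the usual form of the statistic, minus two times the sum of the log P-values, and uses the closed-form upper tail of the chi-squared distribution that holds when the degrees of freedom are even (which is always the case here, since df = 2n). The input P-values are hypothetical, and the function names are ours.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) of a chi-squared variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0              # the k = 0 term of the sum
    for k in range(1, df // 2):
        term *= half / k                # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent P-values (each > 0) into a single P-value."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))


# Hypothetical P-values from three independent tests.
statistic, combined = fisher_combine([0.04, 0.10, 0.20])
print(f"statistic = {statistic:.3f}, combined P-value = {combined:.4f}")
```

A useful sanity check is that combining a single P-value returns that same P-value, because with df = 2 the upper tail is just exp(-x/2).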

To illustrate the idea of hypothesis testing, let’s take a real example based on gene expression data from ImmGen (Heng et al. 2008). Stem cells are expected to be rapidly dividing, so we might expect them to express genes involved in DNA replication, like Cdc6. Other cells, since they are differentiated and not duplicating their DNA, might be expected to show no (or low) expression of DNA replication genes like Cdc6 (Figure 2.4, top left). The null hypothesis is that stem cells and other cells show the same average Cdc6 expression level. Notice that this null hypothesis is actually the opposite of our biological hypothesis. We can do a t-test (on the expression levels from stem cells vs. the expression levels from other cells) and calculate a P-value to test the hypothesis. In this case, the t-statistic is -5.2979. If we look up the probability of observing that t-statistic value or a more extreme one in the null distribution (the known distribution of the t-statistic given that the null hypothesis is true, which turns out to be a t-distribution), we get a P-value of 0.0001317. This means that if stem cells and other cells really had the same average expression level, we would very rarely see a difference this large, which is great! Our data therefore reject the null hypothesis, if the assumptions of the test are true.

And this brings up the caveat to all P-value calculations: these P-values are only accurate as long as the assumptions of the test are valid. Each statistical test has its own assumptions, and this is just something that you always have to worry about when doing any kind of data analysis. With enough data, we can usually reject the null hypothesis because we will have enough statistical power to identify small deviations between the real data and the assumptions of our test. In addition to the P-value, it’s therefore always important to consider the size of the effect that you’ve identified: if you have millions of datapoints, you might obtain a tiny P-value for a difference in mean expression of only a few percent. We’ll revisit this when we test for nucleotide content differences on chromosomes in Chapter 4.
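To see how sample size and effect size interact, here is a small simulation. Everything in it is an assumption made for illustration: the two groups are simulated Gaussians whose means differ by 2% (102 vs. 100, standard deviation 20), and the P-value uses the large-sample normal approximation to the t-statistic rather than the t-distribution itself.

```python
import math
import random


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = math.fsum(values) / n
    variance = math.fsum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def large_sample_test(a, b):
    """Return (relative difference in means, two-sided P-value).

    The P-value treats the statistic as standard normal, which is only
    reasonable when both samples are large.
    """
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    z = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / mean_b, math.erfc(abs(z) / math.sqrt(2))


def simulate(n, rng):
    """Simulate n datapoints per group with a true 2% difference in means."""
    a = [rng.gauss(102.0, 20.0) for _ in range(n)]
    b = [rng.gauss(100.0, 20.0) for _ in range(n)]
    return large_sample_test(a, b)


rng = random.Random(0)
for n in (100, 100_000):
    relative_difference, p_value = simulate(n, rng)
    print(f"n = {n}: difference = {relative_difference:+.1%}, P-value = {p_value:.3g}")
```

The true effect is the same 2% in both runs; only the amount of data changes. With 100,000 datapoints per group the P-value becomes tiny even though the difference in means is small, which is why the effect size should be reported alongside the P-value.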

The t-test assumes that the data are normally distributed (Figure 2.4, right panel), but turns out to be reasonably robust to violations of that assumption—the null distribution is still right when the data are only approximately normally distributed. The t-test tests for differences in the mean. Consider the data for CD4 antigen gene expression (also from ImmGen) in Figure 2.4. Clearly, the CD4 expression levels in T cells are more likely to be high than in the other cells, but the distribution is strongly bimodal—clearly not Gaussian (compare to the theoretical distribution in Figure 2.4). Although this probably does violate the assumptions of the t-test, the more important issue is that only testing for a difference in the mean expression level is probably not the best way to detect the difference.

FIGURE 2.4 Testing hypotheses in practice with gene expression data from ImmGen. In each plot, the fraction of cells is plotted on the vertical axis as a function of the gene expression level on the horizontal axis. Real (Cdc6 and CD4) expression patterns (top left) differ from what is expected under a Gaussian model (top right). In the case of Cdc6, there is still a clear difference in the means, so a t-test would be just fine. In the case of CD4 (bottom left), there is a bimodal distribution, and a nonparametric test would probably work better.
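The text does not say which nonparametric test is meant, so as an assumption the sketch below uses one common choice, the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), which compares the ordering of the values rather than their means. The expression values are invented to look bimodal, and the P-value uses a simple normal approximation with no tie or continuity correction, so treat it as illustrative only.

```python
import math


def mann_whitney_u(a, b):
    """Return (U, two-sided P-value) using the normal approximation for U."""
    # U counts the pairs (x, y) with x from a and y from b where x > y;
    # a tied pair counts as one half.
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mean_u = len(a) * len(b) / 2
    sd_u = math.sqrt(len(a) * len(b) * (len(a) + len(b) + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical expression values, invented for illustration only.
# The first group is bimodal: a few low values and many high ones.
t_cells = [0.22, 0.41, 0.33, 8.7, 8.8, 9.1, 9.3, 9.5, 9.6, 9.9]
other_cells = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70]

u, p_value = mann_whitney_u(t_cells, other_cells)
print(f"U = {u:.0f}, approximate P-value = {p_value:.4f}")
```

Because the statistic depends only on which values are larger than which, it makes no assumption that the data are Gaussian.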