# HYPOTHESIS TESTING REVISITED: THE PROBLEMS WITH HIGH DIMENSIONS

Since we’ve agreed earlier in this chapter that what biologists are usually doing is testing hypotheses, we usually think much more about our hypothesis tests than about our objective functions. Indeed, as we’ve seen already, it’s even possible to do hypothesis testing without specifying parameters or objective functions (nonparametric tests).

Although I said that statistics has a straightforward generalization to high dimensions, in practice one of the most powerful and useful ideas from hypothesis testing, namely, the P-value, does not generalize very well. This has to do with the key idea that the P-value is the probability of observing something as extreme or more extreme than what was actually observed. In high-dimensional space, it’s not clear which direction the “or more” is in. For example, if you observed three genes’ average expression levels (7.32, 4.67, 19.3) and you wanted to know whether this was the same as these genes’ average expression levels in another set of experiments (8.21, 5.49, 5.37), you could try to form a three-dimensional test statistic, but it’s not clear how to sum up the values of the test statistic that are more extreme than the ones you observed: you have to decide in which direction(s) to do the sum. Even if you decide in which direction to sum along each dimension, performing these multidimensional sums becomes impractical as the number of dimensions grows.

The simplest way to deal with hypothesis testing in multivariate statistics is just to do a univariate test on each dimension and pretend they are independent. If any dimension is significant, then (after correcting for the number of tests) the multivariate test must also be significant (Mardia et al. 1976). In fact, that’s what we were doing in Chapter 3 when we used Bonferroni to correct for the number of tests in the gene set enrichment analysis. Even if the tests are not independent, this treatment is conservative, and in practice, we often want to know in which dimension the data differed. In the case of gene set enrichment analysis, we don’t really care whether “something” is enriched; we want to know what exactly the enriched category is.
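As a minimal sketch of this strategy, the following applies a Bonferroni correction to a list of per-dimension P-values and calls the multivariate test significant if any dimension survives. The function name `bonferroni_multivariate` and the three P-values are made up for illustration; in practice the P-values would come from whatever univariate test suits each dimension.

```python
def bonferroni_multivariate(p_values, alpha=0.05):
    """Combine per-dimension P-values into one multivariate decision.

    Returns (adjusted P-values, indices of significant dimensions,
    overall decision).
    """
    m = len(p_values)
    # Bonferroni: multiply each P-value by the number of tests (capped at 1)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [i for i, p in enumerate(adjusted) if p <= alpha]
    # The multivariate test is significant if any single dimension is
    return adjusted, significant, bool(significant)


# Hypothetical P-values from univariate tests on three dimensions
p_values = [0.20, 0.04, 0.001]
adjusted, significant, reject = bonferroni_multivariate(p_values)
print(adjusted, significant, reject)
```

Note that the second dimension (P = 0.04) would pass an uncorrected 0.05 threshold but does not survive the correction, whereas the third dimension does, and it also tells us where the data differed.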

However, there are some cases where we might not want to simply treat all the dimensions independently. A good example of this might be a time course of measurements or measurements that are related in some natural way, like length and width of an iris petal. If you want to test whether one sample of iris petals is bigger than another, you probably don’t want to test whether the length is bigger and then whether the width is bigger. You want to combine both into one test. Another example might be if you’ve made pairs of observations and you want to test if their ratios are different, but the data include a lot of zeros, so you can’t actually form the ratios. One possibility is to create a new test statistic and generate some type of empirical null distribution (as described in the first chapter).
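One way to sketch the “new test statistic plus empirical null distribution” idea is a permutation test. The example below combines petal length and width into a single statistic (the sum of the differences in means) and builds the null distribution by shuffling the sample labels. The measurements, the choice of statistic, and the function names are all assumptions made for illustration, not a recommended analysis; in particular, summing raw differences gives more weight to the dimension with the larger scale.

```python
import random


def combined_stat(a, b):
    """Sum over dimensions of (mean of sample a) - (mean of sample b)."""
    total = 0.0
    for d in range(len(a[0])):
        mean_a = sum(row[d] for row in a) / len(a)
        mean_b = sum(row[d] for row in b) / len(b)
        total += mean_a - mean_b
    return total


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """One-sided P-value: is sample a 'bigger' than sample b overall?"""
    rng = random.Random(seed)
    observed = combined_stat(a, b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        # Shuffle whole rows so length and width stay paired
        rng.shuffle(pooled)
        if combined_stat(pooled[:len(a)], pooled[len(a):]) >= observed:
            count += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0
    return observed, (count + 1) / (n_perm + 1)


# Made-up (length, width) petal measurements in cm for two samples
sample_a = [(5.1, 1.9), (5.9, 2.1), (5.6, 1.8), (5.8, 2.2), (6.0, 2.5), (5.5, 1.8)]
sample_b = [(4.5, 1.5), (4.7, 1.4), (4.0, 1.3), (4.6, 1.5), (4.5, 1.3), (4.4, 1.4)]

observed, p_value = permutation_p_value(sample_a, sample_b)
print(observed, p_value)
```

Because the labels are shuffled on whole rows, the correlation between length and width is preserved in the null distribution, which is exactly what the per-dimension approach ignores.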

However, another powerful approach is to formulate a truly multivariate hypothesis test: a likelihood ratio test or LRT for short.

Formally,

• The observations (or data) are $X_1, X_2, \ldots, X_n$, which we will write as a vector $X$.
• $H_0$ is the null hypothesis, and $H_1$ is another hypothesis. The two hypotheses make specific claims about the parameters in each model. For example, $H_0$ might state that $\theta = \varphi$, some particular values of the parameters, while $H_1$ might state that $\theta \neq \varphi$ (i.e., that the parameters are anything but $\varphi$).
• The likelihood ratio test statistic is $-2 \log \frac{p(X \mid H_0)}{p(X \mid H_1)}$, where any parameters that are not specified by the hypotheses (so-called free parameters) have been set to their maximum likelihood values. (This means that in order to perform a likelihood ratio test, it is necessary to be able to obtain maximum likelihood estimates, either numerically or analytically.)
• Under the null hypothesis, as the size of the dataset goes to infinity, the distribution of the likelihood ratio test statistic approaches a chi-squared distribution, with degrees of freedom equal to the difference in the number of free parameters between the two hypotheses (Mardia et al. 1976). At small sample sizes, the distribution of the test statistic might not be well approximated by the chi-squared distribution, so for many common applications of the LRT, corrections are used.
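To make the recipe concrete, here is a minimal sketch of an LRT under some simplifying assumptions that are not part of the text above: each dimension is modeled as an independent Gaussian, $H_0$ fixes the means at $\varphi$ and leaves the variances free, and $H_1$ leaves both means and variances free. The replicate measurements and function names are made up for illustration; the hypothesized means reuse the first two expression levels from the earlier example. With two dimensions the test has 2 degrees of freedom, for which the chi-squared tail probability has the closed form $e^{-x/2}$; for other degrees of freedom one would normally use a library routine instead.

```python
import math


def gaussian_loglik(values, mean, var):
    """Log-likelihood of values under a Gaussian with the given mean and variance."""
    n = len(values)
    squared_error = sum((v - mean) ** 2 for v in values)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)


def lrt_mean_equals(data, phi):
    """LRT of H0: mean = phi against H1: mean unrestricted (variances free in both)."""
    n = len(data)
    loglik_h0 = 0.0
    loglik_h1 = 0.0
    for d in range(len(phi)):
        col = [row[d] for row in data]
        # H1: mean and variance are both free, so use their ML estimates
        mean_hat = sum(col) / n
        var_h1 = sum((v - mean_hat) ** 2 for v in col) / n
        # H0: mean is fixed at phi[d]; the variance is still free (ML estimate)
        var_h0 = sum((v - phi[d]) ** 2 for v in col) / n
        loglik_h0 += gaussian_loglik(col, phi[d], var_h0)
        loglik_h1 += gaussian_loglik(col, mean_hat, var_h1)
    stat = -2.0 * (loglik_h0 - loglik_h1)
    df = len(phi)  # H1 has 2 free parameters per dimension, H0 has 1
    return stat, df


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared distribution with exactly 2 df."""
    return math.exp(-x / 2.0)


# Made-up replicate measurements of two genes (rows are experiments)
rows = [(7.1, 4.9), (7.6, 4.2), (6.9, 5.1), (7.8, 4.6),
        (7.3, 4.4), (7.0, 5.0), (7.5, 4.7), (7.4, 4.5)]
phi = (8.21, 5.49)  # hypothesized means under H0

stat, df = lrt_mean_equals(rows, phi)
p_value = chi2_sf_df2(stat)  # valid only because df == 2 here
print(stat, df, p_value)
```

With only eight observations, the chi-squared approximation is rough at best, which is the small-sample caveat from the last bullet.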

The idea of the likelihood ratio test is that when two hypotheses (or models) describe the same data using different numbers of parameters, the one with more free parameters will always achieve a likelihood at least as high, and usually slightly higher, because it can fit the data better. However, the amazing result is that (if the sample size is large enough) the improvement in fit that is simply due to chance is predicted by the chi-squared distribution (which takes only positive values). If the model with more free parameters improves the fit by more than is expected by chance, then we should prefer that model.
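This claim can be checked by simulation. The sketch below generates data for which the null hypothesis is true, computes the LRT statistic each time, and compares the results with what a chi-squared distribution with 2 degrees of freedom predicts. It assumes the same illustrative setup of two independent Gaussian dimensions with free variances, for which the statistic simplifies to $n \sum_d \log(\hat{\sigma}^2_{0,d} / \hat{\sigma}^2_{1,d})$; the sample size, number of simulations, and seed are arbitrary choices.

```python
import math
import random


def lrt_stat(data, phi):
    """LRT statistic for H0: mean = phi, independent Gaussian dimensions."""
    n = len(data)
    stat = 0.0
    for d in range(len(phi)):
        col = [row[d] for row in data]
        mean_hat = sum(col) / n
        var_h1 = sum((v - mean_hat) ** 2 for v in col) / n  # mean free
        var_h0 = sum((v - phi[d]) ** 2 for v in col) / n    # mean fixed at phi
        stat += n * math.log(var_h0 / var_h1)
    return stat


def simulate_null(n=100, n_sims=2000, phi=(0.0, 0.0), seed=1):
    """Draw datasets for which H0 is true and collect the LRT statistics."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_sims):
        data = [tuple(rng.gauss(m, 1.0) for m in phi) for _ in range(n)]
        stats.append(lrt_stat(data, phi))
    return stats


stats = simulate_null()
# 95th percentile of chi-squared with 2 df, from its tail formula exp(-x/2)
critical = -2.0 * math.log(0.05)
frac_above = sum(s > critical for s in stats) / len(stats)
mean_stat = sum(stats) / len(stats)
print(frac_above, mean_stat)
```

If the chi-squared approximation holds, roughly 5% of the simulated statistics should exceed the critical value and their average should be close to 2, the mean of a chi-squared distribution with 2 degrees of freedom. The more complex model always fits at least as well, but only by about as much as chance predicts.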

The likelihood ratio test is an example of a class of techniques that are widely used in machine learning to decide if adding more parameters to make a more complex model is “worth it” or if it is “overfitting” the data with more parameters than are really needed. We will see other examples of techniques in this spirit later in this book.