 # EXAMPLE OF LRT FOR THE MULTINOMIAL: GC CONTENT IN GENOMES

The LRT is very general and powerful, but this also makes it a bit abstract. To illustrate the use of an LRT to test a specific multivariate hypothesis, let’s test for differences in GC content between human chromosomes. I counted the numbers of A, C, G, and T on the human X and Y chromosomes, and obtained the following data (Table 4.2):

Since these are counts of discrete categories (there’s no numerical order to A, C, G, and T), if we assume A, C, G, and Ts in the genome are i.i.d., we can model this data using the multinomial distribution. Let’s go ahead and estimate the proportions by maximum likelihood using the formulas given earlier (except that I’ve switched the variable name from X to Z to avoid confusions between the random variable X and the chromosome X) Similar calculations give the MLEs for the parameters of two chromosomes Obviously, these proportions are similar, but not exactly the same. Are these differences statistically significant? The LRT will help us decide.

The null hypothesis is that the X and Y chromosomes have the same proportions of the four nucleotides, while the alternative hypothesis is that they actually have different proportions. To calculate, the LRT have to maximize the likelihood under the two hypotheses. This means we have to calculate the likelihood (or log-likelihood) when the parameters

TABLE 4.2 Numbers of Each Nucleotide on the Human X and Y Chromosomes

 A C G T X chromosome 45,648,952 29,813,353 29,865,831 45,772,424 Y chromosome 7,667,625 5,099,171 5,153,288 7,733,482

are the MLEs. In this case, it’s easiest to start with the alternative hypothesis: we believe that the X and Y chromosomes have different proportions of the four bases, so we allow each chromosome to have four different parameters, so the MLEs are just the proportions we calculated earlier.

Under the null hypothesis, we believe that the two chromosomes have the same proportions of nucleotides, so that a single set of parameters can explain all of the data. We estimate these simply by pooling all of the data: where I’ve used the 0 to indicate that these are the parameters under the null hypothesis H0. We can now go ahead and calculate the log-likelihood ratio, which is (see Exercises) Do your best to navigate the notation here. I’m using Z to represent the numbers of nucleotides actually observed. In practice, we never want to try to calculate 0.3 45,648,952, we’ll actually carry the log through and convert the products to sums and the exponents to products: We can now go ahead and compute the LRT test statistic, which is just -2 times this, or about 2295. We now have to look up this value of the test statistic in the null distribution, which is the chi square, with df equal to the difference in the number of free parameters between the two hypotheses.

It can be a bit tricky to figure this out: for example, it looks like we estimated four parameters for our null hypothesis and eight parameters for our alternative hypothesis (the fs in each case). However, as I mentioned earlier, the parameters of a multinomial have to add up to 1, so in fact, once we choose three of them, the fourth one is automatically specified. Therefore, in this case, there are really only three free parameters in our null hypothesis and six free parameters in the alternative hypothesis. As it turns out, it doesn’t matter in this case because 2295 is way beyond what we would expect from either of these distributions: the P-value is approximately 10-495. So we have extremely strong support for the alternative hypothesis against the null hypothesis.

So in this example, I found extremely strong statistical support for very tiny differences in the GC content between the human X and Y chromosomes. I think this is a good example of the kinds of results that are easy to obtain with very large genomic data, but need to be interpreted with caution. Although we can come up with lots of hypotheses for why these tiny differences have biological significance, before we do that, I think it’s important to remember that every statistical hypothesis test has assumptions. Let’s revisit out assumptions: the null distribution of the LRT assumed that the sample size was approaching infinity. Well, actually we had millions of datapoints, so that assumption seems ok. What about the use of the multinomial distribution in the first place? We assumed that the individual letters in the genome were i.i.d. In fact, we know that assumption is almost certainly wrong: GC isocores, CpG biased mutations, repeat elements, homopolymeric runs all make the positions in the genome very highly correlated. In fact, a quick calculation shows that on the Y chromosome, there are nearly 50% more As when the preceding nucleotide is an A than when the preceding nucleotide is a T, even though overall A and T both appear very close to 30% of the time.

Thus, the highly significant (but tiny) differences that we see between the X and Y chromosomes could be biological, but they could also result from an unrealistic assumption we made when we designed the statistical test. As datasets become larger, we have more and more power to detect interesting biological effects, but we also have more power to obtain highly significant results that are due to unrealistic assumptions. Perhaps the best way to check if your statistical test is behaving as you hope is to include a biological negative control. In this example, we could perform the same test for other chromosome pairs and see if we find the same differences. Even better, we could chop up the X and Y chromosomes into large chunks and glue these back together to create random fake chromosomes. If we still find statistical differences between these, we can conclude that the differences we are seeing are artifacts of our statistical test, and not real differences between chromosomes. On the other hand, if our fake chromosomes don’t show any effects, we can begin to believe that there really are differences between the X and Y chromosomes.