# FROM HYPOTHESIS TESTING TO STATISTICAL MODELING: PREDICTING PROTEIN LEVEL BASED ON mRNA LEVEL

By far, the most common use of linear regression in molecular biology is to detect statistical associations as illustrated in the eQTL example—to reject the hypothesis that X and Y are independent (by calculating the P-value associated with a correlation coefficient). This is because we don’t usually have quantitative predictions about the relationships between biological variables—we are looking for differences between mutant and wt, but we don’t have a particular hypothesis about what that difference should be. Or we know that evolutionary rates should be slower for more important proteins, but we don’t really know how much slower.

However, regression can also be used to quantify the relationship between variables that we know are correlated. For example, the central dogma of molecular biology is: DNA makes RNA makes protein. Let’s try to turn this into a quantitative prediction (following Csardi et al. 2015): under the assumptions that (1) RNA and protein degradation are constant, and (2) the cell has reached some kind of equilibrium state, the abundance of each protein in the cell would simply be proportional to the mRNA for that protein, with the constant given by the ratio of translation to degradation rates. In this case, we aren’t really interested in rejecting the null hypothesis of independence—we know proteins depend on mRNA—instead, we want to test if there really is a simple linear relationship between protein abundance and mRNA abundance. If we take logs of both sides, we get Setting Y = log Protein and X = log mRNA, we have a regression problem where b0 is the log of the translation to degradation ratio, and if there really is a simple linear relationship between protein and mRNA levels, we expect to find b1 = 1. This means that we can use this simple regression to estimate b1, and because we know our estimate of b1 will be Gaussian with standard deviation given, we can test it is statistically different from 1, thus testing the hypothesis that there is a simple linear relationship between mRNA and protein levels.

The results of this analysis are shown in Figure 7.4. If we do the analysis on the whole dataset, we obtain a slope (b1) of 0.837 and the standard deviation of the estimate is less than 0.02. This means that we are more than 8 standard deviations away from 1. Since we know the distribution of b1 should be Gaussian, this is an incredibly unlikely observation if the slope was truly 1. However, although we can rule out 1 with large statistical confidence, 0.837 is actually not that far from 1.

Looking at the data more carefully (see Figure 7.4), it’s clear that something funky is happening with our data at high mRNA levels (log mRNA > 3). Instead of protein levels rising simply, there seem to be two classes of genes: those with high protein levels as expected (around log Protein = 10) and those with unexpectedly low protein levels (around log Protein = 8). These genes are violating not only the simple linear assumption, but even the assumption that there is any consistent relationship between mRNA and protein levels. Another way to see this is to look at the distribution of residuals. As I mentioned, the residual is the difference between the observed Ys and the predicted Y, that is, Y - E[Y|XJ. The distributions of residuals are shown at the bottom panel of the figure for the genes with typical mRNA levels and for genes with the most abundant transcripts. It’s clear from these histograms that there is a subpopulation of highly expressed genes for which the model is predicting much bigger protein levels than are observed (bump in the histogram around -5). It turns out that these highly expressed genes represent less than 5% of the data in the analysis. FIGURE 7.4 A regression model of the relationship between mRNA and protein abundance. Each gray symbol represents one of >4000 yeast genes for which at least three mRNA and protein measurements were available. (Data were collated by Csardi, G. et al., 2015.) (a) The regression model for the whole dataset, (b) the residuals of the regression with genes divided into two categories based on their abundance, and (c) the regression after removing the genes with the largest mRNA abundances.

If we remove these genes with high abundance transcripts, we obtain the analysis shown on the bottom of the figure. Doing the regression analysis on these genes gives an estimate of the slope (b1) to be 0.997 with standard deviation of 0.022. Needless to say, this is very close to the slope of 1 that we predicted, and we can say that the data is consistent with our model—in fact, we can say that for this 95% of the data, the simple linear model explains the data to about 2% accuracy. The regression analysis indicates that on average, there probably is something close to a linear relationship between protein and mRNA abundance for most genes, but that this relationship breaks down for the most highly expressed genes in the genome. Interestingly, if we continue to remove the highly expressed genes from the dataset, the estimate of the slope actually gets a little bit larger than 1, suggesting that even on average there are probably subtle deviations from the simple model.

Looking carefully at the data and plotting the distribution of residuals is a powerful general approach to evaluating regression results. If you want to do statistical hypothesis testing on a regression model, confirming that the distribution of the residuals follows your assumptions is the best way to know if your P-values (and confidence intervals on parameters) are likely to be reliable. This example also illustrates another important general point: simple regression analysis is sensitive to small numbers of datapoints (possibly outliers) disproportionately affecting the result.

Because in this analysis we included thousands of genes, we had very large statistical power to test our hypothesis. However, as with all statistical modeling, our ability to test the hypothesis depends on the assumptions. In this case, I want to draw attention to the assumptions of linear regression about noise. It is a technically challenging thing to measure the abundance for all the proteins and mRNAs in the cell. There is sure to be experimental noise in these measurements. The simple regression model we used here included a Gaussian model for noise in the log (so-called “multiplicative noise,” see Exercises) of the protein measurements, Y. However, the model we used assumes that the mRNA levels, X, are perfectly measured, which is almost certainly not right. It turns out that there are regression techniques (beyond the scope of this book) that allow noise in both the Y and X to be included in the model. In this case, that type of model turns out be more appropriate, and the story is more complicated that what I described here (Csardi et al. 2015).

## FIVE GREAT THINGS ABOUT SIMPLE LINEAR REGRESSION

• • Fast, simple formulas to estimate parameters
• • Straightforward hypothesis testing in the maximum likelihood framework
• • Interpretable measures of model fit even when the distribution of the data is unknown
• • Nonparametric tests of association based on Spearman's rank correlation
• • Easy to evaluate the assumptions using the distribution of residuals