So you’ve done an experiment. Most likely, you’ve obtained numbers.
If you didn’t, you don’t need to read this book. If you’re still reading, it means you have some numbers—data. It turns out that data are no good on their own. They need to be analyzed. Over the course of this book, I hope to convince you that the best way to think about analyzing your data is with statistical modeling. Even if you don’t make it through the book, and you don’t want to ever think about models again, you will still almost certainly find yourself using them when you analyze your data.
WHAT IS STATISTICAL MODELING?
I think it’s important to make sure we are all starting at the same place. Therefore, before trying to explain statistical modeling, I first want to discuss just plain modeling.
Modeling (for the purposes of this book) is the attempt to describe a series of measurements (or other kinds of numbers or events) using mathematics. From the perspective of machine learning, a model can only be considered useful if it can describe the series of measurements more succinctly (or compactly) than the list of numbers themselves. Indeed, in one particularly elegant formulation, the information the machine “learns” is precisely the difference between the length of the list of numbers and the length of its compact representation in the model. However, another important thing (in my opinion) to ask about a model, besides its compactness, is whether it provides some kind of “insight” or “conceptual simplification” about the numbers in question.
FIGURE 2.1 Example of observations explained by Newton’s law of universal gravitation and Newton’s law of cooling. Models attempt to describe a series of numbers using mathematics. On the left, observations (x1, x2, x3, ...) of the position of a planet as it wanders through the night sky are predicted (dotted line) by Newton’s law of universal gravitation. On the right, thermometer readings (T1, T2, T3, ...) decreasing according to Newton’s law of cooling describe the equilibration of the temperature of an object whose temperature is greater than its surroundings (T0).
Let’s consider a very simple example of a familiar model: Newton’s law of universal gravitation (Figure 2.1, left panel). A series of measurements of a flying (or falling) object can be replaced by the starting position and velocity of the object, along with a simple mathematical formula (second derivative of the position is proportional to mass over distance squared), and some common parameters that are shared for most flying (or falling) objects. What’s so impressive about this model is that (1) it can predict a huge number of subsequent observations, with only a few parameters, and (2) it introduces the concept of gravitational force, which helps us understand why things move around the way they do.
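To make the compactness concrete, here is a minimal sketch (in Python, with made-up numbers) of how a handful of parameters can stand in for an arbitrarily long list of measurements of a falling object. For simplicity, it uses constant gravitational acceleration near the Earth’s surface rather than the full inverse-square law:

```python
def predict_height(y0, v0, t, g=9.8):
    """Closed-form height of a falling object under constant
    gravitational acceleration g, given starting height y0 and
    starting (upward) velocity v0."""
    return y0 + v0 * t - 0.5 * g * t ** 2

# Three parameters (y0, v0, g) reproduce an arbitrarily long list
# of "measurements" -- here, heights at ten successive times.
observations = [predict_height(100.0, 0.0, t / 10) for t in range(10)]
```

The point of the sketch is the compression: however many observations we take, the model needs only the starting position, the starting velocity, and a shared constant.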
It’s important to note that physicists do not often call their models “models” but, rather, “laws.” This is probably for historical or marketing reasons (“laws of nature” sounds more convincing than “models of nature”), but as far as I can tell, there’s no difference in principle between Newton’s law of universal gravitation and any old model that we might see in this book. In practice, the models we’ll make for biology will probably not have the simplicity or explanatory power of Newton’s laws or Schrödinger’s equation, but that is a difference of degree and not of kind. Biological models are probably more similar to a different, lesser-known physics model, also attributed to Newton: his (nonuniversal) “law of cooling,” which predicts measurements of the temperatures of certain types of hot things as they cool off (Figure 2.1, right panel). Although this model doesn’t apply to all hot things, once you’ve found a hot thing that does fit it, you can predict the temperature of that thing over time, based simply on the difference between its temperature and that of its surroundings. Once again, a simple mathematical formula predicts many observations, and we gain the simple insight that the rate at which objects cool is proportional to how much hotter they are than their surroundings. Much like this “law of cooling,” once we’ve identified a biological system that we can explain using a simple mathematical formula, we can compactly represent the behavior of that system.
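The “law of cooling” can itself be written down in a few lines. The sketch below (in Python; the parameter values are illustrative, not taken from the text) uses the standard closed-form solution of Newton’s law of cooling:

```python
import math

def temperature(t, t_env, t_init, k):
    """Newton's law of cooling, dT/dt = -k * (T - t_env), has the
    closed-form solution below: the gap between the object's
    temperature and its surroundings shrinks exponentially at rate k."""
    return t_env + (t_init - t_env) * math.exp(-k * t)

# Three parameters (t_env, t_init, k) predict a whole cooling curve.
readings = [temperature(t, t_env=20.0, t_init=90.0, k=0.1) for t in range(60)]
```

As with gravitation, the model replaces a long list of thermometer readings with a short formula and a few numbers.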
We now turn to statistical modeling, which is our real focus here. Statistical modeling also tries to represent some observations of numbers or events, now called a “sample” or “data,” in a more compact way, but it includes the possibility of randomness in the observations. Without getting into a philosophical discussion of what this “randomness” is (see my next book), let’s just say that statistical models acknowledge that the data will not be “fully” explained by the model. Statistical models will be happy to predict something about the data, but they will not be able to reproduce the exact list of numbers. One might say, therefore, that statistical models are inferior, because they lack the explanatory power of, say, Newton’s laws, which always give an exact answer. However, this assumes that you want to explain your data exactly; in biology, you almost never do. Every biologist knows that whenever they write down a number, part of the observation is actually just randomness or noise due to methodological, biological, or other experimental contingencies. Indeed, it was a geneticist (Fisher) who really invented statistics after all, some 250 years after Newton’s law of universal gravitation. Especially with the advent of high-throughput molecular biology, it has never been more true that much of what we measure in biological experiments is noise or randomness. We spend a lot of time and energy measuring things that we can’t explain and don’t really want to. That’s why we need statistical modeling.
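As a hypothetical illustration of the difference (my own example, not from the text): a statistical model might summarize a thousand noisy measurements with just two parameters, while making no attempt to reproduce the individual numbers.

```python
import random

random.seed(0)

# Illustrative "measurements": a true signal plus noise that the
# model does not try to explain.
true_mean = 5.0
data = [true_mean + random.gauss(0, 1) for _ in range(1000)]

# The statistical model compresses 1000 numbers into two parameters
# (a mean and a spread), but it cannot reproduce the exact list.
est_mean = sum(data) / len(data)
est_var = sum((x - est_mean) ** 2 for x in data) / (len(data) - 1)
```

The fitted mean lands close to the true signal, while the variance summarizes the part of each measurement we have given up on explaining.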
Although I won’t spend more time on it in this book, it’s worth noting that sometimes the randomness in our biological observations is interesting and is something that we do want to explain. Perhaps this is best appreciated in the gene expression field, where it’s thought that inherently stochastic molecular processes create “noise” or stochasticity in gene expression (McAdams and Arkin 1997). In this case, there has even been considerable success predicting the mathematical form of the variability based on biophysical assumptions (Shahrezaei and Swain 2008). Thus, statistical modeling is not only a convenient way to deal with imperfect experimental measurements but, in some cases, the only way to deal with the inherent stochasticity of nature.
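To give a flavor of what “predicting the mathematical form of the variability” can mean, here is a sketch (my own illustrative example, not the models from the cited papers) of the simplest birth–death model of transcription: mRNA is produced at a constant rate and degraded at a rate proportional to copy number. The steady-state copy-number distribution of this model is Poisson, so the variance of the simulated counts should come out close to their mean.

```python
import random

random.seed(1)

def gillespie_mrna(birth=10.0, death=1.0, t_end=50.0):
    """Simulate one trajectory of a birth-death process with the
    Gillespie algorithm and return the copy number at time t_end."""
    t, n = 0.0, 0
    while t < t_end:
        total = birth + death * n          # total event rate
        t += random.expovariate(total)     # time to next event
        if random.random() < birth / total:
            n += 1                         # transcription event
        else:
            n -= 1                         # degradation event
    return n

samples = [gillespie_mrna() for _ in range(500)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
# At steady state, mean and variance should both be near birth/death = 10.
```

Here the model predicts not the individual counts but the shape of their variability, which is exactly the kind of prediction a statistical model makes.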