HOW DO I KNOW IF MY DATA FITS A PROBABILITY MODEL?
For example, let's consider data from a single-cell RNA-seq experiment (Shalek et al. 2014). I took the measurements for one gene from 96 control cells—these should be as close as we can get to i.i.d.—and plotted their distribution in Figure 2.3. The measurements we get are numbers greater than or equal to 0, but many of them are exactly 0, so they aren't well described by a Gaussian distribution. What about the Poisson? This distribution does assign a probability to observations of exactly 0, but it only models the natural numbers 0, 1, 2, ..., so it can't assign probabilities to observations in between. The exponential is a continuous distribution on the non-negative numbers, so it is another candidate for this data.
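These properties of the three candidate distributions can be checked directly. The sketch below (in Python, using scipy; the parameter values are arbitrary illustrations, not fits to the Shalek et al. data) shows why each one is or isn't suited to non-negative measurements with many exact zeros:

```python
# Why each candidate distribution does or doesn't suit non-negative
# data with many exact zeros. Parameter values here are arbitrary
# illustrations, not fits to the real data.
from scipy import stats

gauss = stats.norm(loc=3, scale=2)
# A Gaussian always puts some probability on negative values,
# which the measurements can never take.
print(gauss.cdf(0))              # P(X < 0) > 0

pois = stats.poisson(3)
# The Poisson does assign probability to exact zeros...
print(pois.pmf(0))               # P(X = 0) = e^{-3} ≈ 0.0498
# ...but only to whole numbers: non-integer values get probability 0.
print(pois.pmf(1.5))             # 0.0

expo = stats.expon(scale=3)
# The exponential is continuous and non-negative: its density is
# highest at 0, but any single exact value (including 0) has
# probability 0.
print(expo.pdf(0), expo.cdf(0))  # density 1/3 at 0, but P(X <= 0) = 0
```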
Attempts to fit this data with these standard distributions are shown in Figure 2.3. I hope it's clear that all of these models underpredict the large number of cells with low expression levels (exactly 0) as well as the cells with very large expression levels. This example illustrates a common problem in modeling genomic data: it doesn't fit well with any of the standard, simplest models used in statistics. This motivates the use of fancier models; for example, the data from single-cell RNA-seq experiments can be modeled using a mixture model (which we will meet in Chapter 5).
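Fits like those in Figure 2.3 are obtained by maximum likelihood, and for these three distributions the maximum-likelihood estimates have simple closed forms. A minimal Python sketch, using made-up counts in place of the real Shalek et al. measurements (which are not reproduced here):

```python
# Maximum-likelihood fits of the three candidate distributions.
# The data array is hypothetical: many exact zeros plus a long tail,
# mimicking the shape of single-cell expression measurements.
import numpy as np
from scipy import stats

data = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 5, 8, 13, 21, 40],
                dtype=float)

# Gaussian MLE: sample mean and standard deviation.
mu, sigma = data.mean(), data.std()

# Poisson MLE: the rate parameter is the sample mean.
lam = data.mean()

# Exponential MLE: the scale parameter is also the sample mean.
scale = data.mean()

# Compare the fitted Poisson's predicted fraction of exact zeros to
# the data; the continuous models predict probability 0 for any exact
# value, so they cannot account for the spike at 0 at all.
observed_zero_frac = np.mean(data == 0)
print("observed fraction of zeros:", observed_zero_frac)
print("Poisson predicts:", stats.poisson(lam).pmf(0))
```

With data like this, the fitted Poisson badly underpredicts the zeros, which is exactly the failure visible in Figure 2.3.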
But how do I know the data don't fit the standard models? So far, I plotted the histogram of the data, compared it to the predicted probability distribution, and argued based on the plot that they don't fit well. One can make this more formal using a so-called quantile-quantile plot (or qq-plot for short). The idea of the qq-plot is to compare, at each point in the range, the fraction of data observed below that point to the fraction expected based on the mathematical formula for the distribution (the theory). For example, the Gaussian distribution predicts that about 0.1% of the data falls more than 3 standard deviations below the mean, about 2.3% falls more than 2 standard deviations below, about 16% falls more than 1 standard deviation below, and so on. For any observed set of data, it's possible to calculate these so-called "quantiles" and compare them to the predictions of the Gaussian (or any other standard) distribution. If the quantiles of the observed data agree well with the quantiles we expect, we will see a straight line on the qq-plot. If you are doing a statistical test that depends on the assumption that your data follow a certain distribution, making a qq-plot in R is a quick and easy way to check how reasonable that assumption is.
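The computation behind a qq-plot can be sketched in a few lines. Here is a Python version (the book's examples use R, and the data here are simulated rather than the real measurements):

```python
# Sketch of the qq-plot computation: compare the quantiles of the
# observed data to the theoretical quantiles of a candidate
# distribution. The data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=3, size=96)  # pretend observations

# Empirical quantiles: the sorted data, where the i-th sorted point
# sits at position (i + 0.5) / n of the distribution.
n = len(data)
probs = (np.arange(n) + 0.5) / n
empirical_q = np.sort(data)

# Theoretical quantiles of the model we want to check (here, a
# Gaussian with the sample mean and standard deviation).
theoretical_q = stats.norm(data.mean(), data.std()).ppf(probs)

# The Gaussian quantile facts quoted in the text:
print(stats.norm.cdf([-3, -2, -1]))  # ≈ [0.0013, 0.0228, 0.1587]

# If the model fit, the points (theoretical_q, empirical_q) would
# fall on a straight line; plot them to check, e.g. with matplotlib:
#   plt.scatter(theoretical_q, empirical_q)
```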
Perhaps more important than the fit to the data is that the models make very different qualitative claims about the data. For example, the Poisson model predicts that there's a typical expression level we expect (in this case, around 3), and that we will see fewer cells with expression levels much greater or less than that. On the other hand, the exponential model predicts that the most common expression levels will be near 0, and that the number of cells will decrease monotonically as the expression level gets larger. By choosing to describe our data with one model or the other, we are making a very different decision about what's important.
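This qualitative difference is easy to see numerically. A short Python sketch (using the rate and scale of 3 from the example above):

```python
# The Poisson pmf rises to a peak near its mean, while the
# exponential density is largest at 0 and strictly decreasing.
from scipy import stats

pois = stats.poisson(3)
pmf = [pois.pmf(k) for k in range(8)]
# pmf[3] is much larger than pmf[0]: a typical expression level
# around 3, with fewer cells far above or below it.

expo = stats.expon(scale=3)
pdf = [expo.pdf(x) for x in range(8)]
# pdf[0] > pdf[1] > pdf[2] > ...: most cells near 0, steadily
# fewer at higher expression levels.
```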