An important generalization of the statistical models that we’ve seen so far is to the case where multiple events are observed at the same time. In the models we’ve seen so far, observations were single events: yes or no, numbers or letters. In practice, a modern molecular biology experiment typically measures more than one thing, and a genomics experiment might yield measurements for thousands of things: for example, measurements for all the genes in the genome.
A familiar example of an experiment of this kind might be a set of genome-wide expression level measurements. In the ImmGen data, for each gene, we have measurements of gene expression over ~200 different cell types. Although in the previous chapters we considered each of the cell types independently, a more comprehensive way to describe the data is that for each gene, the observation is actually a vector, X, of length ~200, where each element of the vector is the expression measurement for a specific cell type. Alternatively, for other questions it might be more convenient to think of each observation as a cell type, where the observation is now a vector of 24,000 gene expression measurements. This situation is known in statistics as “multivariate,” meaning that multiple variables are being measured simultaneously. Conveniently, the familiar Gaussian distribution generalizes to the multivariate case, except that the scalar (single-number) mean and variance parameters are now replaced with a mean vector and a (co)variance matrix:
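In its standard textbook form, the multivariate Gaussian density can be written as follows. The notation N(X | μ, Σ) for the density is an assumption here, and Σ is assumed to be symmetric and positive definite so that its inverse and determinant exist:

$$
N(\vec{X} \mid \vec{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^{d}\, |\boldsymbol{\Sigma}|}} \exp\left[ -\frac{1}{2} (\vec{X} - \vec{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\vec{X} - \vec{\mu}) \right]
$$

Setting d = 1 recovers the familiar univariate Gaussian, with the 1 × 1 covariance matrix playing the role of the variance.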
In the formula for the multivariate Gaussian, I’ve used a small d to indicate the dimensionality of the data, so that d is the length of the vectors μ and X, and the covariance is a matrix of size d × d. In this formula, I’ve explicitly written small arrows above the vectors and bolded the matrices. In general, the machine learning people will not do this (I will adopt their convention), and it will be left to the reader to keep track of which symbols are scalars, vectors, and matrices. If you’re hazy on your vector and matrix multiplications and transposes, you’ll have to review them in order to follow the rest of this section (and most of this book).
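To make the vector and matrix operations concrete, here is a minimal sketch in Python with NumPy of how the log of the multivariate Gaussian density might be evaluated. The function name `mvn_logpdf` and the numbers in the example are made up for illustration and are not taken from the ImmGen data. The sketch works on the log scale because, for vectors with hundreds or thousands of elements, the density itself can easily underflow to zero.

```python
import numpy as np


def mvn_logpdf(x, mu, sigma):
    """Log density of a d-dimensional Gaussian evaluated at x.

    x, mu : vectors of length d
    sigma : d x d covariance matrix (assumed symmetric, positive definite)
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]

    diff = x - mu  # the vector (X - mu)
    # Solve sigma @ z = diff instead of forming the inverse explicitly.
    z = np.linalg.solve(sigma, diff)
    # Quadratic form (X - mu)^T sigma^{-1} (X - mu); the result is a scalar.
    quad = diff @ z
    # Log of the determinant of sigma, computed stably.
    _sign, logdet = np.linalg.slogdet(sigma)

    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


# Hypothetical example with d = 2 (made-up numbers).
mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.5, 1.0])

log_p = mvn_logpdf(x, mu, sigma)
print(log_p)
```

Note that the quadratic form collapses a length-d vector, a d × d matrix, and another length-d vector into a single number, which is why the density is a scalar even though X and μ are vectors. When the covariance matrix is diagonal, the result equals the sum of d univariate Gaussian log densities, which is a useful sanity check.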