As I've already mentioned, a convenient way to think about multiple observations at the same time is to think about them as lists of numbers, which are known as vectors in mathematical jargon. We will refer to the list of numbers X as a vector X = (x1, x2, x3, ..., xn), where n is the "length" or dimensionality of the vector (the number of numbers in the list). Once we've defined these lists of numbers, we can go ahead and define arithmetic and algebra on these lists. One interesting wrinkle to the mathematics of lists is that for any operation we define, we have to keep track of whether the result is actually a list or a number. For example, a simple subtraction of one vector from another looks like

\[ X - Y = (x_1 - y_1,\; x_2 - y_2,\; \ldots,\; x_n - y_n) \]

which turns out to be a vector. On the other hand, the dot product

\[ X \cdot Y = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_{i=1}^{n} x_i y_i \]

gives us a number. The generalization of algebra means that we can write equations like

\[ X - Y = 0 \]

which means that

\[ x_1 - y_1 = 0, \quad x_2 - y_2 = 0, \quad \ldots, \quad x_n - y_n = 0 \]

and is therefore a shorthand way of writing n equations in one line.
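
The distinction between operations that return a list and operations that return a number can be sketched in a few lines of Python. This is only an illustration using plain lists: the helper names `subtract` and `dot` and the example values are made up for this sketch and do not come from the text.

```python
# Vectors as plain Python lists (illustrative values, not data from the text).
X = [5.0, 3.0, 8.0]
Y = [1.0, 2.0, 3.0]


def subtract(u, v):
    """Element-wise difference of two equal-length vectors; returns a list."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [ui - vi for ui, vi in zip(u, v)]


def dot(u, v):
    """Dot product of two equal-length vectors; returns a single number."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(ui * vi for ui, vi in zip(u, v))


difference = subtract(X, Y)  # a list of n numbers
product = dot(X, Y)          # one number

print(difference)
print(product)
```

Checking `type(difference)` and `type(product)` is a quick way to confirm which kind of object each operation produced.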

Since mathematicians love generalizations, there's no reason we can't generalize the idea of lists to also include a list of lists, so that each element of the list is actually a vector. This type of object is what we call a matrix: A = (X1, X2, X3, ..., Xm), where X1 = (x1, x2, x3, ..., xn). To refer to each element of A, we can write A11, A12, A13, ..., A21, A22, A23, ..., Amn. We can then go ahead and define some mathematical operations on matrices as well: if A and B are matrices, A - B = C means that for all i and j, Cij = Aij - Bij.
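
A list of lists maps directly onto a nested Python list, so element-wise matrix subtraction can be sketched as follows. The function name `mat_subtract` and the numbers are assumptions made for illustration. Note that Python counts from 0, so the element written A11 in the text is `A[0][0]` here.

```python
# A matrix as a list of m rows, each row a list of n numbers (made-up values).
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[1, 1, 1],
     [2, 2, 2]]


def mat_subtract(P, Q):
    """Return C with C[i][j] = P[i][j] - Q[i][j] for all i and j."""
    same_shape = len(P) == len(Q) and all(
        len(row_p) == len(row_q) for row_p, row_q in zip(P, Q)
    )
    if not same_shape:
        raise ValueError("matrices must have the same shape")
    return [
        [p - q for p, q in zip(row_p, row_q)]
        for row_p, row_q in zip(P, Q)
    ]


C = mat_subtract(A, B)
print(C)
```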

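We can also do mixtures of matrices and vectors and numbers. For example, if c is a number, x and y are vectors, and A is a matrix, then

\[ c\,x + A\,y, \qquad \text{with } i\text{th component } c\,x_i + \sum_{j} A_{ij}\, y_j \]

turns out to be a vector.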
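
As a sketch of one such mixture, the code below computes cx + Ay for a number c, vectors x and y, and a matrix A. The choice of this particular expression, the helper names, and the values are all assumptions made for illustration; the point is only that the result is again a list of numbers.

```python
c = 2                      # a number
x = [1, 0]                 # a vector of length m = 2
y = [1, 1, 1]              # a vector of length n = 3
A = [[1, 2, 3],
     [4, 5, 6]]            # an m x n matrix


def scale(number, v):
    """Multiply every component of v by a number."""
    return [number * vi for vi in v]


def mat_vec(M, v):
    """Matrix-vector product: component i is the sum over j of M[i][j] * v[j]."""
    if any(len(row) != len(v) for row in M):
        raise ValueError("each row of M must be as long as v")
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [ui + vi for ui, vi in zip(u, v)]


result = add(scale(c, x), mat_vec(A, y))
print(result)
```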

However, there's one very inelegant issue with the generalization to matrices: what we mean when we refer to the value Aij depends on whether the i refers to the index in 1 through m or 1 through n. In other words, we have to keep track of the structure of the matrix. To deal with this issue, linear algebra has developed a set of internally consistent notations, which are referred to as the "row" or "column" conventions. So anytime I write the vector, x, by default I mean the "column" vector

\[ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \]

To indicate the "row" vector, I have to write the "transpose" of x, or x^T. The transpose is defined as the operation of switching all the rows and columns. So in fact, there are two kinds of products that can be defined:

\[ x^T y = \sum_{i=1}^{n} x_i y_i \]

which is the familiar dot product also known as the "inner product," and produces a number, while

\[ x y^T = \begin{pmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_n y_1 & x_n y_2 & \cdots & x_n y_n \end{pmatrix} \]

which is the so-called "outer" product that takes two vectors and produces a matrix.
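
The difference between the two products is easy to see in code. The following sketch uses plain Python lists, with function names and values chosen only for illustration: the inner product collapses two vectors into one number, while the outer product builds a matrix whose (i, j) entry is the product of component i of the first vector and component j of the second.

```python
x = [1, 2, 3]
y = [4, 5, 6]


def inner(u, v):
    """Inner (dot) product: the sum of u[i] * v[i]; returns a number."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(ui * vi for ui, vi in zip(u, v))


def outer(u, v):
    """Outer product: a matrix with entry [i][j] equal to u[i] * v[j]."""
    return [[ui * vj for vj in v] for ui in u]


number = inner(x, y)
matrix = outer(x, y)

print(number)
for row in matrix:
    print(row)
```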

Although you don't really have to worry about this stuff unless you are doing the calculations, I will try to use consistent notation, and you'll have to get used to seeing these linear algebra notations as we go along.

Finally, an interesting point here is to consider the generalization beyond lists of lists: it's quite reasonable to define a matrix where each element of the matrix is actually a vector. This object is called a tensor. Unfortunately, as you can imagine, when we get to objects with three indices, there's no simple convention like "rows" and "columns" that we can use to keep track of the structure of the objects. I will at various times in this book introduce objects with more than two indices, especially when dealing with sequence data. However, in those cases, it won't be obvious what the generalizations of addition and subtraction from linear algebra mean exactly, because the notation gives us no way to keep track of the indices. Sometimes I will try to use the linear algebra notation anyway, and things might get confusing, but in many cases I'll have to write out the sums explicitly when we get beyond two indices.

Matrices also have different types of multiplications: the matrix product produces a matrix, but there are also inner products and outer products that produce other objects. A related concept that we've already used is the "inverse" of a matrix. The inverse of a matrix A is the matrix that, when multiplied by A, gives the so-called identity matrix, I, which has 1's along the diagonal and 0's everywhere else.
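
To make the inverse concrete, here is a small sketch for the 2 x 2 case only, using the standard closed-form formula for the inverse of a 2 x 2 matrix. The function names and the example matrix are assumptions for illustration; larger matrices would normally be handled by a numerical library rather than by hand.

```python
def matmul(P, Q):
    """Matrix product: entry [i][j] is the sum over k of P[i][k] * Q[k][j]."""
    if any(len(row) != len(Q) for row in P):
        raise ValueError("columns of P must match rows of Q")
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]


def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix [[a, b], [c, d]] via the closed-form formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det],
            [-c / det, a / det]]


A = [[2.0, 1.0],
     [1.0, 1.0]]
A_inv = inverse_2x2(A)
identity = matmul(A, A_inv)  # should have 1's on the diagonal, 0's elsewhere

print(A_inv)
print(identity)
```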

Although this might all sound complicated, multivariate statistics is easy to understand because there’s a very straightforward, beautiful geometric interpretation to it. The idea is that we think of each component of the observation vector (say, each gene’s expression level in a specific cell type) as a “dimension.” If we measure the expression level of two genes in each cell type, we have two-dimensional data. If we measure three genes, then we have three-dimensional data. If 24,000 genes, then we have ... you get the idea. Of course, we won’t have an easy time making graphs of 24,000-dimensional space, so we’ll typically use two- or three-dimensional examples for illustrative purposes. Figure 4.2 tries to illustrate the idea.

FIGURE 4.2 Multivariate observations as vectors. Different types of data typically encountered in molecular biology can be represented as lists or vectors. The top left shows i.i.d. multivariate observations in a pool. Each observation corresponds to a vector (list of observations) of length d. On the bottom left, a single gene expression observation for three genes is represented as a vector in three-dimensional space. On the top right, a two-dimensional observation of phenotype and genotype is indicated in a space with a discrete horizontal dimension and a continuous vertical dimension. On the bottom center, a codon is represented as a point in three-dimensional space where each dimension corresponds to one codon position. On the bottom right, a single nucleotide position is represented in a four-dimensional space. Note that discrete observations like sequences or genotypes can be represented mathematically in different ways.

In biology, there are lots of other types of multivariate data. For example, one might have observations of genotypes and phenotypes for a sample of individuals. Another ubiquitous example is DNA sequences: the letter at each position can be thought of as one of the dimensions. In this view, each of our genomes represents a three-billion-dimensional vector sampled from the pool of the human population. In an even more useful representation, each position in a DNA (or protein) sequence can be represented as a 4 (or 20)-dimensional vector, and the human genome can be thought of as a 3 billion × 4 matrix of 1s and 0s. In these cases, the components of observations are not all numbers, but this should not stop us from using the geometrical interpretation that each observation is a vector in a high-dimensional space (Figure 4.2 illustrates multivariate data).
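
The representation of a DNA sequence as a matrix of 1s and 0s can be sketched as follows. The column order A, C, G, T, the function name `one_hot`, and the short example sequence are arbitrary choices made for this illustration.

```python
ALPHABET = "ACGT"  # column order is an arbitrary choice for this sketch


def one_hot(sequence):
    """Encode a DNA string as a list of rows, one 4-dimensional 0/1 vector per position."""
    rows = []
    for letter in sequence.upper():
        if letter not in ALPHABET:
            raise ValueError(f"unexpected letter: {letter!r}")
        rows.append([1 if letter == base else 0 for base in ALPHABET])
    return rows


encoded = one_hot("GATTACA")  # a 7 x 4 matrix of 1s and 0s

for letter, row in zip("GATTACA", encoded):
    print(letter, row)
```

Each row contains exactly one 1, so a sequence of length L becomes an L × 4 matrix; the same idea with a 20-letter alphabet would give L × 20 for proteins.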

A key generalization that becomes available in multivariate statistics is the idea of correlation. Although we will still assume that the observations are i.i.d., the dimensions are not necessarily independent (Figure 4.3). For example, in a multivariate Gaussian model for cell-type gene expression, the observation of a highly expressed gene X might make us more likely to observe a highly expressed gene Y. In the multivariate Gaussian model, the correlation between the dimensions is controlled by the off-diagonal elements in the covariance matrix, where each off-diagonal entry summarizes the correlation between a pair of dimensions (Figure 4.3). Intuitively, an off-diagonal term of zero implies that there is no correlation between two dimensions. In a multivariate Gaussian model where all the dimensions are independent, the off-diagonal terms of the covariance matrix are all zero, so the covariance is said to be diagonal. A diagonal covariance leads to a distribution whose axes line up with the dimensions, and when the diagonal entries (the variances) are also all equal, the distribution is symmetric, isotropic, or (most confusingly) “spherical.”
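
To see what the off-diagonal entries look like in practice, the sketch below estimates a covariance matrix from a handful of two-dimensional observations. It uses the usual sample covariance with an n - 1 denominator, which is a choice made for this example, and the two small data sets are invented purely for illustration.

```python
def covariance_matrix(observations):
    """Sample covariance matrix of a list of d-dimensional observations."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    d = len(observations[0])
    means = [sum(obs[j] for obs in observations) / n for j in range(d)]
    return [
        [
            sum((obs[j] - means[j]) * (obs[k] - means[k]) for obs in observations)
            / (n - 1)
            for k in range(d)
        ]
        for j in range(d)
    ]


# Made-up data: the second dimension rises with the first.
correlated = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
# Made-up data: the two dimensions vary independently of each other.
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

cov_correlated = covariance_matrix(correlated)
cov_uncorrelated = covariance_matrix(uncorrelated)

print(cov_correlated)    # nonzero off-diagonal entries
print(cov_uncorrelated)  # off-diagonal entries are zero: a diagonal covariance
```

In the second data set the two diagonal entries are also equal, which corresponds to the "spherical" case in two dimensions.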
