So far, we’ve been thinking about regressing one list of numbers on another, but this is only the very simplest use of linear regression in modern molecular biology. Regression can also help us think about the kinds of high-dimensional data vectors that we were considering in the context of clustering and sequence analysis in Chapters 5 and 6. The major way regression is used in this context is when we reinterpret X as a high-dimensional set of variables that might explain Y.
PREDICTING Y USING MULTIPLE Xs
A good motivation for multiple regression is the data we’ve already seen in Chapters 3 and 7 for quantitative trait loci for gene expression, or eQTL analysis: We might be trying to explain gene expression level as a function of genotypes at many loci that we’ve measured for that individual. We want to see if the level of a gene Y is associated with the genotype at a combination of loci.
Multiple regression will allow us to make predictions about Y based on multiple dimensions of X. Notice that in this example (Figure 8.1), the relationship is not strictly linear: The genotype on chr8 only has an effect on gene expression if the cells have the “a” allele on chr3. This type of behavior can be included in a linear model by including another dimension of X that is set to be the product of the genotypes at the two loci: If the genotypes are represented as 0 and 1 (e.g., for a and A), the product is 1 only when an individual has genotype 1 at both loci. In this example, there would be a negative coefficient associated with the product of the genotypes: SAG1 expression is lower if you have genotype 1 at both loci. However, even without this
FIGURE 8.1 The expression level of the yeast SAG1 gene is associated with the genotypes at two loci. To regress a phenotype (like gene expression) on genotype, we use 0 and 1 (and 2 for a diploid) to represent the possible genotypes at each locus. (Data from Brem and Kruglyak, 2005.)
interaction term, it is totally reasonable to test for an association between both loci and the gene expression level using multiple regression.
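The interaction described above can be sketched numerically. The genotypes and expression values below are invented for illustration (they are not the Brem and Kruglyak data); the point is that a product column in the design matrix picks up an effect that only appears when both loci carry genotype 1, and the fitted interaction coefficient comes out negative.

```python
import numpy as np

# Hypothetical data: genotypes (0 or 1) at two loci for 8 haploid individuals,
# and an expression phenotype that is low only when both genotypes are 1.
chr3 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
chr8 = np.array([0, 0, 1, 1, 0, 0, 1, 1])
expr = np.array([5.1, 4.9, 5.0, 5.2, 5.1, 4.8, 2.1, 1.9])

# Design matrix: intercept, each genotype, and their product (the interaction).
# The product column is 1 only for individuals with genotype 1 at both loci.
X = np.column_stack([np.ones_like(chr3), chr3, chr8, chr3 * chr8])

# Least-squares fit of expr = X b
b, *_ = np.linalg.lstsq(X, expr, rcond=None)
# b[3], the interaction coefficient, comes out strongly negative here,
# reflecting the drop in expression when both loci have genotype 1.
```

Because each combination of genotypes appears in the data, the four coefficients simply recover the differences among the four group means; with real eQTL data, the same fit would be accompanied by a test of whether the interaction coefficient differs significantly from zero.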
To define multiple regression more formally, we now have a vector of “features” or “covariates” (Xs) that we are trying to use to explain a “response” (Y). Once again, we will assume that they are related by a simple linear model, so that

Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_m X_{im} = \sum_{j=0}^{m} b_j X_{ij} = X_i b
where I have included the index i for the ith observation of Y (which is so far assumed to still be a simple number) and introduced j to index the m-dimensional feature vector Xi. Notice that, as is typical with multivariate statistics, I have tried to write this conveniently using linear algebra. However, in this case it’s a bit inelegant because I have to add an extra 1 at the beginning of the feature vector X to make sure the b0 term is included, but not multiplied by the first dimension of X. Figure 8.2 shows the structure of the multiple regression problem.
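The bookkeeping with the extra 1 can be made concrete with a short sketch. The numbers below are made up (n = 5 observations, m = 2 features); the key step is prepending a column of 1s so that the matrix product X b automatically includes the intercept b0.

```python
import numpy as np

# Made-up data: 5 observations of a response Y and 2 features per observation.
X = np.array([[0.2, 1.0],
              [1.5, 0.3],
              [0.7, 0.8],
              [2.1, 0.1],
              [1.1, 1.2]])
Y = np.array([1.3, 2.4, 1.9, 2.8, 2.5])

# Prepend a column of 1s: the first coefficient, b0, is then the intercept,
# because each row of X1 @ b computes b0*1 + b1*X_i1 + b2*X_i2.
X1 = np.column_stack([np.ones(len(X)), X])

b, *_ = np.linalg.lstsq(X1, Y, rcond=None)
predictions = X1 @ b   # one predicted Y per observation
```

This is exactly the structure shown in Figure 8.2: the response vector Y (length n) is predicted by the product of the n × (m + 1) data matrix and the coefficient vector b of length m + 1.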
Now we can write out a Gaussian likelihood for the regression, where the mean is just X_i b and everything else is the same as in the univariate case:

L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(Y_i - X_i b)^2}{2\sigma^2}}
FIGURE 8.2 The multiple regression problem expressed using linear algebra notation. The response, Y, is predicted to be the product of the data matrix, X, and a vector of coefficients, b. The sizes of the vectors and matrices correspond to the number of datapoints or the number of features.
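The Gaussian likelihood can also be written down directly in code. The sketch below (with invented data, and σ written as `s`) computes the log-likelihood of a coefficient vector; as in the univariate case, the least-squares fit is the choice of b that maximizes it for any fixed σ.

```python
import numpy as np

def log_likelihood(b, s, X, Y):
    # Sum of Gaussian log-densities: each residual Y_i - X_i b is assumed
    # i.i.d. Normal with mean 0 and standard deviation s.
    resid = Y - X @ b
    return np.sum(-0.5 * np.log(2 * np.pi * s**2) - resid**2 / (2 * s**2))

# Invented example: design matrix with an intercept column and one feature.
X = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])
Y = np.array([1.1, 1.9, 3.2, 3.9])

b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
ll_best = log_likelihood(b_hat, 1.0, X, Y)       # at the least-squares fit
ll_other = log_likelihood(b_hat + 0.5, 1.0, X, Y)  # at some other coefficients
# ll_best exceeds ll_other: moving away from the least-squares solution
# increases the squared residuals and so decreases the likelihood.
```

Note that maximizing this likelihood over b does not depend on σ, which is why least squares and maximum likelihood give the same coefficient estimates under the Gaussian model.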
Note that all the assumptions of univariate regression still hold: We are still assuming the features, X, are measured perfectly (no noise) and that the observations, indexed by i, are i.i.d.