LINEAR DISCRIMINANT ANALYSIS (LDA) AND THE LOG LIKELIHOOD RATIO
In Chapter 6, we considered clustering using “hidden variables” that were 1 if the datapoint was in a particular cluster, and 0 otherwise. We showed that the computer could automatically learn a different model for each cluster or hidden state. The algorithm would then automatically separate datapoints into clusters corresponding to the values of these hidden variables.
If we think of the two “clusters” in the data as the two classes that we would like to separate (e.g., the suitcases with illegal substances and those that should be allowed to pass through), clustering using probabilistic models with hidden variables could be considered a classification task, except that we didn’t do any training to learn what each type of suitcases smelled like. Clustering is therefore also called “unsupervised classification” to reflect that it can be thought of as classification without training (or supervision, in the machine learning jargon). Standard classification is referred to as “supervised” because we first teach the algorithm (sniffer dog) what the classes look like by showing it “training data” examples that are “labeled” by their class, and then apply it on new “unlabeled” examples. In general, supervised classification methods are expected to be much more accurate than their unsupervised counterparts. We will now consider two classification methods (LDA and Naive Bayes) that can be considered the supervised equivalents of Gaussian mixture models.
Linear discriminant analysis (or LDA) is a probabilistic classification strategy where the data are assumed to have Gaussian distributions with different means but the same covariance, and where classification is typically done using the ML rule. The training step for LDA consists of estimating the means for each class, and the covariance of the entire dataset. For example, to perform LDA on the data in Figure 10.1, we would find the mean expression for the T cells and the mean expression for all other cells, say p1 (the mean when Y = 1) and p„ (the mean when Y = 0). We then estimate a single covariance matrix Z for the whole data. This is done using the standard formulas for multivariate Gaussians that we saw in Chapter 4. For a new datapoint number n + 1 (dropping the n + 1 subscript notation for simplicity), we evaluate the ML rule
for the specific case of LDA, or more generally
for binary classification. Dividing both sides by the probability of the data given the negative class, we have
The quantity on the left of this equation can be thought of as a likelihood ratio (LR), because we compute the probability of the data (a single observation, X) given the model for the positive class and compare it to the probability of the same data given the model for the negative class. Thus, the ML rule can be written as LR > 1, or more often as
The log-likelihood ratio (log LR) is a widely used statistic for classification. (It should not be confused with the likelihood ratio test statistic, and does not, in general, have a known distribution under a null hypothesis.) Going back to the specific case of LDA where we assume Gaussian models, this is
which can actually be solved analytically to be
Although this formula is a bit complicated, it actually has a very elegant geometric interpretation. It’s easiest to see this by first considering the case where the covariance is the identity matrix. If the covariance is I,
The left side of this equation says to take the new observation X and project it onto the vector that is the difference between the two means. The right is the projection of the average of the two means onto the difference of the two means. Geometrically, this is the midpoint along the vector that connects the two means. So the ML rule says to classify the observation as a positive if the projection of the observation onto the difference is more than halfway along the line between the two means (illustrated in Figure 10.3).
I hope this makes a lot of sense: In the LDA model, where the variance is the same in the two classes, the vector between the two means is the direction that contains the information about difference between the two classes. Therefore, the projection on this vector is the part of the observation that contains the information about which class the new observation is in. We can think about LDA as a projection of the observation vector onto a one-dimensional coordinate system that optimally separates the two classes. In that dimension, you simply assign the new observation to the mean that it’s closer to (more than halfway in the two class scenario).
FIGURE 10.3 Geometric interpretation of the LDA classification rule. Vectors p and p0 illustrate the means for the positive and negative classes respectively. The decision boundary (dotted line) is orthogonal to the vector between the two means (p - p0) and intersects at the average of the two means (p! +p2)/2.
All the other components of X can be thought of as distractions: They don’t tell us anything about which class X belongs to.
This geometric interpretation also tells us how to think about the general case where the covariance of the data is not the identity: Before we project the observation X onto the difference between the means, we do another transformation of the space. This corresponds to a rotation or multiplication by the inverse of the covariance matrix. Indeed, common practice in linear classification is to estimate a single covariance matrix and multiply the data by the inverse of that matrix (which corresponds to a rotation). This transforms the data into a coordinate system where the covariance is I, and is called “whitening” because white noise is considered to be a Gaussian with covariance I.
DERIVING THE LDA DECISION BOUNDARY
We start with the ML rule, namely, assign the data, X, to the more likely class:
To derive the decision boundary, let's first look at the log of the Gaussian distribution.
Notice that the first term depends only on the covariance (and the dimensionality, d) which, by assumption, is the same for both classes in LDA. This means that this term will cancel out in the log ratio. Multiplying out the second term, we get
and, once again, we have one term that only depends on the covariance and the data. This term will therefore cancel out. We will be left with
We now factor the terms and multiply by /. which is the formula given above.