One way to make sense of different types of classification strategies is to consider whether the classification method tries to make a probabilistic model for X (i.e., P(X|Y)), or whether it models P(Y|X) directly (Figure 10.4). In machine learning jargon, these two types of methods are referred to as generative and discriminative, respectively. Of the methods described so far, LDA is a generative method because it assumes a multivariate Gaussian model for each class. Generative models are appealing because it’s easy to state the assumptions and to understand what the data would look like in the ideal case. Generative methods also tend to work better when the models they assume about the data are reasonably realistic, but they can work well even in some situations where the data don’t fit the model closely.

For example, the idea of whitening and then classifying makes sense for LDA because the two classes are assumed to have the same covariance.
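To make the whiten-then-classify idea concrete, here is a minimal numpy sketch of LDA used as a generative classifier: estimate one shared covariance from the pooled, class-centered data, whiten with it, and assign each point to the nearer whitened class mean. All the data, means, and covariance values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data meeting the LDA assumption: two Gaussian classes
# sharing a single covariance matrix (all numbers here are made up).
cov = np.array([[2.0, 1.2], [1.2, 1.5]])
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 2.0])
X0 = rng.multivariate_normal(mu0, cov, size=200)
X1 = rng.multivariate_normal(mu1, cov, size=200)

# Pooled within-class covariance estimate (the shared covariance).
centered = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
S = centered.T @ centered / (len(centered) - 2)

# Whitening: transform so the shared covariance becomes the identity;
# classification then reduces to picking the nearer whitened class mean.
L = np.linalg.cholesky(S)

def whiten(X):
    return np.linalg.solve(L, np.atleast_2d(X).T).T

m0 = whiten(X0.mean(axis=0))[0]
m1 = whiten(X1.mean(axis=0))[0]

def classify(x):
    z = whiten(x)[0]
    return int(np.linalg.norm(z - m1) < np.linalg.norm(z - m0))
```

Because both classes share one covariance, a single whitening transform serves them both; after it, the Euclidean distance to each class mean is exactly the Mahalanobis distance the Gaussian model cares about.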


FIGURE 10.4 Structure of probabilistic classification models. Both generative and discriminative models for classification try to predict the unobserved class (Y) for a new datapoint (n + 1) based on a model trained on observed training examples. (a) The generative model defines the distribution of the features given the class, and then infers the unobserved class using Bayes’ theorem. (b) The discriminative model assumes that the probability distribution of Y depends directly on the features, X.

This assumption also makes the classification much less sensitive to errors in the estimation of the covariance. As I’ve mentioned a few times already, estimating the covariance of a multivariate Gaussian is a major practical hurdle for high-dimensional data. However, because in LDA the errors in the covariance estimate are the same for both classes, their impact on the classification decision will tend to cancel out. In general, the models assumed by generative methods will not fit the data exactly, so it’s wise to use a classification strategy that will not assign new observations to one class simply because they fit the assumptions of the classifier better. However, if the data for both classes deviate from the model in the same (or similar) way, the generative classifier will often do just fine. For example, imagine that a new datapoint is very far from both mean vectors in LDA, and is extremely unlikely under the Gaussian model for either class. You can still go ahead and classify it, because only the component of the observation in the direction of the difference between the means matters. This is one nice feature of classification using a likelihood ratio: you don’t need to claim that the model for either class fits the data very well; you can still assign the new data to the class that fits best.
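The point about far-away datapoints can be checked directly. Under the equal-covariance Gaussian model, the log-likelihood ratio is linear in x, so it depends on x only through its projection onto S⁻¹(μ₁ − μ₀). In the sketch below (all numbers invented), a point is pushed very far from both means along a direction orthogonal to that projection; each class likelihood collapses, but the ratio, and therefore the decision, is unchanged.

```python
import numpy as np

# Two Gaussian classes with a shared covariance S (the LDA model);
# means and covariance are made-up illustrative values.
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 2.0])
S = np.array([[2.0, 1.2], [1.2, 1.5]])
w = np.linalg.inv(S) @ (mu1 - mu0)

def log_lr(x):
    # log P(x | class 1) - log P(x | class 0); with equal covariances
    # the quadratic terms cancel, leaving a linear function of x.
    return w @ (x - (mu0 + mu1) / 2.0)

# Move a point far from BOTH means, orthogonally to w: both class
# likelihoods become tiny, but the log-likelihood ratio is unchanged.
x = np.array([1.0, 0.0])
x_far = x + 100.0 * np.array([-w[1], w[0]])
```

This is the cancellation described above in miniature: the part of the observation orthogonal to the mean difference is equally improbable under both class models, so the likelihood ratio ignores it.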

Discriminative methods are classification strategies that don’t try to build a model of the data for each class; they simply try to find a good classification boundary. Logistic regression is a great example of a discriminative strategy. In general, these methods work better when the data fit a generative model very poorly (or when the generative model would require large numbers of parameters that are hard to estimate). The downside of discriminative models is that they are harder to interpret and understand (they are more of a “black box”), which can make it more difficult to train their parameters while avoiding overfitting. Since they don’t actually make a model of the data, discriminative models don’t use the ML rule.
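To see the discriminative approach at work, here is a minimal logistic regression fit by gradient descent on the log-loss. Notice that the code models P(Y = 1 | X) directly and never models the distribution of X itself. The one-dimensional toy data below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data (made up): class 1 tends to have larger feature values.
x = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
X = np.column_stack([np.ones_like(x), x])  # intercept + feature

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit P(Y = 1 | X) directly by gradient descent on the log-loss.
# No distribution is assumed for X, so there is no ML rule to apply.
beta = np.zeros(2)
for _ in range(5000):
    p = sigmoid(X @ beta)
    beta -= 0.1 * (X.T @ (p - y)) / len(y)

accuracy = np.mean((sigmoid(X @ beta) > 0.5) == (y == 1))
```

All the model’s effort goes into the boundary itself (here, the point where the fitted probability crosses 0.5) rather than into describing how the features are distributed within each class.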
