NAIVE BAYES: GENERATIVE MAP CLASSIFICATION

Naive Bayes is one of the most widely used classification strategies and does surprisingly well in many practical situations. Naive Bayes is a generative method and allows each dimension to have its own distribution. The key difference between naive Bayes and LDA is that naive Bayes assumes that the dimensions are independent (Figure 10.5). This way, even if the data are modeled as Gaussian, the number of parameters is linear in the number of dimensions. Finally, naive Bayes uses the MAP classification rule so that it also includes a prior for each class, which turns out to be very important for datasets where the numbers of positives and negatives are not similar.

FIGURE 10.5 Different generative models underlie LDA and naive Bayes. This figure shows a comparison of the Gaussian models estimated for the two classification models. On the left, LDA assumes a single covariance matrix that captures the strong positive correlation in this data. On the right, naive Bayes captures the greater variance in the positive class by using different variances for each class, but assumes the dimensions are independent, and therefore misses the covariance. In this example, neither class is fit very well by a Gaussian, so neither generative classification model is expected to find the optimal (linear) classification boundary.

To use the MAP rule when there are two classes, we’d like to compare the posterior to 0.5, so the decision boundary is given by
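Coding the positive class as Y = 1 (an assumed labeling, chosen here only for illustration), this condition can be sketched as:

```latex
P(Y = 1 \mid X) = \frac{1}{2}
```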

As before, X represents a single new observation, and Y represents the unobserved (predicted) class of that observation. Since we have a generative model, we use Bayes’ theorem and write
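A sketch of that expansion, assuming the classes are coded Y = 1 (positive) and Y = 0 (negative):

```latex
P(Y = 1 \mid X) =
  \frac{P(X \mid Y = 1)\, P(Y = 1)}
       {P(X \mid Y = 1)\, P(Y = 1) + P(X \mid Y = 0)\, P(Y = 0)}
```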

As I mentioned, the key assumption of the naive Bayes classifier is that all the dimensions of X are independent conditioned on the class. This means we can write
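One way to write this factorization, assuming m dimensions indexed by j and a parameter set θ subscripted by class (1 for positive) and dimension (the subscript layout is an assumption made for illustration):

```latex
P(X \mid Y = 1) = \prod_{j=1}^{m} P(X_j \mid \theta_{1j})
```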

where I have written θ for the parameters of the probability distribution (or model) for the data of each dimension for the positive class. This formula says: To get the probability of the multidimensional observation, X, we just multiply together the probabilities of each component of X. Of course, we still have to define the probability distribution for each dimension for each class, but I hope it’s clear that the assumption of independence between dimensions leads to a great simplification. For example, if we choose a Gaussian distribution, we don’t have to worry about covariance matrices: We simply need a single mean and variance for each class for each dimension.
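The Gaussian case can be illustrated with a minimal sketch. The function names, the toy data, and the use of maximum-likelihood variances (dividing by n) are all assumptions made for this example, not something the text prescribes.

```python
import math

def fit_gaussian_nb(rows, labels):
    """Estimate one mean and one variance per class per dimension."""
    params = {}
    for k in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        n, dims = len(members), len(members[0])
        means = [sum(r[j] for r in members) / n for j in range(dims)]
        # Maximum-likelihood variance (divide by n); an assumed choice.
        variances = [
            sum((r[j] - means[j]) ** 2 for r in members) / n
            for j in range(dims)
        ]
        params[k] = (means, variances)
    return params

def log_likelihood(x, means, variances):
    """Sum of per-dimension Gaussian log densities.

    Independence turns the product over dimensions into a sum of logs,
    so no covariance matrix is needed.
    """
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        for xj, m, v in zip(x, means, variances)
    )

# Toy data, made up for illustration: 2 classes, 3 dimensions.
rows = [[1.0, 2.0, 0.0], [3.0, 4.0, 2.0], [0.0, 0.0, 1.0], [2.0, 2.0, 3.0]]
labels = [1, 1, 0, 0]

params = fit_gaussian_nb(rows, labels)
# 2 classes x 3 dimensions x (mean + variance)
n_params = sum(len(m) + len(v) for m, v in params.values())
print(n_params)
```

The parameter count grows as 2 × (number of classes) × (number of dimensions), which is the linear scaling that the independence assumption buys.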

Plugging the independence assumption back into the formula for the posterior probability gives
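In assumed notation (π for the prior on the positive class, θ subscripted 1 or 0 for the positive or negative class and j for the dimension), the substitution might be sketched as:

```latex
\begin{aligned}
P(Y = 1 \mid X)
  &= \frac{\pi \prod_{j} P(X_j \mid \theta_{1j})}
          {\pi \prod_{j} P(X_j \mid \theta_{1j})
           + (1 - \pi) \prod_{j} P(X_j \mid \theta_{0j})} \\
  &= \frac{1}{1 + \exp\!\left( -\left[ \sum_{j} \log
       \frac{P(X_j \mid \theta_{1j})}{P(X_j \mid \theta_{0j})}
       + \log \frac{\pi}{1 - \pi} \right] \right)}
\end{aligned}
```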

where I have used π to represent the prior probability of observing the positive class. Under naive Bayes, the decision boundary turns out to be a logistic function based on the sum of the log-likelihood ratios for each dimension, with an extra parameter related to the priors. This equation can be solved easily just by thinking about when the logistic function equals exactly 1/2.
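Since the logistic function equals 1/2 exactly when its argument is zero, the boundary condition would be, as a sketch with assumed subscripts (1 and 0 for the positive and negative class, j for the dimension):

```latex
\sum_{j} \log \frac{P(X_j \mid \theta_{1j})}{P(X_j \mid \theta_{0j})}
  + \log \frac{\pi}{1 - \pi} = 0
```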

or
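With the prior term moved to the right-hand side, a sketch of the equivalent form (the subscripts 1 and 0 for the positive and negative class are assumed labels):

```latex
\sum_{j} \log \frac{P(X_j \mid \theta_{1j})}{P(X_j \mid \theta_{0j})}
  = \log \frac{1 - \pi}{\pi}
```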

So the MAP rule says to compare the log-likelihood ratio to a cutoff related to the priors: The smaller the prior on the positive class, the larger the likelihood ratio (which is based on the data) needs to be before you classify the new datapoint as positive. In this way, the priors allow us to represent the maxim “extraordinary claims require extraordinary evidence.”
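The effect of the prior on the cutoff can be seen in a small sketch. The function names and the per-dimension log-likelihood ratios below are hypothetical, chosen only to illustrate the comparison described above.

```python
import math

def map_threshold(prior_pos):
    """Cutoff on the summed log-likelihood ratio: log((1 - pi) / pi)."""
    return math.log((1 - prior_pos) / prior_pos)

def classify_positive(log_lik_ratios, prior_pos):
    """Return True when the summed per-dimension log-likelihood ratios
    exceed the prior-dependent cutoff."""
    return sum(log_lik_ratios) > map_threshold(prior_pos)

# Hypothetical per-dimension log-likelihood ratios for one observation.
llrs = [1.2, 0.8, 0.5]

for prior in (0.5, 0.1, 0.01):
    print(prior, round(map_threshold(prior), 3), classify_positive(llrs, prior))
```

With these made-up numbers the same evidence is enough to call the observation positive when the prior is 0.5 or 0.1, but not when the prior is 0.01, because the cutoff rises as the prior on the positive class shrinks.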

We can go ahead and make this more specific by assuming distributions for each dimension. For example, in the case of categorical distributions for each dimension (as used for sequence data in Chapters 4 and 6), we can use clever indicator variable notation from Chapter 6 to write
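A sketch of that likelihood, assuming X_{jb} is an indicator equal to 1 when category b is observed in dimension j and 0 otherwise (the ordering of the subscripts on f is an assumption):

```latex
P(X \mid Y = k) = \prod_{j} \prod_{b} f_{kjb}^{X_{jb}}
```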

where f are the parameters of the discrete distribution for class k, corresponding to the categories b that could be observed in each dimension j. The naive Bayes classification boundary works out to be
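Substituting a categorical likelihood into the general boundary gives, as a sketch with assumed subscripts (k = 1 for the positive class, k = 0 for the negative class, X_{jb} an indicator for category b in dimension j, π the prior on the positive class):

```latex
\sum_{j} \sum_{b} X_{jb} \log \frac{f_{1jb}}{f_{0jb}}
  = \log \frac{1 - \pi}{\pi}
```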

In fact, this is exactly the statistic used to identify new examples of DNA motifs in the genome.
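As a rough illustration of how such a statistic could be used to scan a sequence, here is a minimal sketch. The 3-position motif, the uniform background used for the negative class at every position, the toy sequence, and the function names are all assumptions made for this example.

```python
import math

BASES = "ACGT"

def log_odds_score(seq, motif, background):
    """Sum over positions j of log(f_motif[j][b] / f_background[b]),
    where b is the base observed at position j."""
    return sum(
        math.log(motif[j][base] / background[base])
        for j, base in enumerate(seq)
    )

def scan(genome, motif, background, prior_pos):
    """Return start positions whose window score exceeds log((1 - pi) / pi)."""
    cutoff = math.log((1 - prior_pos) / prior_pos)
    w = len(motif)
    return [
        i for i in range(len(genome) - w + 1)
        if log_odds_score(genome[i:i + w], motif, background) > cutoff
    ]

# Hypothetical motif: each position is a distribution over A, C, G, T.
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {b: 0.25 for b in BASES}  # assumed uniform background

hits = scan("TTAGCTTAGC", motif, background, prior_pos=0.1)
print(hits)
```

Indexing the motif by the observed base plays the role of the indicator variables: only the parameter for the category actually seen at each position contributes to the sum.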