# PROBABILISTIC CLASSIFICATION MODELS

Luckily for us, several widely used classification methods follow directly from the probabilistic models I described for linear regression and clustering. For these classification methods, we’ll have the likelihood as the objective function and train parameters by maximizing the likelihood. We’ll therefore focus on these familiar models to illustrate the mathematical magic behind classification.

To represent classification tasks mathematically, we’ll usually assign one of the categories (say, suitcases containing illegal substances) to be 1 and the other (suitcases that are fine) to be 0. In the case of two classes, we’ll refer to one class as “positive” and the other as “negative.” It will always be arbitrary which one we assign to be 1 or 0, but we’ll always get the same answer if we did it the other way. In this formulation, we can then think of classification as a prediction about whether a new observation is 1 or 0.

If probabilistic models are to be used for classification, we can think of a new high-dimensional datapoint that we draw from a pool as having observed dimensions *X* (the “features”), as well as an “unobserved” dimension, that represents its true class, *Y* (the “target”). For example, let’s say we have trained a sniffer dog (or classification model) and we are given a new suitcase (or observation). We observe X„_{+1} and we wish to fill in the unobserved corresponding *Y _{n}+_{1}* based on the training data X

_{p}...,

*X*and Yp ..., Y

_{n}_{n}. If we are using a probabilistic model, we will use the training data to estimate some parameters 0, for example, by maximizing the likelihood P(X

_{p}...,

*X*

_{n}, Y_{1},..., Y

_{n}|0). In addition to choosing the parameters of the model to maximize an objective function (typically done in the training stage), we will therefore also have to consider a rule to fill in the “unobserved” data

*Y*(in the prediction stage). Since the objective functions for the training probabilistic models will be the usual suspects that we’ve seen for regression and clustering (likelihood, posterior probability, penalized likelihood, etc.), we won’t discuss them again here. Instead, we’ll spend a lot of time thinking about how to decide (mathematically) whether the new suitcase, number

_{n+1}*n*+ 1 (with associated features

*X*

_{n+1}), should pass through (whether the unobserved Y

_{n}+

_{1}was really 1 or 0).

The maximum likelihood classification rule says (1) calculate the probability of observing the smells of that suitcase given that there *are* illegal substances in the suitcase, and the probability of observing the same smells given that there *are no* illegal substances, and (2) bark (assign the suitcase to the positive class) if the probability of the smell given that there *are* illegal substances is higher. In other words, the rule says assign X_{n}+_{1} to class *k* such that P(X_{n}+_{1}|Y_{n}+_{1} = *k,* 0) is maximized.

On the other hand, the MAP classification rule says: (1) Calculate the posterior probability that there *are* illegal substances in the suitcase given the smells that were observed. (2) Bark (assign the suitcase to the positive class) if the posterior probability is greater than 0.5 (since there are only two choices). In other words, assign X_{n}+_{1} to class *k* such that *P(Y _{n}+_{1}* = k|X

_{n}+

_{1}, 0) is maximized.

As you can imagine, other classification rules are possible for probabilistic classification models and absolutely necessary for classifiers that don’t have a probabilistic interpretation. Furthermore, even if a classifier has a probabilistic interpretation, it might not be feasible to compute the likelihood or the posterior probabilities, so another rule will be needed. A simple example of a nonprobabilistic classification rule is “nearest neighbour” classification. It says—assign the new observation to the class of the observation that is closest in the training set.

## SOME POPULAR CLASSIFICATION RULES

*ML:* Assign to class *k,* such that *P(X _{n+1}Y_{n+1} = k)* is maximized.

*MAP:*Assign to class k, such that P(Y

_{n+1}= k|X

_{n}+-|) is maximized.

*Nearest-neighbor:*Assign to

*Y,*such that

*d(X*is minimized.

_{n}+_{b}X)