Another way to develop a nonlinear classification boundary is to use several linear classifiers to approximate the nonlinear one. By applying simple classifiers sequentially, it’s possible to build up a very accurate nonlinear one. “Decision trees” are one of the simplest ways to do this.
Decision trees not only use linear classifiers at each step, but they restrict these to be univariate (one-dimensional). Thus, a decision tree corresponds to a series of simple “cutoffs” at each step. These cutoffs are arranged hierarchically, and because at each stage a datapoint be either above or below the cutoff (the “decision”), the series of cutoffs form a bifurcating tree (Figure 11.6).
Clearly, decision trees represent a great way to simplify the classification of complex data using simple-to-understand rules. The problem,
FIGURE 11.6 Decision trees. (a) An example of a four-layer decision tree to classify the data in (b). In (a), the numbers (i)-(iv) correspond to the decision boundaries (dotted lines) in (b).
of course, is how we chose the rules—in other words, how do we train the classifier? The classical approach to do this is an algorithm known as CART (Breiman et al. 1984). The idea is to choose the cutoff at each stage that maximizes the “purity” of the training data on either side of the cutoff. There are different ways to measure the “purity,” but in the popular CART algorithm, it was done by minimizing the “Gini impurity,” which is
k indexes the classes (typically positive or negative)
f is the fraction of the datapoints in the class
So at each step the CART algorithm chooses a cutoff in one dimension by minimizing I for that dimension. As usual, the details turn out to matter, so let’s write out the CART algorithm a bit more formally:
Step 1: For each dimension of the data, choose the cutoff that maximizes the purity of the training data.
Step 2: Choose the dimension that did the best in step 1.
Step 3: Repeat steps 1 and 2 until there is no cutoff that improves the purity more than some threshold.
There are two key issues with CART (and decision trees in general). First, the tree has to be built up one step at a time. It’s easy to think of examples where there are several good choices of which dimension to choose for the first step, but later on it becomes clear that one of the dimensions would have been better. Unfortunately, once the first choice is made by the algorithm, it can’t go back. Thus, CART is a “greedy” algorithm—it maximizes the objective function in each step, but doesn’t necessarily find the global optimum. The second problem with decision trees is overfitting— in general we don’t know how big an improvement to the impurity we need to make in order to justify adding an extra level to the hierarchy. In the traditional decision trees implementations, this is addressed using a so-called “pruning” step where cross-validation is used to remove lower levels of the hierarchy that do not improve the classification accuracy on the test data. Pruning of the lower levels of the tree is straightforward, but for complex trees with multiple levels, overfitting may occur at multiple levels and be difficult to prune.