CLASSIFICATION BOUNDARIES AND LINEAR CLASSIFICATION
All classification methods aim to find a boundary between groups of data- points that separates the different classes in the high-dimensional space of observations, often referred to as the “feature space.” Linear classification methods aim to find a boundary between classes that can be expressed as a linear function of the observations in each dimension (often referred to as the “features”). For example, let’s consider a typical molecular biology task of classifying cell types based on gene expression patterns; in this case, we’ll try to classify T cells from other cells based on gene expression data from ImmGen (Figure 10.1).
Notice that one of the linear classification boundaries corresponds to a horizontal line. This means that only the data on the vertical axis are being used for classification—a one-dimensional classification. This is equivalent to simply ranking the cells by one of the expression levels and choosing a cutoff—above the cutoff we say everything is a T cell. I hope it’s clear that by using both dimensions (a diagonal line) we’ll be able to do a slightly better job of separating the T cells (+) from all others. You can see that even though the dimensions are highly correlated, there is still a lot of information gained by using both of them. As we go from a 1-D to a 2-D classifier, we’ll have to train an additional parameter to write the equation for the linear classification boundary as opposed to simple
FIGURE 10.1 Decision boundaries in a two-dimensional feature space. Possible decision boundaries for classification of T cells based on gene expression levels of the CD8 alpha and beta chain.
cutoff. All linear classification methods will draw a line (or plane in higher dimensions) through the data, although they will disagree about what the line should be.
Of course, we might be able to do better by choosing a curve (or more generally a nonlinear boundary) to separate the classes. In this example, we can reduce the false positives (negatives on the positive side of the boundary; we will return to evaluation of classifiers in Chapter 12) by using an ellipse as the classification boundary instead of a line. However, as we go to more complicated high-dimensional nonlinear functions, we’ll need more and more parameters to specify the classification boundary. We will return to nonlinear classification boundaries, so-called nonlinear classification, in Chapter 11. In general, in designing machine-learning methods for classification, we will face a trade-off between choosing the most accurate, possibly complex classification boundary versus a simpler (fewest parameters and dimensions) boundary whose parameters we can actually train.