EVALUATING CLASSIFIERS WHEN YOU DON'T HAVE ENOUGH DATA
So far, we have been assuming that classifiers are trained and their parameters set using the training and validation data, and that performance measures (e.g., the ROC curve) are computed on test data that the classifier has never seen before.
In practice, there are many cases where we simply don’t have enough data to train and evaluate classifiers this way. Typically, when we are considering high-dimensional data, we need to estimate a few parameters for each dimension. We might also try to model the correlations between dimensions to avoid being distracted by features that are not really helping us. To do all of this, we need data. A conservative rule of thumb is that you need 10 datapoints per parameter to train a classifier. You might be thinking: data, no problem, that’s what genomics is all about. I’ve got thousands of measurements in every experiment. However, remember that in the classification world, there are parameters associated with each class, so we need sufficient observations from each class in order for the classifier to learn to recognize it. This becomes particularly problematic when the class we are trying to find is rare: say, we only know of 10 genes that are important for nervous system development and we want to identify more. Although we can use the “big data” to easily estimate parameters for the genes that are not important for nervous system development (the negative class), we also need enough known positive examples to train and evaluate the classifier. If we use all 10 known genes for our training set, then we have nothing left with which to evaluate the classifier.
Amazingly, it is still possible to train a classifier even when you don’t have enough data. The main way this is done is through cross-validation. (We have already seen cross-validation in the context of choosing penalties for regularized regression.) The idea of cross-validation is to do many iterations of the “ideal” classification setup: leave out a fraction of the data, train on the rest, and evaluate on the left-out data. It’s important to stress that classification performance should only be evaluated on held-out (or test) data. This is because the number of parameters in modern machine learning methods can be very large. If a classifier has enough complexity (e.g., enough parameters), it will be able to predict the training data perfectly. In general, we don’t care how well a classifier does on the training data—those are the suitcases where we already know what’s inside. The classifier is usually only of interest for how well it can predict new examples. Although in the simple prediction problems used for illustration here it might seem easy to clearly delineate the training data from the test data, in complicated genomics and bioinformatics applications there are typically many steps of data analysis. It’s easy for (even very experienced) researchers to forget that some of the early steps of the analysis “saw” the entire dataset. In these cases, cross-validation performed on later steps of the analysis will overestimate the accuracy of the classification model (Yuan et al. 2007). In general, we have to be vigilant about separating unseen data and ensuring that it remains unseen throughout the data analysis.
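To see why evaluation on held-out data is essential, here is a minimal sketch (using scikit-learn; the sample size, feature count, and classifier settings are my own choices, not from the text). A flexible linear classifier fits completely random labels perfectly on the training set, yet performs at chance on held-out data, because "memorizing" 30 points in a 200-dimensional space is trivial.

```python
# Sketch: with more noise features than training points, a separating
# hyperplane always exists, so training accuracy is a useless measure.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))      # 60 samples, 200 pure-noise features
y = rng.integers(0, 2, size=60)     # labels carry no real signal at all

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0)
# A very large C makes the SVM effectively unregularized (hard margin)
clf = SVC(kernel="linear", C=1e6).fit(X_tr, y_tr)

print("training accuracy:", clf.score(X_tr, y_tr))  # perfect memorization
print("test accuracy:", clf.score(X_te, y_te))      # near chance (0.5)
```

Since the labels are random, the near-perfect training accuracy reflects only the classifier's capacity to memorize, not any real predictive ability.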
To illustrate the magic of cross-validation, let’s try to classify cells based on single-cell RNA expression data. As we saw in Chapter 2, the expression levels measured in these experiments fit standard probability distributions very poorly. To simplify the problem somewhat, we’ll start by trying to distinguish 96 LPS-stimulated cells (positives) from 95 unstimulated cells (negatives) based on expression levels for the 100 genes with the highest average expression (over the entire dataset). We will start with twofold cross-validation, which means that we randomly divide the data into two equal parts. We use one part to estimate the parameters of the classifier, and then compute an ROC curve to evaluate the classifier on the part of the data that were left out.
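The twofold procedure can be sketched as follows. Synthetic data stands in for the real single-cell expression matrix (the dimensions mimic the example: 96 positives, 95 negatives, 100 genes), and the planted signal in a few features is my own assumption for illustration.

```python
# Sketch of twofold cross-validation with ROC evaluation on both the
# training and held-out halves. The data here are synthetic stand-ins.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
n_pos, n_neg, n_genes = 96, 95, 100
X = rng.normal(size=(n_pos + n_neg, n_genes))
X[:n_pos, :5] += 1.0                       # weak signal in 5 "genes"
y = np.r_[np.ones(n_pos), np.zeros(n_neg)]

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
for fold, (train, test) in enumerate(cv.split(X, y)):
    clf = LinearDiscriminantAnalysis().fit(X[train], y[train])
    auc_train = roc_auc_score(y[train], clf.decision_function(X[train]))
    auc_test = roc_auc_score(y[test], clf.decision_function(X[test]))
    print(f"fold {fold}: train AUC {auc_train:.2f}, test AUC {auc_test:.2f}")
```

Because each fold trains on only ~95 cells with 100 features, the training AUC is nearly perfect while the held-out AUC is noticeably lower, mirroring the pattern in Figure 12.4.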
Figure 12.4 illustrates ROC curves for twofold cross-validation. In each case, the ROC curve was computed on both the training and test data to highlight the importance of evaluating the classifier on the held-out data. For LDA (Linear Discriminant Analysis, discussed in Chapter 10) and SVM (Support Vector Machines, discussed in Chapter 11), perfect classification is achieved on the training set in every case. However, on the test set, the classification accuracy is much lower. This illustrates how classification accuracy tends to be inflated on the training set, presumably due to overfitting.
If the difference in classification performance between the test and training sets is really due to overfitting, it should be possible to reduce this difference using regularization (as discussed in Chapter 9). Indeed, Figure 12.4 shows that using penalized logistic regression greatly decreases the difference between the training and test sets. However, if the penalty chosen is too large, the classification performance is not as good on either the training or the test set.
FIGURE 12.4 ROC curves on the training and test sets for twofold cross-validation using various classification methods. In each panel, the performance on the training data is shown for the two random samples of the data in black traces, while the performance on the held-out samples is shown in the gray traces. Note that the classification performance depends on the particular random split selected in each cross-validation run, and is therefore a random variable. In the upper panels, no regularization is used and the classification models achieve perfect separation on the training set, but much worse performance on the test sets. In the bottom panels, regularization is used, and the training performance is similar to the test performance.
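A sketch of the regularization effect, again on synthetic stand-in data (the penalty strengths and the use of an L1 penalty here are my own choices, not necessarily those used for Figure 12.4). With a very strong penalty, all coefficients shrink to zero and performance collapses to chance on both the training and test sets.

```python
# Sketch: penalized logistic regression with weak, moderate, and very
# strong penalties. In scikit-learn, smaller C means a stronger penalty.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(190, 100))
X[:95, :5] += 1.0                          # signal in the first 5 features
y = np.r_[np.ones(95), np.zeros(95)]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

results = {}
for C in (1e6, 1.0, 1e-4):                 # weak -> very strong penalty
    clf = LogisticRegression(penalty="l1", solver="liblinear",
                             C=C, max_iter=5000).fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, clf.decision_function(X_tr))
    auc_te = roc_auc_score(y_te, clf.decision_function(X_te))
    results[C] = (auc_tr, auc_te)
    print(f"C={C:g}: train AUC {auc_tr:.2f}, test AUC {auc_te:.2f}")
```

With a moderate penalty the gap between training and test AUC narrows; with the extreme penalty the model is all shrinkage and no fit.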
Note that in this example, I divided the data into two equal parts, so we could look at the classification performance using ROC curves. In general, half of the data might not be enough to train the model, and you might want to use 90% of the data for training and leave only 10% for testing. In that case, you would repeat the analysis 10 times, so that each part of the data is used as the test set exactly once. Since it’s not easy to compare 10 ROC curves, you can instead pool all of the predictions on the left-out data together and make a single ROC curve.
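The pooling trick can be sketched as follows (again on synthetic stand-in data): each sample is scored exactly once, while it sits in the held-out fold, and the pooled scores give one ROC curve summarizing all 10 folds.

```python
# Sketch of 10-fold cross-validation with pooled held-out scores,
# yielding a single ROC curve instead of 10 separate ones.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(190, 100))
X[:95, :5] += 1.0                          # planted signal, as before
y = np.r_[np.ones(95), np.zeros(95)]

scores = np.empty_like(y)                  # one held-out score per sample
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)
for train, test in cv.split(X, y):
    clf = LinearDiscriminantAnalysis().fit(X[train], y[train])
    scores[test] = clf.decision_function(X[test])   # scored once each

fpr, tpr, _ = roc_curve(y, scores)         # one pooled ROC curve
print("pooled test AUC:", round(roc_auc_score(y, scores), 2))
```

Because every score in the pool was produced by a model that never saw that sample, the pooled curve is an honest estimate of held-out performance.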