In the previous example, I suggested dividing the training set into 2 or 10 fractions and running the classification on each part. In the limit, you get “leave-one-out” cross-validation, where the classifier is trained once for each data point, each time leaving out only that data point. The classification performance measures are then computed on the left-out data points and summarized at the end. I hope it’s clear that leave-one-out cross-validation makes up for a lack of data by increasing the computational burden: the parameters of the classifier are re-estimated many times, in proportion to the number of data points in the dataset. So, if a classification method needs a number of calculations proportional to the size of the dataset (say n) to estimate the parameters, the leave-one-out cross-validation estimate of performance takes on the order of n² calculations. Nevertheless, even in today’s “data-rich” molecular biology, we are usually data limited and not compute limited. So leave-one-out cross-validation is the most popular way to evaluate classification methods in molecular biology (and predictive machine-learning models in general). Because leave-one-out cross-validation uses almost the entire training set in each iteration, it is thought to give the most reliable estimate of the parameters, and therefore the best guess at how the classifier would perform on new data (if the whole training set were used).
Figure 12.5 shows ROC curves for LDA and an SVM used to classify cell type based on single-cell expression data from 100 highly expressed genes. Note that both classifiers achieve very good performance, and there doesn’t seem to be an advantage to nonlinear classification using the SVM. This suggests that, although the data are 100-dimensional, the classes are linearly separable in that high-dimensional space.
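The leave-one-out procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the analysis behind Figure 12.5: the data are simulated Gaussian “expression” values, and a simple nearest-centroid classifier (a linear classifier chosen here for brevity) stands in for LDA. The function names `fit_nearest_centroid`, `predict`, `leave_one_out`, and `simulate` are hypothetical helpers. The point to notice is that the classifier is refit n times, once per held-out data point, which is where the roughly n² cost comes from when each fit costs on the order of n.

```python
import random

def fit_nearest_centroid(X, y):
    """Estimate one mean vector (centroid) per class: the classifier's parameters."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict(model, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, model[label]))
    return min(model, key=sq_dist)

def leave_one_out(X, y):
    """Refit the classifier n times, each time predicting the single held-out point."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]   # everything except data point i
        y_train = y[:i] + y[i + 1:]
        model = fit_nearest_centroid(X_train, y_train)
        predictions.append(predict(model, X[i]))
    return predictions

def simulate(n_per_class, n_genes, separation, rng):
    """Simulated data (an assumption): class 1 is shifted by `separation` in every gene."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * separation, 1.0) for _ in range(n_genes)])
            y.append(label)
    return X, y

rng = random.Random(0)
X, y = simulate(n_per_class=30, n_genes=5, separation=2.0, rng=rng)

predictions = leave_one_out(X, y)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"{len(predictions)} refits, leave-one-out accuracy = {accuracy:.2f}")
```

Every prediction is made by a model that never saw the data point it is scoring, so the summary accuracy is an estimate of performance on unseen data drawn from the same distribution.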
A very important cautionary note about cross-validation is that it only ensures that the classifier is not overfitting to the data in the training sample. Thus, the cross-validation estimates of the classification performance will reflect the performance on unseen data, provided that those data have the same underlying distribution as the training sample. In many cases, when we are dealing with state-of-the-art genomics data, the data are generated from new technologies that are still in development. Both technical and biological issues might make the experiment hard to repeat. If any aspect of the data changes between the training sample and the subsequent experiments, it is no longer guaranteed that the cross-validation accuracy will reflect the true classification accuracy on new data. In machine learning, this problem is known as “covariate shift” (Shimodaira 2000), reflecting the idea that the distribution of the data in the underlying feature space might change. Because the dimensionality of the feature space is large, and the distribution of the data in that space might be complicated, it’s not easy to pinpoint what kinds of changes are happening.
FIGURE 12.5 Leave-one-out cross-validation. In the left panel, the performance is shown for LDA and a support vector machine with a Gaussian kernel (SVM). The right panel shows leave-one-out cross-validation performance estimates on the original dataset (used for training and testing as in the left panel), and performance on a biological replicate where the model is trained on the original data and then applied to the replicate.
I can illustrate this issue using the single-cell RNA-seq data introduced in Chapter 2 because this dataset includes a “replicate” set of 96 LPS cells and 96 unstimulated cells. These are true biological replicates: different cells from different mice, sequenced on different days. When I train the classifier on the original set and then apply it to these “replicate” data, the classification performance is not nearly as good as the leave-one-out cross-validation suggests it should be. ROC curves are shown in Figure 12.5. Note that this is not due to overfitting of the classifier (using a regularized model, such as penalized logistic regression, does not help in this case). These are real differences (either biological or technical) between the two datasets, such that features associated with the cell class in one replicate are not associated in the same way in the second replicate. Because the problem is in a 100-dimensional space, it’s not easy to figure out what exactly has changed.
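To make the gap between cross-validation and replicate performance concrete, here is a small simulation. It is a sketch under loud assumptions, not a reconstruction of the real data: the “replicate” is simulated by adding a constant offset (the hypothetical `batch_offset` parameter) to every gene in every cell, which is just one of many ways two datasets could differ, and a nearest-centroid classifier again stands in for LDA. The sample size of 96 cells per class mirrors the replicate set; using the same size for the simulated original set is also an assumption.

```python
import random

def fit_centroids(X, y):
    """Per-class mean vectors: the parameters of a nearest-centroid classifier."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))

def accuracy(model, X, y):
    """Fraction of cells in (X, y) that a fixed, already-trained model gets right."""
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def loo_accuracy(X, y):
    """Leave-one-out estimate: refit without cell i, then score cell i."""
    hits = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)

def simulate(n_per_class, n_genes, batch_offset, rng):
    """Two classes 2 units apart per gene; batch_offset shifts all cells of both classes."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(2.0 * label + batch_offset, 1.0) for _ in range(n_genes)])
            y.append(label)
    return X, y

rng = random.Random(1)
X_orig, y_orig = simulate(96, 5, batch_offset=0.0, rng=rng)   # "original" experiment
X_rep, y_rep = simulate(96, 5, batch_offset=2.0, rng=rng)     # simulated shifted replicate

loo = loo_accuracy(X_orig, y_orig)
rep = accuracy(fit_centroids(X_orig, y_orig), X_rep, y_rep)
print(f"leave-one-out accuracy on original data: {loo:.2f}")
print(f"accuracy on shifted replicate:           {rep:.2f}")
```

In this toy setting the classifier is not overfit at all, yet the leave-one-out estimate is far too optimistic for the replicate, because the replicate no longer comes from the distribution the classifier was trained on.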