CLASSIFICATION PERFORMANCE STATISTICS IN THE IDEAL CLASSIFICATION SETUP
As we saw in Chapters 10 and 11, the machine learning community has already come up with many creative approaches to classification that can work in a wide variety of settings, so most of the time we can choose from what is already available and avoid inventing new classification methods. However, molecular biology experiments often yield new types of high-dimensional data on which we would like to train a classifier that can then be applied to new experimental data. In general, we don’t know in advance which classifiers (let alone which parameters) will give good performance. Therefore, it’s of critical importance for molecular biologists to know how to train classifiers correctly and how to evaluate their performance.
In the ideal case (described in Chapter 10), the available data have been divided into three parts. The first part (the training set) is used to estimate the parameters of the classification model by maximizing some objective function. The second part (the validation set) is used to compare classifiers and to make choices that can’t be determined from the training data alone, for example, the classification cutoff, the regularization penalty, the kernel, or the neighbourhood size, k. The third part (the test set) consists of data that were not used at all during training or parameter selection. The performance of the classifier is measured using this test set (also sometimes called unseen or held-out data). There are a number of performance measures that can be calculated, and we will discuss several in turn.
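To make the three-way split concrete, here is a minimal Python sketch using only the standard library. Everything in it is an illustrative assumption rather than something prescribed in the text: the toy two-class data, the 60/20/20 split fractions, the choice of a k-nearest-neighbour classifier, the candidate values of k, the use of accuracy as the performance measure, and the helper names (make_toy_data, split_three_ways, knn_predict, accuracy). Nearest-neighbour methods have no objective function to maximize, so here the training set simply supplies the stored examples; the validation set chooses k, and the test set is touched only once at the end.

```python
import random


def make_toy_data(n=300, n_features=5, seed=1):
    """Hypothetical two-class data: class 1 has its feature means shifted by 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        features = [rng.gauss(float(label), 1.0) for _ in range(n_features)]
        data.append((features, label))
    return data


def split_three_ways(items, frac_train=0.6, frac_valid=0.2, seed=0):
    """Shuffle once, then cut into training, validation and test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(frac_train * len(shuffled))
    n_valid = round(frac_valid * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], x))

    nearest = sorted(train, key=sq_dist)[:k]
    votes_for_1 = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_1 > k else 0


def accuracy(train, data, k):
    """Fraction of points in `data` that are classified correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)


data = make_toy_data()
train, valid, test = split_three_ways(data)

# Choose the neighbourhood size k using the validation set only.
candidate_ks = [1, 3, 5, 9, 15]  # odd values, so votes cannot tie
valid_scores = {k: accuracy(train, valid, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: valid_scores[k])

# Measure performance once, on data never used for fitting or selection.
test_accuracy = accuracy(train, test, best_k)
print("validation accuracy by k:", valid_scores)
print("chosen k:", best_k, "test accuracy:", test_accuracy)
```

Because k is picked to do well on the validation set, the validation accuracy for the chosen k tends to be optimistic. That is why the final figure is computed on the test set, which played no part in either step.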