MEASURES OF CLASSIFICATION PERFORMANCE
Perhaps the most obvious and familiar of performance measures is the accuracy. This simply measures the number of correct predictions (or classifications) as a fraction of the total number of predictions. In practice, however, whether the accuracy is the most useful measure of classification performance depends on the problem. For example, consider a sniffer dog that has fallen asleep at the airport and lets all the suitcases through. If the fraction of suitcases containing illegal substances is very small, say 1%, the sleeping dog will still be a very accurate classifier: The only mistakes will be on the suitcases with illegal substances (1%). The sleeping dog’s accuracy will be 99%, but I hope it’s clear that this is not the kind of behavior we want. In this case, we might be willing to accept a few mistakes on the suitcases without illegal substances, if the dog is very accurate on the small fraction of suitcases that actually have something bad inside. We might accept a lower accuracy overall, if we could be confident we were catching more of the illegal suitcases. This is a case where the “positives” (illegal-substance-containing suitcases) are rare, but more important than the “negatives” (legal suitcases). On the other hand, if positives and negatives are of similar importance and frequency, the accuracy of the classifier might be a good measure of performance.
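The sleeping dog’s 99% accuracy can be verified with a few lines of arithmetic. The sketch below uses hypothetical counts (10,000 suitcases, 1% of them positive); the label coding of 1 for positive and 0 for negative is an assumption of the example, not anything from the text.

```python
# Toy illustration (hypothetical data): the "sleeping dog" classifier
# predicts negative (0) for every suitcase.
n = 10_000
labels = [1] * (n // 100) + [0] * (n - n // 100)  # 1% of suitcases are positive
predictions = [0] * n                             # the sleeping dog never barks

# Accuracy = correct predictions / total predictions
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / n
print(f"accuracy = {accuracy:.0%}")  # 99%
```

The only errors are the 100 positive suitcases, so the accuracy is 9,900/10,000 = 99% despite the classifier doing nothing useful.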
My purpose in bringing up the example of rare, important positives is not purely pedagogical. In fact, in many molecular biology and genome biology applications, we are in just such a situation. For example, consider the classic BLAST (Altschul et al. 1990) homology detection problem: We seek to classify which sequences from the database are homologous to the query sequence. In this case, the number of homologous sequences (positives) is a tiny fraction of the database. Identifying a small number of homologues accurately is much more important than misidentifying sequences that are not homologous to the query.
In cases where positives are very important, as is typically the case in genome-scale molecular biology applications, people typically consider the “sensitivity” or “true positive rate (TPR)” or “recall.” This is simply the fraction of all true positives that were successfully identified. I hope it’s clear that the sleeping dog would get 0% in this measure (despite an accuracy of 99%). However, there is also an extreme case where the TPR is not a useful performance measure: the hyperactive sniffer dog. This dog simply barks constantly and classifies every suitcase as containing illegal substances. This classifier will achieve a TPR of 100%, because it will classify the rare positives correctly as positives. However, as I’m sure you already figured out, the accuracy of this dog will be terrible—it will only get the positives right, and therefore the accuracy will be 1%, or 99 times worse than the sleeping dog. This is the bioinformatics equivalent of a sequence database search that predicts every sequence to be a homolog of the query. This is of no practical utility whatsoever.
Thus, although it might be what we care about, on its own, TPR is not a very useful measure of classification performance. Luckily, TPR can be combined with other performance measures that improve its utility. One measure that TPR is often combined with is the “false positive rate (FPR).” The FPR is the fraction of truly negative datapoints that were (falsely) predicted to be positive. Notice how this will rein in the hyperactive sniffer dog—now we are keeping track of how often a positive prediction is made when there is no illegal substance in the suitcase. To fairly compare classifiers, we choose the parameters to produce an FPR that is acceptable to us (say 5% based on the validation set) and then ask which classifier has a better TPR on the test set. In the case of the airport suitcases, where only 1 in 100 suitcases are illegal, an FPR of 5% corresponds to misidentifying roughly five legal suitcases for every one that actually contains an illegal substance.
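The five-to-one ratio at the end of the paragraph follows directly from the definitions; the sketch below works it out with hypothetical counts, assuming (as the text implies) a TPR of 100% so that every illegal suitcase is caught.

```python
# Toy illustration (hypothetical counts): with 1% positives and FPR = 5%,
# a perfectly sensitive classifier flags about five legal suitcases for
# every illegal one it catches.
n = 10_000
n_pos = n // 100        # 100 suitcases with illegal substances
n_neg = n - n_pos       # 9,900 legal suitcases

fpr = 0.05
false_positives = fpr * n_neg   # FPR is a fraction of the *negatives*: 495 flagged
true_positives = n_pos          # assume TPR = 100%: all 100 illegal suitcases caught

print(false_positives / true_positives)  # about 5 false alarms per real hit
```

Note the ratio is 4.95 rather than exactly 5 because the FPR is taken over the 9,900 negatives, not over all 10,000 suitcases.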
A low FPR is of critical importance for any classifier that is to be applied to large amounts of data: If you are trying to classify millions of data points and the true positives are very rare, even an FPR of 5% will leave you with tens of thousands of false positives to sort through. For a BLAST search, 5% FPR would leave you with thousands of nonhomologous sequences to look at before you found the real homologs. As I hope is clear from this example, the FPR is related to the false discovery rate (FDR) in the multiple testing problem that we discussed in Chapter 3.
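To make the connection to the FDR concrete, consider a hypothetical database search. The numbers below (a database of one million sequences, 100 true homologs) are invented for illustration, and the FDR is computed as FP / (FP + TP), again assuming all the true homologs are found.

```python
# Toy illustration (hypothetical counts): when positives are rare, even a
# small FPR means nearly all predicted positives are false discoveries.
database_size = 1_000_000
n_homologs = 100                      # true positives are a tiny fraction
fpr = 0.05

false_positives = fpr * (database_size - n_homologs)  # ~50,000 sequences flagged
true_positives = n_homologs                           # assume TPR = 100%

# FDR = fraction of predicted positives that are actually negative
fdr = false_positives / (false_positives + true_positives)
print(f"FDR = {fdr:.1%}")  # nearly every "hit" is a false positive
```

Even with perfect sensitivity, about 500 false hits appear for every real homolog, which is why genome-scale applications typically demand FPRs far below 5%.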
Although the accuracy (ACC), FPR, and TPR are among the more widely used measures of classification performance, many other measures are in use, and depending on the problem, some measures of performance will be more relevant than others.