Throughout the discussion of classification so far, we have been assuming that there are two classes, positives and negatives. In reality, of course, there are situations where we have more than two classes (often referred to as multiclass classification). Some of the linear methods discussed here (e.g., logistic regression) generalize in a totally straightforward way to the multiclass case (see Exercises). For others (e.g., SVMs), it’s less clear how well they will work with more than two classes.
A classic multiclass problem in bioinformatics is classification of 3-D protein folds based on amino acid sequence. Based on extensive inspection of known 3-D structures, proteins can be classified into one of more than a thousand “folds,” including things like Globin-like, ^-sandwich, and TIM-barrel (SCOP release 1.75, Andreeva et al. 2007). Having sequenced a genome and predicted protein sequences based on the genetic code, researches might wish to know which (if any) of the previously known 3-D folds a novel protein belongs in. You can see why this multiclass situation is much more difficult than the binary classification problems that we’ve been considering so far: to apply a probabilistic model to this problem, we need to train a model for each of the folds. In this problem, that means thousands of models, as opposed to just two in the binary classification setting.
One relatively straightforward (but inelegant) way to deal with the multiclass classification problem is to break down the problem into a series of binary decisions and train classifiers for each of them. The most typical way to do this is using the “one-versus-all” approach where a classifier is trained for each class using the observations of that class as the positives and the positives from all the other classes as negatives. For example, to train a classifier for ^-sandwich proteins, an SVM could be trained to distinguish between the known ^-sandwich proteins and all other classes. This yields a binary classifier for each type of positive. New examples are then tested against the classifiers for each class. Although this approach can work well in many cases, the main problem is that often more than one of the classifiers will identify the new example as a positive. In this case, the user doesn’t know which class to assign it to. It’s important to note that in some cases, assigning a new datapoint to multiple classes might not be a problem, in which case the one-versus-all approach might be just what you want. Indeed, since proteins can contain multiple protein domains, the Pfam database uses a series of thousands of Hidden Markov Models or HMMs, where each one can potentially identify a positive in a new query protein.
An alternative to the “one-versus-all” approach is the so-called “all- versus-all” approach. In this case, a binary classifier is trained to distinguish between each pair of classes. In the protein-fold example, this means that an SVM would be trained to distinguish ^-sandwich proteins from Globin-like, a second SVM would be trained to distinguish в-sandwich from TIM-barrel, and so-forth for every pair of classes. A new example is then compared to all the classifiers and assigned to the class which the most pairwise classifiers found positive. In the ideal case, a new в-sandwich protein is correctly identified by all the в-sandwich versus other individual SVMs, and all the other SVMs choose randomly. This approach was used for classification of protein folds and produced greatly improved results relative to the “one-versus-all” approach (Ding and Dubchak 2001). However, the major drawback of this approach was that the number of binary classifiers needed scales as the number of pairs of classes, lAm(m - 1), where m is the number of classes. This means that as the number of protein folds approached 1,000, the number of pairwise SVMs needed approached 50,000.
In practice, for current protein fold predictions, this problem is actually handled is using nearest neighbor-style approaches—nearest neighbor classification doesn’t require a model for the classes, only a means to compare the new datapoint to the previous observations. Sophisticated bioinformatics pipelines are used identify the most similar proteins whose folds are known, and the new protein’s fold is inferred based on those. The key step becomes finding computationally efficient ways to compare the new protein to the entire database of known structures. Modern protein fold predictors such as Phyre (Kelly and Sternberg 2009) rely on huge sequence and structure databases and large compute clusters—users must submit their proteins for analysis via the World Wide Web.