Classification is the task of assigning observations to previously defined classes. It underlies many of the mainstream successes of machine learning: spam filters, face recognition in photos, and the Shazam app. Classification techniques also form the basis for many widely used bioinformatics tools and methodologies. Typical applications include prediction of gene function from protein sequence or genome-scale experimental data, and identification of disease subtypes and biomarkers. Historically, statistical classification techniques were used to analyze the power of medical tests: given the outcome of a blood test, how accurately could a physician diagnose a disease?
Increasingly, sophisticated machine learning techniques (such as neural networks, random forests, and support vector machines, or SVMs) are used in popular software for scientific data analysis, and it is essential that modern molecular biologists understand the concepts underlying them. Because classification applies so widely to everyday problems in the information technology industry, it has become a large and rapidly developing area of machine learning. Biomedical applications of these methodological developments often lead to important advances in computational biology. However, before applying these methods, it is critical to understand the specific issues that arise in genome-scale analysis, particularly with respect to evaluating classification performance.