WHAT WILL THIS BOOK COVER?
This book aims to give advanced students in molecular biology enough statistical and computational background to understand (and perform) three of the major tasks of modern machine learning that are widely used in bioinformatics and genomics applications:
1. Clustering
2. Regression
3. Classification
Given a set of data, clustering aims to divide the individual observations into groups or clusters. This is a very common problem in several areas of modern molecular biology. In the genomics era, clustering has been applied to genome-wide expression data to find groups of genes with similar expression patterns; it often turns out that these genes do work together (in pathways or networks) in the cell and therefore share common functions. Finding groups of similar genes using molecular interaction data can implicate pathways or help lead to hypotheses about gene function. Clustering has therefore been applied to all types of gene-level molecular interaction data, such as genetic and physical protein interactions. Proteins and genes that share sequence similarity can also be grouped together to delineate “families” that are likely to share biochemical functions. At the other end of the spectrum, finding groups of similar patients (or disease samples) based on molecular profiles is another major current application of clustering.
Historically, biologists wanted to find groups of organisms that represented species. Given a set of measurements of biological traits of individuals, clustering can divide them into groups with some degree of objectivity. In the early days of the molecular era, evolutionary geneticists obtained DNA and protein sequences in the hope of finding patterns that could relate the molecular data to species relationships. Today, inference of population structure by clustering individuals into subpopulations (based on genome-scale genotype data) is a major application of clustering in evolutionary genetics.
Clustering is a classic topic in machine learning because neither the nature of the groups nor the number of groups is known in advance. The computer has to “learn” these from the data. A very large number of clustering methods have been proposed, and the bioinformatics literature has contributed many of them.
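To make the idea of grouping observations concrete, here is a minimal sketch of one simple clustering method, k-means, applied to a handful of expression profiles. Everything in it is an assumption made for illustration: the gene names and expression values are invented, the function names are hypothetical, and k-means is only one of many possible methods. Note also that the number of clusters (k = 3) is fixed by hand here, which sidesteps the harder problem of learning the number of groups from the data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_init(points, k):
    """Deterministic start: take the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(nxt)
    return [list(c) for c in centers]

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and moving each center to the mean of its members."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Invented expression levels for six hypothetical genes in four conditions.
profiles = {
    "geneA": [1.0, 1.2, 5.0, 5.1],
    "geneB": [0.9, 1.1, 4.8, 5.3],
    "geneC": [5.0, 4.9, 1.0, 0.8],
    "geneD": [5.2, 5.1, 1.1, 1.0],
    "geneE": [3.0, 3.1, 2.9, 3.0],
    "geneF": [3.1, 2.9, 3.0, 3.2],
}

names = list(profiles)
labels, centers = kmeans([profiles[n] for n in names], k=3)
clusters = dict(zip(names, labels))  # gene name -> cluster index

for name in names:
    print(name, "-> cluster", clusters[name])
```

Genes with similar profiles across the four conditions end up with the same cluster index. The indices themselves carry no meaning; only the grouping does.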