Chapters 2, 3, and 4 review and introduce the mathematical formalism, probability theory, and statistics that are essential to understanding the modeling and machine learning approaches used in contemporary molecular biology. Chapters 5 and 6 then introduce the first real “machine learning” and nontrivial probabilistic models. It might sound a bit daunting that three chapters are needed to give the necessary background, but this is the reality of data-rich biology: analyzing molecular biology data is getting more and more complicated. I have done my best to keep it simple, use clear notation, and avoid tedious calculations.

You have probably already noticed that the book is organized by statistical models and machine learning methods and not by biological examples or experimental data types. Although this makes it harder to look up a statistical method to use on your data, I’ve organized it this way because I want to highlight the generality of the data analysis methods. For example, clustering can be applied to data as diverse as DNA sequences and brain images, and can be used to answer questions about protein complexes and cancer subtypes. Although I might not cover your data type or biological question specifically, I hope that once you understand a method, it will be relatively straightforward to apply it to your data.

Nevertheless, I understand that some readers will want to know that the book covers their type of data, so I’ve compiled a list of the molecular biology examples that I used to illustrate methods.


  • 1. Chapter 2—Single-cell RNA-seq data defies standard models
  • 2. Chapter 2—Comparing RNA expression between cell types for one or two genes
  • 3. Chapter 2—Analyzing the number of kinase substrates in a list of genes
  • 4. Chapter 3—Are the genes that came out of a genetic screen involved in angiogenesis?
  • 5. Chapter 3—How many genes have different expression levels in T cells?
  • 6. Chapter 3—Identifying eQTLs
  • 7. Chapter 4—Correlation between expression levels of CD8 antigen alpha and beta chains
  • 8. Chapter 4—GC content differences on human sex chromosomes
  • 9. Chapter 5—Groups of genes and cell types in the immune system
  • 10. Chapter 5—Building a tree of DNA or protein sequences
  • 11. Chapter 5—Immune cells expressing CD4, CD8 or both
  • 12. Chapter 5—Identifying orthologs with OrthoMCL
  • 13. Chapter 5—Protein complexes in protein interaction networks
  • 14. Chapter 6—Single-cell RNA-seq revisited
  • 15. Chapter 6—Motif finding with MEME
  • 16. Chapter 6—Estimating transcript abundance with Cufflinks
  • 17. Chapter 6—Integrating DNA sequence motifs and gene expression data
  • 18. Chapter 7—Identifying eQTLs revisited
  • 19. Chapter 7—Does mRNA abundance explain protein abundance?
  • 20. Chapter 8—SAG1 expression is controlled by multiple loci
  • 21. Chapter 8—mRNA abundance, codon bias, and the rate of protein evolution
  • 22. Chapter 8—Predicting gene expression from transcription factor binding motifs
  • 23. Chapter 8—Motif finding with REDUCE
  • 24. Chapter 9—Modeling a gene expression time course
  • 25. Chapter 9—Inferring population structure with STRUCTURE
  • 26. Chapter 10—Are mutations harmful or benign?
  • 27. Chapter 10—Finding a gene expression signature for T cells
  • 28. Chapter 10—Identifying motif matches in DNA sequences
  • 29. Chapter 11—Predicting protein folds
  • 30. Chapter 12—The BLAST homology detection problem
  • 31. Chapter 12—LPS stimulation in single-cell RNA-seq data