WHY IS THIS A BOOK?
I’ve been asked many times by students and colleagues: Can you recommend a book where I can learn the statistics that I need for bioinformatics and genomics? I’ve never been able to recommend one. Of course, current graduate students have access to all the statistics and machine learning reference material they could ever need via the Internet. However, most of it is written for a different audience and is impenetrable to molecular biologists. So, although all the formulas and algorithms in this book are probably easy to find on the Internet, I hope the book format will give me a chance to explain in simple and accessible language what it all means.
Historically speaking, it’s ironic that contemporary biologists should need a book to explain data analysis and statistics. Much of the foundational work in statistics was developed by Fisher, Pearson, and others out of a direct need to analyze biological observations. With the ascendancy of digital data collection and powerful computers, to say that data analysis has been revolutionized is a severe understatement. It is simply not possible for biologists to keep up with the developments in statistics and computer science that keep introducing newer and more sophisticated computer-enabled data analysis methods.
My goal is that the reader will be able to situate their molecular biology data (ideally data that result from experiments they have done themselves) in relation to analysis and modeling approaches that will allow them to ask and answer the questions in which they are most interested. This means that if the data really are just two lists of numbers (say, for mutant and wild type), they will realize that all they need is a t-test (or a nonparametric alternative if the data are badly behaved).
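As a rough illustration of the "two lists of numbers" case, the sketch below computes a Welch t statistic and, as one possible nonparametric alternative, a permutation p-value for the difference in means. It uses only the Python standard library. The function names and the measurements are made up for illustration, and the permutation test is a stand-in chosen here for simplicity rather than any particular method covered later. In practice a statistics package would also supply the p-value for the t-test itself.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (variances may differ)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the difference in means.

    Group labels are shuffled repeatedly; the p-value is the fraction of
    shuffles whose absolute mean difference is at least as large as the
    observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical measurements, invented for this example only.
wild_type = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]
mutant = [11.0, 11.4, 10.9, 11.2, 11.5, 10.8]

t = welch_t(mutant, wild_type)
p = permutation_p_value(mutant, wild_type)
print(f"Welch t = {t:.2f}, permutation p = {p:.4f}")
```

The permutation test makes no assumption that the measurements are normally distributed, which is why it can serve when the data are badly behaved; its cost is that the smallest achievable p-value is limited by the sample size and the number of shuffles.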
In most practical cases, however, the kinds of questions that molecular biologists are asking go far beyond telling if mutant is different from wild type. In the information age, students need to quantitatively integrate their data with other datasets that have been made publicly available; they may have done several types of experiments that need to be combined in a rigorous framework.
This means that, ideally, a reader of this book will be able to understand the sophisticated statistical approaches that have been applied to their problem (even if they are not covered explicitly in this book) and, if necessary, they will have the tools and context to develop their own statistical model or simple machine learning method.
As a graduate student in the early 2000s, I also asked my professors for books, and I was referred by Terry Speed, a statistical geneticist (Dudoit 2012), to a multivariate textbook by Mardia, Kent, and Bibby, which I recommend to anyone who wants to learn multivariate statistics. It was at that time that I first began to see statistics as more than an esoteric collection of strange “tests” named after long-dead men. However, Mardia et al. is from 1980 and is out of date for modern molecular biology applications. Similarly, I have a copy of Feller’s classic book that my PhD supervisor Mike Eisen once gave to me, but this book really isn’t aimed at molecular biologists: P-value isn’t even in the index of Feller. I still can’t recommend a book that explains what a P-value is in the context of molecular biology. Books are either way too advanced for biologists (e.g., Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning or MacKay’s Information Theory, Inference, and Learning Algorithms), or they are out of date with respect to modern applications. To me, the most useful book is Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin et al. (1998). Although that book is focused on bioinformatics, I find its mix of theory and application exceptionally useful, so much so that it is still the book that I (and my graduate students) read 15 years later. I am greatly indebted to that book, and I would strongly recommend it to anyone who wants to understand HMMs.
In 2010, Quaid Morris and I started teaching a short course called “ML4Bio: statistical modeling and machine learning for molecular biology” to help our graduate students get a handle on data analysis. As I write this book in 2016, it seems to me that being able to do advanced statistical analysis of large datasets is the most valuable transferable skill that we are teaching our bioinformatics students. In industry, “data scientists” are tasked with supporting key business decisions and get paid big $$$. In academia, people who can formulate and test hypotheses on large datasets are leading the transformation of biology into a data-rich science.
REFERENCES AND FURTHER READING
Dudoit S. (Ed.). (2012). Selected Works of Terry Speed. New York: Springer.
Durbin R, Eddy SR, Krogh A, Mitchison G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1st edn. Cambridge, U.K.: Cambridge University Press.
Feller W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd edn. New York: Wiley.
Hastie T, Tibshirani R, Friedman J. (2009). The Elements of Statistical Learning. New York: Springer.
MacKay DJC. (2003). Information Theory, Inference, and Learning Algorithms, 1st edn. Cambridge, U.K.: Cambridge University Press.
Mardia K, Kent J, Bibby J. (1980). Multivariate Analysis, 1st edn. London, U.K.: Academic Press.