Regression aims to model the statistical relationship between two or more variables. For example, regression is a powerful way to test for and model the relationship between genotype and phenotype. Contemporary data analysis methods for genome-wide association studies (GWAS) and expression quantitative trait loci (eQTLs) rely on advanced forms of regression (known as generalized linear mixed models) that can account for complex structure in the data due to the relatedness of individuals and technical biases. Regression methods are used extensively in other areas of biostatistics, particularly in statistical genetics, and are often used in bioinformatics as a means to integrate data for predictive models.
In addition to its wide use in biological data analysis, I believe regression is a key area to focus on in this book for two pedagogical reasons. First, regression deals with the inference of relationships between two or more types of observations, which is a key conceptual issue in all scientific data analysis applications, particularly when one observation can be thought of as predictive or causative of the other. Because classical regression techniques yield straightforward statistical hypothesis tests, regression allows us to connect one type of data to another, and can be used to compare large datasets of different types. Second, regression is the area where the evolution from classical statistics to machine learning methods can be illustrated most easily, through the development of penalized likelihood methods. Thus, studying regression can help students understand developments in other areas of machine learning by analogy, without needing to know all the technical details.