PIPELINES TO ANALYZE “OMICS” DATA
High-throughput technologies allow researchers to monitor, in parallel, the expression levels of thousands of genes or the strength of DNA methylation at a vast number of CpG sites. However, raw “omics” data are usually noisy. For instance, both the Illumina Infinium HumanMethylation450 BeadChip (450K methylation array) and the Infinium MethylationEPIC BeadChip (850K methylation array) have different probe types, each using different chemistry. The process of bisulphite conversion of DNA, chip-to-chip variation, and other steps introduce assay variability and batch effects. Thus, before we start analyzing the data, quality control and data preprocessing are needed. Commonly applied pipelines generally start with quality control and preprocessing (Figure 1.1). Sometimes preprocessing is omitted if applying it would potentially remove biological relevance.
In this book, the focus is on genome-scale gene expression data and DNA methylation data. Quality control and data preprocessing are introduced and discussed in Chapters 2 and 3, along with a brief review of the data generating process. Due to the high dimensionality of the data, it is necessary to exclude uninformative loci, i.e., loci potentially not associated with the study of interest. Doing so can dramatically improve statistical power in the detection of associations or the identification of biological markers. However, screening can produce false negatives, that is, it can exclude loci that are actually informative. The methods and examples for screening are discussed in Chapter 4, which covers methods that utilize training and testing data, surrogate variables, and an approach of sure independence screening. All these approaches are built upon regressions.
Figure 1.1 An example of the pipeline to analyze “omics” data. The step of preprocessing can be omitted if this step potentially removes biological relevance.
Data mining is an important step to identify underlying features and patterns in the data. Findings from this step will substantially benefit subsequent in-depth analyses. Commonly used data mining techniques include cluster analyses, factor analyses, and principal component analyses. We focus on cluster analyses in this book. Compared to factor and principal component analyses, results from cluster analyses allow researchers to concretely visualize the profile of each cluster and identify what makes each cluster unique. In addition, interpretation of results from cluster analyses is relatively straightforward. Cluster analyses are discussed in Chapter 5, including classical approaches, such as partitioning-based methods and hierarchical clustering, and joint clustering methods in which the clustering is two-dimensional.
Often, after screening or data mining, in-depth analyses are conducted. Standard approaches such as linear regressions or generalized linear regressions are commonly used. However, even after screening,
the number of loci (or variables) can still be large. Thus, efficiently detecting markers of an exposure or for a health outcome is critical in medical research. To this end, accompanied by concrete examples, this book focuses more on methods to detect markers or select important variables. Chapters 6 and 7 introduce variable selection techniques in linear and non-linear models, and Chapter 8 discusses methods and examples for network construction and comparison. In some studies, DNA methylation is treated as a mediator between an exposure and a phenotype of interest. In this case, mediation analyses via path analyses can be applied. Mediation analysis is not covered in this book, but recent studies have proposed methods to assess such mediation effects.
Throughout the book, we utilize simulated data as well as real gene expression and DNA methylation data to demonstrate the analytical methods. All the programs in this book have been tested in R version 3.5.3 and/or 3.6.1. In the following sections of this chapter, we briefly introduce each real data set.
RNA-Seq GENE EXPRESSION IN S2-DRSC CELLS
Brooks et al. conducted a study aiming to explore the conservation of the splicing code between distantly related organisms, in particular, between Drosophila and mammals. The authors combined RNAi and RNA-Seq to identify exons regulated by Pasilla, the Drosophila melanogaster ortholog of mammalian NOVA1 and NOVA2. To identify regulatory targets of Pasilla, S2-DRSC cells were treated with a 444 bp dsRNA fragment corresponding to the ps mRNA sequence. Untreated S2-DRSC cells were cultured in biological triplicate to serve as a control. The RNA-Seq data for the treated and untreated cells and related information are available from the Gene Expression Omnibus (GEO) database under accession numbers GSM461176-GSM461181. The data are also available at https://figshare.com/s/e08e71c42f118dbe8be6. The reads were aligned to the Drosophila reference genome and summarized at the gene level. This RNA-Seq data set is utilized to demonstrate the methods and packages for clustering in Chapter 5.
MICROARRAY GENE EXPRESSION IN YEAST CELLS AND IN PROSTATE SAMPLES
Zhao et al. examined three microarray gene expression data sets across the yeast cell cycle and identified 254 genes that are periodic in at least two data sets. Expressions of genes with periodicity were further examined in the studies by [127, 114] to demonstrate their developed methods and R packages for clustering. They extracted expression data of 256 genes collected at the first 16 time points with 7-minute intervals. As noted in Qin et al., expressions of these 256 genes are cell cycle dependent. A subset of the data with 64 genes is analyzed in Chapter 5.
We also examined another microarray gene expression data set discussed in Singh et al. The raw data set has Affymetrix expressions of 52 tumoral and 50 non-tumoral prostate samples. A set of preprocessing steps was applied, including setting thresholds at 10 and 16,000 units, excluding genes with expression variation less than 5-fold relatively or less than 500 units absolutely between samples, applying a base 10 log-transformation, and standardizing each experiment to zero mean and unit variance across the genes. In the package depthTools, expressions of 100 genes were included, representing the most variable genes in expression as noted in Dudoit et al. The variation was measured as the ratio of the between-group to within-group sum of squares in expression of genes. This data set is discussed in Chapter 8.
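The preprocessing steps and the variability measure described above can be sketched as follows. This is a minimal illustration in Python (the book's own programs are written in R), not the code used to build the depthTools data. The function names `preprocess` and `bss_wss_ratio` are hypothetical, and two details are assumptions rather than statements from the source: variation between samples is taken as max/min (relative) and max - min (absolute) of the thresholded values, and the population standard deviation is used for standardization.

```python
import math
from statistics import mean, pstdev


def preprocess(expr, floor=10.0, ceiling=16000.0, min_fold=5.0, min_diff=500.0):
    """Apply the four steps to `expr`, a list of genes, each a list of raw
    intensities (one value per sample). Returns (kept_gene_indices, matrix)."""
    # Step 1: threshold intensities to the interval [floor, ceiling].
    clipped = [[min(max(v, floor), ceiling) for v in gene] for gene in expr]

    # Step 2: keep a gene only if it varies at least `min_fold`-fold AND by at
    # least `min_diff` units between samples. Assumption: variation is measured
    # as max/min and max - min of the thresholded values.
    kept = [i for i, g in enumerate(clipped)
            if max(g) / min(g) >= min_fold and max(g) - min(g) >= min_diff]

    # Step 3: base 10 log-transformation of the retained genes.
    logged = [[math.log10(v) for v in clipped[i]] for i in kept]

    # Step 4: standardize each experiment (column) across the retained genes.
    # Assumption: population standard deviation; a constant column is set to 0.
    columns = []
    for col in zip(*logged):
        m, s = mean(col), pstdev(col)
        columns.append([(v - m) / s if s > 0 else 0.0 for v in col])
    matrix = [list(row) for row in zip(*columns)]
    return kept, matrix


def bss_wss_ratio(values, labels):
    """Ratio of between-group to within-group sum of squares for one gene."""
    overall = mean(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    bss = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    wss = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return bss / wss if wss > 0 else math.inf


# Toy data (made up for illustration): 3 genes measured in 3 samples.
raw = [[5, 20000, 100], [100, 120, 110], [50, 400, 900]]
kept, z = preprocess(raw)
print(kept)  # [0, 2]: the second gene varies less than 5-fold and is dropped
print(bss_wss_ratio([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1]))  # 13.5
```

Genes would then be ranked by `bss_wss_ratio`, with the largest ratios indicating expression that differs most between groups relative to the variation within groups.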
DNA METHYLATION IN NORMAL AND COLON/RECTAL ADENOCARCINOMA SAMPLES
DNA methylation of 38 matched pairs is available in the Cancer Genome Atlas (TCGA) data repository. Among the 76 samples, 38 have colon and rectal adenocarcinoma. This data set is available in the DMRcate package, which is used to identify differentially methylated regions. In this book, we utilize this data set to demonstrate methods in marker detection and variable/feature selection (Chapters 6 and 7).
Genome-scale gene expression data
For genetic data, we mainly focus on gene expression data produced via Sanger sequencing and next generation sequencing. For epigenetic data, which will be discussed in the next chapter, we focus on DNA methylation of CpG sites. Single nucleotide polymorphisms in genome-wide association studies will not be discussed.
MICROARRAY GENE EXPRESSION DATA
Different techniques are available to measure genome-scale gene expressions, for example, cDNA spotted arrays and oligonucleotide arrays (Figure 2.1). The cDNA spotted array is the first type of DNA microarray technology, developed in the Brown and Botstein Labs at Stanford University. It is produced by using a robotic device, which deposits a library of thousands of distinct cDNA clones onto a coated microscope glass slide surface in serial order with a distance of approximately 200-250 μm from each other, one spot for one gene. These moderate-sized glass cDNA microarrays bear about 10,000 spots or more on an area of 3.6 cm². Then mRNA samples or targets from two groups (e.g., treatment and control samples) are extracted, separately reverse transcribed into cDNA, and labeled with different fluorescent dyes (e.g., red color Cyanine-5 or Cy5 and green color Cyanine-3 or Cy3). The mixture of these labeled cDNAs is hybridized onto the microarray, competing to bind to the cDNA probes. After hybridization, the slides are imaged using a scanner or a charge-coupled device camera to obtain fluorescence intensities for each dye at each spot on the array. The ratio of red and green fluorescence intensities for each spot is expected to indicate the relative abundance of the corresponding molecule in the two target samples.
Figure 2.1 The platforms of cDNA spotted arrays and oligonucleotide arrays (adopted from Wilson et al.).
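The red-to-green intensity ratio for each spot is commonly reported on the log2 scale, so that a two-fold increase and a two-fold decrease are symmetric around zero. The following minimal Python sketch illustrates this; the function name `spot_log_ratios` and the toy intensities are hypothetical, and the intensities are assumed to be positive (for example, already background-corrected).

```python
import math


def spot_log_ratios(red, green):
    """Return log2(red / green) for each spot on a two-color array.

    `red` and `green` are equal-length lists of positive fluorescence
    intensities (assumed to be background-corrected already).
    """
    if len(red) != len(green):
        raise ValueError("red and green must have one intensity per spot")
    return [math.log2(r / g) for r, g in zip(red, green)]


# Toy intensities (made up): spot 1 is four-fold higher in the red channel,
# spot 2 is four-fold lower, and spot 3 is unchanged.
print(spot_log_ratios([800.0, 100.0, 250.0], [200.0, 400.0, 250.0]))
# [2.0, -2.0, 0.0]
```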
The oligonucleotide array technology has been commonly used to measure genome-scale expression levels. The technology was first developed by Fodor et al. Affymetrix GeneChip arrays further pioneered this technology and produce high density oligonucleotide-based DNA arrays. The basic principle of manufacturing Affymetrix’s GeneChips is the use of photolithography and combinatorial chemistry to synthesize short single strands of DNA on 5-inch square quartz chips. Unlike spotted cDNA arrays, the genes on the chip are designed based on sequence information alone. Each gene is represented by multiple short probes used to measure gene expression levels. Specifically, 11 to 20 perfect match (PM) and mismatch (MM) probe pairs are used to represent each gene, and PM-MM intensity differences are averaged over all probe pairs in a probe set to index the expression level of each target gene.
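The averaging of PM-MM differences over a probe set can be sketched as below. This is only a plain average as described in the text, written in Python for illustration; it is not Affymetrix's production algorithm, and the function name `average_difference` and the toy intensities are hypothetical.

```python
from statistics import mean


def average_difference(pm, mm):
    """Average the PM - MM intensity differences over one probe set.

    `pm` and `mm` are equal-length lists: the i-th entries form one
    perfect match / mismatch probe pair for the same target gene.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean([p - m for p, m in zip(pm, mm)])


# Toy probe set with 3 pairs (a real probe set has 11 to 20 pairs).
print(average_difference([1200.0, 900.0, 1500.0], [200.0, 300.0, 400.0]))
# 900.0
```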
In microarray analyses, regardless of the platform, typically either a one-color or a two-color design is used to measure mRNA abundance. A one-color design involves the hybridization of a single sample to each microarray after it has been labeled with a single fluorophore (for example, phycoerythrin, Cy3 or Cy5). In a two-color design, two samples (e.g., experimental and control) are labeled with different fluorophores (usually Cy3 and Cy5 dyes) and hybridized together on a single microarray. Although two-color designs have the potential to introduce bias and larger variation, techniques such as dye-reversed replicates (dye swaps or fluorophore reversals) can substantially improve the accuracy and sensitivity of gene expression measures. Compared to two-color designs, the advantage of one-color designs lies in their simplicity and flexibility. Furthermore, via biological and technical replicate assays, one-color designs can reduce data inconsistency across assays due to multiple sources of variability such as handling and processing.
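To see why dye-reversed replicates help, suppose the dye effect adds the same bias to the log2(Cy5/Cy3) ratio on both arrays of a dye-swap pair. This additive-bias model is an assumption made here for illustration. Under it, half the difference between the forward and reversed log ratios recovers the experimental-versus-control log ratio, and half their sum estimates the dye bias. A minimal Python sketch, with the hypothetical function name `combine_dye_swap`:

```python
def combine_dye_swap(m_forward, m_reversed):
    """Combine one gene's log ratios from a dye-swap pair.

    m_forward:  log2(Cy5 / Cy3) with the experimental sample labeled Cy5.
    m_reversed: log2(Cy5 / Cy3) with the labels swapped (control labeled Cy5).
    Assumes the dye bias is additive and identical on both arrays.
    Returns (bias-corrected log ratio, estimated dye bias).
    """
    log_ratio = (m_forward - m_reversed) / 2.0
    dye_bias = (m_forward + m_reversed) / 2.0
    return log_ratio, dye_bias


# Toy values (made up): true log ratio 1.0, dye bias 0.3, so the forward
# array reads 1.0 + 0.3 and the reversed array reads -1.0 + 0.3.
log_ratio, dye_bias = combine_dye_swap(1.3, -0.7)
print(round(log_ratio, 6), round(dye_bias, 6))  # 1.0 0.3
```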