# Screening genome-scale genetic and epigenetic data

In Chapters 2 and 3. we introduced different types of data at the genome scale. Given the feature of their high dimensionality, directly analyzing data at the genome scale with complex analytical methods may lead to substantially under-powered results. Depending on research goals, screening genome-scale data based on their associations with a phenotype or health outcome will enable us to target at potentially important genes or DNA methylation (CpG) sites. For instance, for studies with smoke exposure as a major risk factor, it has been shown that methylation is affected by some known factors such as smoking [82]. CpG sites not influenced by such exposure status potentially will only bring noise into subsequent analyses and thus reduce statistical power. On the other hand, CpGs with DNA methylation not associated with a health outcome may not be of great interest either. Screening genetic and epigenetic variables or features at the genome scale has become an overwhelmingly important step in multiple fields of genetic and epigenetic studies such as cancer, obesity, and allergic diseases.

We focus on screening methods that are built upon associations with some post-hoc adjustment for multiple comparisons. We denote factors, such as smoke exposure status or a health outcome, used to exclude potentially unimportant genetic or epigenetic variables as “screening factors”. Most commonly used screening approaches built upon associations are through regressions. In this chapter, we introduce two directions of development in this filed; one mainly relies on training and testing data using the concept of cross validation, and the other one mainly focuses on correlations.

## SCREENING VIA TRAINING AND TESTING SAMPLES

When screening factors in regressions are included as independent variables, a primary limitation lies in effectively controlling for multiple testing. Two popular adjustment methods are the Bonferroni-based method [31, 32] and the Benjamini-Hochberg method for controlling the false discovery rate (FDR) [66, 7]. These methods alter the p-value or critical value to control for type I error. Bonferroni correction is the most conservative approach. An adjusted p-value is calculated by multiplying the linear regression p-value by the total number of comparisons (m) and variables with adjusted p-values less than the significance level a are treated as being statistically significant. The FDR method first orders the p-values, *P{k)* for *к* G l...m, such that lower ordered p-values that are less than or equal to *k/m* x *a* lead to rejection of the null hypothsis

[7]. It follows that the conservative Bonferroni-based method is not able to effectively control type II error while the FDR-based method cannot control type I error as desired.

In this section, we introduce a recently developed method that has the potential to control both types I and II errors. It utilizes existing statistical techniques to screen genome-scale variables. The method is built upon training and testing data. Even with data generated under the same mechanisms in the same population, associations between genetic and epigenetic variables and screening factors may vary from one data set to another, which directly impacts type I error rates, and indirectly leads to a loss of statistical power in subsequent analyses. Using training and testing data will improve the reproducibility of selected variables, and earn the ability to control for both types I and II errors [118].

This approach starts from randomly dividing the original data into two parts as training data and testing data, respectively. Following literature suggestions, in general, 2/3 of the data are included in the training data set to maximize statistical power. [27] Next, linear regressions are applied to the training data to calculate the p-values for the association between a variable (dependent variable) and screening factors (independent variables), for instance, DNA methylation at a CpG site and smoke exposure status. Ordinary least squares or robust regressions can be applied to infer the parameters. Compared to linear regressions, robust regressions have more relaxed assumptions about normality and presence of outliers in the data. A variable, e.g., a CpG site, is included as a candidate variable if the screen factor(s) of interest is statistically significant according to a pre-specified significance level, e.g., 0.05. A set of candidate variables are then selected by use of training data.

The candidate variables are further tested using the testing data with linear regressions. For one pair of training and testing data sets, a candidate variable is deemed as being important if the significance still holds in the testing data. This screening process will be repeated for a certain number of times, and at each time a pool of candidate variables are selected based on one randomly selected set of training and testing data.

The screening process is summarized in Figure 4.1 (the solid arrows). A variable is treated as being potentially important if it is selected by at least *m%* of the randomly selected sets of training and testing data. The value of *m* is pre-specified. A suggestion of choosing *m* has been given in the literature [118]. Basically, taking *m* close to 50 works the best with significance level of 0.05 for both the training and testing steps. If a higher significance level is chosen in the testing step, then a higher value for cutoff percentage should be used.

The R package, ttScreening, can be used to implement the above approach. It is originally designed for epigenetic data, in particular, DNA methylation data. The package can be applied to other genetic or epigenetic data as long as their measurements can be treated as being continuous. Two data sets are used as input. One data set is at the genome scale such that each row represents measures at one locus across all subjects. The other data set is composed of measurements of one or more risk factors or health outcomes. Since this approach is closely related to the method introduced in the next section, we hold our discussions on the implementation of this package till then.

## SCREENING INCORPORATING SURROGATE VARIABLES

Implementation of training and testing data introduces a potential to control both types I and II errors [118]. However, other issues arise when performing screening based on associations. Variations in data can be explained by some known factors but also by some unknown factors [93]. Accounting for variations introduced by unknown factors will potentially improve screening quality and efficiency. The method discussed below is an extension of that introduced in Section 4.1. It incorporates surrogate variable analysis [93], which identifies unknown latent variables, in conjunction with the training and testing approach noted earlier [27].

This approach consists of two consecutive components, surrogate variable analysis followed by utilization of the screening method discussed in Section 4.1. Surrogate variable analysis (SVA) aims to identify and estimate latent factors or surrogate variables (SVs) that potentially affect the association of a variable with a screening factor or screening factors, e.g., DNA methylation (variable) and smoke exposure status (a known screening factor) [93]. Including surrogate variables into the screening process has the potential to reduce unexplained variations in the variable, adjust for confounding effects, consequent 1}', improve the accuracy of screening in terms of importance of variables that pass the screening [93].

Figure 4.1 The flow chart of the screening algorithm that involves surrogate variables, adapted from Ray et al. [118].

Surrogate variables are inferred prior to screening using the method developed by Leek and Storey [93] (Figure 4.1, dashed arrows). These SVs are developed by removing the amount of variation in the variables due to screening factors and then decomposing the remaining residuals to identify an orthogonal basis of singular vectors that can be reproduced. These vectors are further examined for significant variation to form surrogate variables. Leek and Storey built an R package to perform

SVA. The first step in SVA is to identify the number of surrogate variables based on the data using one of two methods, an approach based on a permutation procedure originally proposed by Buja and Eyuboglu in 1992 [15], and the other provides an interface to the asymptotic approach proposed by Leek in 2011 ([91]). Once the number of surrogate variables is calculated, they are then estimated using one of three algorithms, the iteratively re-weighted, supervised, or two-step method. The iteratively re-weighted method is for empirical estimation of unknown factors, the supervised method is for the situation that control probes (i.e., probes such that genes are unlikely to be differential expressed) are known, and the two-step method estimates surrogate variables based on the subset of rows affected by unmodeled dependence [92]. Using this approach, conditional on the data, a number of latent unknown variables will be identified and estimated. These latent variables along with the known screening factors will be included in the training-and-testing-based method to detect potentially important variables. This approach has been built into the ttScreening.

In the following, we apply the ttScreening package to a simulated data set which has 2,000 variables and 2 observable screening factors with measures from 300 subjects. Of these 2,000 variables, the hist 100 are important variables. We start from simulating a data set with the settings noted above with the two variables simulated from normal distributions using rmvnorm.

> library(mvtnorm)

> # 300 subjects, the first 100 are important variables,

+ and 2000 candidate variables

> nsub = 300

> imp = 100

> num = 2000

> set.seed(l)

> # two observable screening factors

> meanx = c(l,l)

> Covx = diag(c(l,l))

> x = rmvnorm(nsub, meanx, Covx)

In addition, we simulate five “latent” screening factors, which will be included in the generation of the 2, 000 variables.

> # five latent screening factors

> surMean = c(0,3,0,2,0)

> CovSur = diag(rep(0.5,5))

> sur = rmvnorm(nsub, surMean, CovSur)

As for the association of variables with the screening factors *x,* we consider three settings such that some variables will be associated with both screening factors (Designl and Design2) but others only associated with the “latent” screening factors (DesignSur).

> Designl = as.matrix(cbind(rep(l,nsub),x,x[,l]*x[,2],sur))

> Design2 = as.matrix(cbind(rep(1,nsub),x[,1],x[,1]*x[,2],sur))

> DesignSur = as.matrix(sur)

The following several lines specify regression coefficients corresponding to each design,

> beta = c(0.5, rep(0.4,3))

> sbeta = rmvnorm(l,rep(0.5,5),diag(rep(0.01,5)))

>

> # beta matrix defining different associations with

> # observable and latent screening factors

> betaDesignl = as.matrix(c(beta,sbeta))

> betaDesign2 = as.matrix(c(beta[c(l,2,4)],sbeta))

> betaSur = as.matrix(t(sbeta))

Finally, with the design matrices and regression coefficients specified, data, y. for the 2,000 variables are then simulated using a multivariate normal distribution.

> impl.mu = matrix(Designl'/,*"/,betaDesignl ,nrow = nsub.ncol = (imp*0.9))

> imp2.mu = matrix(Design2'/,*"/,betaDesign2,nrow = nsub.ncol = (imp*0.1))

> noimp.mu = matrix(DesignSur‘/.*'/,betaSur,nrow = nsub,ncol = num-imp)

> mu.matrix = cbinddmpl.mu, imp2.mu, noimp.mu)

> CovErr = diag(rep(0.5,num)>

> error = rmvnorm(nsub,mean = rep(0,num),sigma = CovErr.method = "chol")

> у = t(mu.matrix+error)

We next apply ttScreening to screen variables based the interaction between the two screening factors in x. First, we only utilize training and testing data without inclusion of surrogate variables.

> runs = ttScreeningCy = y, formula = ~x[,1]+x[,2]+x[,1]:x[,2],

+ imp.var = 3, data = data.frame(x),

+ B.values = FALSE, iterations = 100, cv.cutoff = 50,

+ n.sv = 0, train.alpha = 0.05, test.alpha = 0.05,

+ FDR.alpha = 0.05, Bon.alpha = 0.05, percent = (2/3),

+ linear = "Is", vfilter = NULL, В = 5,

+ numSVmethod = "be", rowname = NULL.maxit = 20)

[1] "No surrogate variables used in the analysis"

>

> length(runs$TT.output[,1])

[1] 95

> length(runs$FDR.output[,1])

[1] 103

> length(runs$Bon.output[,1])

[1] 84

> sum(runs$TT.output [, 1]7,in°/_{t} seq(l,imp))

[1] 95

> sum(runs$FDR. output [, l]7_{0}in°/_{0} seq(l,imp))

[1] 100

> sum(runs$Bon. output [, l]7oin7_{t} seq(l,imp))

[1] 84

In the above function, ordinary least square regressions, set by linear = "Is", are used for the screening, *у* is the genome-scale data, x [, 1] and x [, 2] represent two screening factors, and imp. var=3 denotes that the screening is with respect to the interaction effects between the two factors. We can also choose to screen variables based on both x[, 1] and x [, 2]. in which case we set imp. var=c (1,2).

The argument iteration sets the number of times to repeat the training and testing procedure and cv.cutoff is used to set the minimum frequency required for a variable to be treated as an important variable. The higher the frequency, the more likely the variable is informative. The default value of cv.cutoff is 50. The two arguments train.alpha and test.alpha define the statistical significance levels used in the training and testing data, respectively, and their default values are both 0.05. We use percent to define the size of training data, and, in the above example, data from 2/3 of the subjects will be used.

Finally, argument n.sv=0 is used to indicate that the screening will not include any surrogate variables but only observable screening factors, i.e., x[,l] and x[,2]. Selected variables will be stored in the object TT.output along with frequencies for each selected variable. The first column of TT. output contains the variable names that pass the screening. In this example, 95 variables are selection and all these 95 variables are among the 100 important variables. The ttScreening package also provides users with access to other screening methods: FDR- and Bonferroni-based methods, and output can be accessed via FDR.output and Bon.output, respectively, with similar structure compared to TT.output (except for selection frequencies). The FDR- based approach selected 103 variables with 3 false positive selections.

The Bonferroni approach selected 84 and all of them are among the 100 important ones.

To include surrogate variables in the screening process, we set n.sv=NULL. The argument numSVmethod is to specify a method used to determine the number of surrogate variables, and "be" is the method by Buja and Eyuboglu [15]. The other approach is proposed by Leek and labeled as "leek" [91]. To specify the method used to estimate surrogate variables, sva.method is used, and in this example, the two-step approach is selected. The specifications of the other two approaches are "irw" for the iteratively re-weighted approach, and "supervised" for the supervised approach.

> library(ttScreening)

> runs = ttScreeningCy = y, formula = ~x[,1]+x[,2]+x[,1]:x[,2],

+ imp.var = 3, data = data.frame(x), sva.method = "two-step",

+ B.values = FALSE, iterations = 100, cv.cutoff = 50,

+ n.sv = NULL, train.alpha = 0.05, test.alpha = 0.05,

+ FDR.alpha = 0.05, Bon.alpha = 0.05, percent = (2/3),

+ linear = "Is", vfilter = NULL, В = 5,

+ numSVmethod = "be", rowname = NULL,maxit = 20)

Number of significant surrogate variables is: 1 [1] "1 surrogate variables used in the analysis"

>

> length(runsSTT.output[,1])

[1] 101

> length(runs$FDR.output[,1])

[1] из

> length(runs$Bon.output[,1])

[1] 100

> sum(runs$TT.output [, 1]'/.in'/, seq(l,imp))

[1] 100

> sum(runs$FDR. output [, l]7_{0}in°/_{0} seq(l,imp))

[1] 100

> sum(runs$Bon. output [, l]7oin7_{t} seq(l,imp))

[1] 100

With the inclusion of surrogate variables, the results are improved for the method in ttScreening and the Bonferroni approach, especially the latter. False positive selections of FDR-based approach gets more severe.