# EXAMPLES

In this section, we demonstrate the four approaches introduced in this chapter by applying them to the epigenetic data discussed in Chapter 6, where variables are screened based on their linear associations with the dependent variable, as well as to simulated data sets where we know the underlying truth.

## Selecting important epigenetic factors

Recall that the following code is used to extract the data and perform screening to identify CpGs potentially associated with a marker of interest, i.e., DNA methylation at CpG site cg00493554.

> library(MASS)

> library(IlluminaHumanMethylation450kmanifest)

> library(IlluminaHumanMethylation450kanno.ilmn12.hg19)

> library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

> library(IlluminaHumanMethylationEPICmanifest)

> library(prettyunits)

> library(bumphunter)

> library(DMRcate)

> data(dmrcatedata)

> DNAm = logit2(myBetas)

> dim(DNAm)

[1] 10042 76

> DNAmNoProbSNPs = rmSNPandCH(DNAm, dist = 10, mafcut = 0.05)

> dim(DNAmNoProbSNPs)

[1] 9126 76

> CpGs = rownames(DNAmNoProbSNPs)

> ID = factor(sub("-.*", "", colnames(DNAm)))

> Status = factor(sub(".*-", "", colnames(DNAm)))

> library(ttScreening)

> runs = ttScreening(y = DNAmNoProbSNPs, formula = ~Status, data = data.frame(Status),

+ imp.var = 1, sva.method = "two-step",

+ B.values = FALSE, iterations = 100, cv.cutoff = 50,

+ n.sv = NULL, train.alpha = 0.05, test.alpha = 0.05,

+ FDR.alpha = 0.05, Bon.alpha = 0.05, percent = (2/3),

+ linear = "ls", vfilter = NULL, B = 5,

+ numSVmethod = "be", rowname = NULL,maxit = 20)

Number of significant surrogate variables is: 22

[1] "9 surrogate variables used in the analysis"

> out = runs$TT.output

> dim(out)

[1] 667 24

> maxFreq = max(out[,2])

> maxFreq

[1] 100

> rowMaxFreq = as.numeric(as.character((out[which(out[,2]

+ ==maxFreq),1])))

> y = t(DNAmNoProbSNPs[rowMaxFreq,])

> rownames(y) = CpGs[rowMaxFreq]

> rownames(y)

[1] "cg00493554"

> x = t(DNAmNoProbSNPs[-rowMaxFreq, ])

> colnames(x) = CpGs[-rowMaxFreq]

*Results from cosso* We start with the method by Lin and Zhang, implemented in the R package cosso. Since the variable selection approach of Lin and Zhang [99] requires the number of variables to be smaller than the sample size, we use the first 20 variables for demonstration. These 20 variables are the same as those discussed in Chapter 6. The function cosso is used to select the variables. Our goal is to identify DNA methylation sites such that DNA methylation at those sites is associated with DNA methylation at cg00493554. It is reasonable to use mean regression with a Gaussian distribution specified for the random errors; we thus set family="Gaussian" in the function cosso. We implement the following R code to select CpGs.

> newx = x[,1:20]

> y = as.vector(y)

> results = cosso(newx, y, family = "Gaussian")

> colnames(results$tune$L2norm) = colnames(newx)

> norms = results$tune$L2norm[(ncol(newx)+2),]

Along with other outputs, the object results provides L2-norms evaluating the contribution of each variable, extracted by results$tune$L2norm. All the L2-norms calculated during the tuning process are included in results$tune$L2norm. In the above, we use the last set of L2-norms to choose important variables via a scree plot. The R code is below.
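The elbow of the scree plot can also be located numerically from the sorted norms. The sketch below is purely illustrative: the norm values and CpG names are hypothetical stand-ins for results$tune$L2norm, and taking the largest consecutive drop as the break point is one simple convention, not part of cosso.

```r
# Hypothetical L2-norms standing in for results$tune$L2norm (illustration only)
norms <- c(cgA = 2.1, cgB = 1.8, cgC = 1.5, cgD = 0.2, cgE = 0.1)
sortedNorm <- sort(norms, decreasing = TRUE)

# The largest drop between consecutive sorted norms marks the scree "elbow"
drops <- -diff(sortedNorm)
cutoff <- which.max(drops)

# Variables above the elbow are retained
selected <- names(sortedNorm)[1:cutoff]
selected
# [1] "cgA" "cgB" "cgC"
```

Here the sharp drop after the third norm mimics the kind of pattern read off the scree plot for the epigenetic data.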

> names(norms) = colnames(results$tune$L2norm)

> sortedNorm = sort(norms,decreasing=TRUE)

> plot(sortedNorm,pch = 19,cex = 0.5,ylab = "Sorted L2-norm")

> lines(sortedNorm)

> names(sortedNorm[1:3])

[1] "cg00171166" "cg00168514" "cg00309582"

Based on the scree plot of the sorted L2-norms sortedNorm, three CpG sites are selected: cg00171166, cg00168514, and cg00309582.

*Results from FWDselect* We first use the function test to determine the number of variables to be included in the model.

> selectq = test(newx, y, method = "gam", family = "gaussian",

+ nboot = 50, speedup = TRUE, bootseed = 12345, cluster = TRUE)

[1] "Processing IC bootstrap for H_0 ( 1 )..."

  Hypothesis Statistic pvalue     Decision
1    H_0 (1)     15.86   0.06 Not Rejected

> selectq$nvar

[1] 1

In this function, generalized additive models (gam) are used to assess the associations, and 50 bootstrap samples are used to determine *q.* We set speedup=TRUE to increase computing efficiency. In this case, when testing each hypothesis, the alternative model considers only one more variable than the number of variables in the null hypothesis. Although this speeds up the testing process, wrong decisions can be made due to the limited set of models assessed. Setting cluster=TRUE indicates that the testing procedure is parallelized to further increase computing efficiency. Applying test to the 20 CpG sites, q = 1 is determined by calling selectq$nvar.
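The logic of test can be sketched with ordinary linear models standing in for the package's generalized additive models: fit the null model with q variables and an alternative with one more, and judge whether the drop in deviance is worth the extra variable. Everything below (the variable names, the use of lm, the informal comparison without a bootstrap) is an illustrative assumption, not the FWDselect implementation.

```r
set.seed(12345)
x1 <- runif(100)
x2 <- runif(100)
y <- 2 * x1 + rnorm(100, sd = 0.2)  # only x1 is truly associated with y

# Null model with q = 1 variable versus alternative with q + 1 = 2 variables
d1 <- deviance(lm(y ~ x1))       # residual sum of squares under H_0
d2 <- deviance(lm(y ~ x1 + x2))  # RSS after adding the noise variable x2

# A negligible drop supports not rejecting H_0(q = 1); test() formalizes
# this comparison with a bootstrap p-value
d1 - d2
```

Because x2 carries no signal, the deviance barely improves, which is the behavior that leads test to settle on q = 1.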

Next, given q = 1, we use selection to select the best variable from the 20 candidate variables.

> q = selectq$nvar

> results = selection(newx, y, q = q, prevar = NULL,

+ criterion = "deviance", method = "gam", family = "gaussian",

+ nmodels = 1, nfolds = 5, cluster = TRUE)

> results

Best subset of size q = 1 : cg00254133

Information Criterion Value - deviance : 8.511648

The information criterion deviance, calculated using 5-fold cross-validation (nfolds=5), is used to select the best models. Through this approach, CpG site cg00254133 is selected, which is not among the three CpG sites selected by cosso.

Alternatively, with similar settings, we can use qselection to identify a set of best models of different sizes, each determined by a pre-specified information criterion, e.g., deviance. Compared to the method in cosso, the method in FWDselect does not have good control over the number of knots used in the regression splines, which limits the number of variables allowed in a model. In the following program, the maximum number of variables is set at 8. After comparing the best models from size 1 to size 8, the model with one CpG site, cg00254133, is selected as the best model, indicated by * corresponding to the smallest deviance. This is the same result as that from using test and selection together.

> results = qselection(newx, y, qvector = c(1:8),

+ criterion = "deviance", method = "gam", family = "gaussian",

+ nfolds = 5, cluster = TRUE)

[1] "Selecting subset of size 1 ..."

[1] "Selecting subset of size 2 ..."

[1] "Selecting subset of size 3 ..."

[1] "Selecting subset of size 4 ..."

[1] "Selecting subset of size 5 ..."

[1] "Selecting subset of size 6 ..."

[1] "Selecting subset of size 7 ..."

[1] "Selecting subset of size 8 ..."

> results

  q deviance selection
1 1    8.512 cg00254133 *
2 2    8.793 cg00254133, cg00168514
3 3   12.860 cg00254133, cg00168514, cg00105628
4 4   39.136 cg00254133, cg00168514, cg00105628, cg00303672
5 5   67.228 cg00254133, cg00260888, cg00105628, cg00303672, cg00273813
6 6   23.415 cg00254133, cg00260888, cg00105628, cg00303672, cg00273813, cg00199007
7 7   91.933 cg00254133, cg00260888, cg00105628, cg00303672, cg00273813, cg00156569, cg00159400
8 8   27.257 cg00254133, cg00260888, cg00105628, cg00303672, cg00273813, cg00298324, cg00159400, cg00188822

*Results from the ANOVA-based approach* As with the method in cosso, the variable selection approach by Zambom and Akritas [153] does not allow the number of variables to be larger than the sample size. In the following, we use the same 20 variables.

> newx = x[,1:20]

> library(NonpModelCheck)

> out = npvarselec(newx, as.vector(y), method = "backward",

+ degree.pol = 1, kernel.type = "trun.normal",

+ bandwidth = "CV", dim.red = c(1, 10))

Iter. | Variables in the model
   1  | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
   2  | 1 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 18 19 20
   3  | 1 2 3 4 5 6 7 8 10 11 12 13 14 15 16 17 18 19
   4  | 1 2 3 4 5 6 7 8 10 11 12 13 14 16 17 18 19
   5  | 1 2 3 4 5 6 7 8 11 12 13 14 16 17 18 19
   6  | 1 2 3 4 5 6 7 8 11 13 14 16 17 18 19
   7  | 1 2 3 4 5 6 7 8 11 13 14 16 17 18
   8  | 1 2 4 5 6 7 8 11 13 14 16 17 18
   9  | 2 4 5 6 7 8 11 13 14 16 17 18
  10  | 2 4 5 6 7 8 11 13 14 16 18
  11  | 4 5 6 7 8 11 13 14 16 18
  12  | 4 5 6 7 8 11 13 14 18
  13  | 4 5 6 7 8 13 14 18
  14  | 4 6 7 8 13 14 18
  15  | 4 6 8 13 14 18
  16  | 4 6 13 14 18
  17  | 4 13 14 18
  18  | 13 14 18

> out

Number of Covariates Selected: 3

Covariate(s) Selected:

Covariate Index | p-value
             13 | 0.0053
             14 | 1e-10
             18 | 1.3e-09

> # selected variables

> colnames(x)[out[[1]]]

[1] "cg00246386" "cg00254133" "cg00298324"

The last two selected CpG sites, cg00254133 and cg00298324, are also selected by all the methods noted in Table 6.1 in Chapter 6. Applying the two forward selection approaches, forward and forward2, different variables are selected. In particular, with forward, only one CpG site, cg00105628, is selected, and with forward2, cg00105628, cg00246386, and cg00205332 are selected. With the exception of cg00246386, however, none of these CpGs is selected by the backward selection approach.

In the code above, kernel.type specifies the kernel used to fit local polynomial regressions; here the trun.normal kernel is selected, a Gaussian kernel truncated between -3 and 3. The argument bandwidth gives the bandwidth of the kernel and is determined by cross-validation via bandwidth="CV". The argument degree.pol specifies the order of the local polynomials used to fit the unknown associations; after increasing the degree, the results overall show no difference from degree.pol=1. The last argument dim.red performs dimension reduction to improve computing efficiency. It is a vector whose first element is 1 for sliced inverse regression (SIR) or 2 for supervised principal components (SPC); the second element gives the number of slices (if SIR) or the number of principal components (if SPC). Setting dim.red=0 performs no dimension reduction. We suggest not setting it to zero but choosing between SIR and SPC to increase computing speed.
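To make kernel.type and bandwidth concrete, the following sketch implements a Gaussian kernel truncated at ±3 and the corresponding local-constant (degree-0) estimate of the regression function at a point. It illustrates the ingredients only; the function names are ours, and the actual local polynomial fitting in npvarselec is more elaborate.

```r
# Gaussian kernel truncated between -3 and 3 (cf. kernel.type = "trun.normal")
trunNormal <- function(u) ifelse(abs(u) <= 3, dnorm(u), 0)

# Local-constant (Nadaraya-Watson) estimate of E[y | x = x0] with bandwidth h
nwEstimate <- function(x0, x, y, h) {
  w <- trunNormal((x - x0) / h)
  sum(w * y) / sum(w)
}

set.seed(1)
x <- runif(500)
y <- sin(2 * pi * x) + rnorm(500, sd = 0.1)

# At x0 = 0.25 the true regression function equals sin(pi/2) = 1
nwEstimate(0.25, x, y, h = 0.05)
```

In practice the bandwidth h would not be fixed by hand but chosen by cross-validation, which is what bandwidth="CV" requests.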

*Results from the reproducing kernel-based approach* We next apply the method proposed by Zhang et al. [160] to this epigenetic data set. We use source to call the function. The dependent variable Y is the same as before, DNA methylation at CpG site cg00493554. Since we do not have any covariates assumed to be linearly associated with the response variable, the design matrix for the linear part is a vector of 1's, and we set numCov=0. The regularization parameter *ρ* is set at 1 following the suggestions of Zhang et al. [160]. Arguments a0 and b0 are the hyperparameters in the prior distribution of the random error variance *σ²* in the regression model. These two parameters are chosen to be non-informative and are set at 0.001. Two chains of Markov chain Monte Carlo simulations are run, which can be used to assess convergence.

> source("VarSelKernelFunctionsContinuous.R")

> Y = as.vector(y)

> X = rep(1,length(Y))

> Candid = newx

> VarSelResults = VarSel(Y, X, Candid, rho = 1, numCov = 0,

+ numItr = 100, numChain = 2, a0 = 0.001, b0 = 0.001)

The method of Zhang et al. identifies three CpG sites, cg00105628, cg00254133, and cg00298324, such that DNA methylation at these three sites is possibly associated with DNA methylation at cg00493554. The estimated probabilities of variable selection, sorted in descending order, are stored in orderProp, as indicated in the code below. The estimated probability of selecting each of the top three CpG sites is 1, which is substantially larger than that of the fourth CpG site, implying the importance of these three sites. The three selected CpGs overlap with the ones selected using the three approaches in npvarselec.

> colnames(Candid)[VarSelResults$VarSelect]

[1] "cg00105628" "cg00254133" "cg00298324"

> VarSelResults$orderProp

Variable Index Post. Prob

[1,] 2 1.0000000

[2,] 14 1.0000000

[3,] 18 1.0000000

[4,] 5 0.1333333

[5,] 1 0.0000000

[6,] 3 0.0000000

In this example, the data are continuous. In the situation of a mixture of continuous and ordinal data, the method by Terry et al. [137] can be applied by first standardizing the data and then applying the program VarSelKernelFunctionsContinuous.R. Compared with the methods in cosso, FWDselect, and npvarselec, the methods by Zhang et al. and Terry et al. [160, 137] have the potential to handle large numbers of candidate variables, allowing the number of independent variables to exceed the sample size.
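A minimal sketch of the standardization step mentioned above for mixed data: center and scale each column, whether continuous measurements or ordinal scores, before passing the matrix to the variable selection program. The data here are simulated for illustration only.

```r
set.seed(1)
# Simulated mix: one continuous variable and one ordinal variable scored 1-4
mixed <- cbind(cont = rnorm(50, mean = 5, sd = 2),
               ordinal = sample(1:4, 50, replace = TRUE))

standardized <- scale(mixed)  # each column centered to mean 0, scaled to sd 1
round(colMeans(standardized), 10)
apply(standardized, 2, sd)
```

After this step, the continuous and ordinal columns are on a common scale, so no single variable dominates the kernel computations purely through its units.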

## Selecting variables with known underlying truth

To have a better understanding of the methods in terms of their performance in different situations, we simulated a data set based on the model below,

Y_i = 2 X_{i3} + 3 cos(X_{i1} X_{i2}) + ε_i,  ε_i ~ N(0, 0.25),  i = 1, ..., N,

where X_i is a vector of length p = 12 representing a set of candidate variables and N = 100. Among the 12 candidate variables, the first three are important, such that Y is non-linearly associated with the interaction of X_1 and X_2, and linearly associated with X_3. A simple program to generate the data is below, in which the candidate variables are generated from uniform distributions, each with different lower and upper bounds.

> set.seed(seed = 1000)

> n = 100

> numCov = 0

> numX = 12

> #generate candidate variables

> x = matrix(rep(0,numX*n),ncol = numX)

> for (i in 1:numX)

+ {

+ x[,i] = runif(n,0.0001, numX/(2*i))

+ }

> # non-linear interaction

> xInter = 3*cos(x[,1]*x[,2])

> delta = c(0,0,1,rep(0,numX-3))

> xEff = rep(2,length = numX)*delta

> # generate observations of y

> mu = x %*% xEff + xInter

> Var = diag(0.25, n, n)

> y = rmnorm(1, mu, Var)

> y = t(y)

> data = cbind(t(y),x)

We apply the methods discussed so far to the simulated data, starting with the method in cosso. The program is comparable to the one discussed earlier for the epigenetic data. Again, we use a scree plot to assist our decision on the number of important variables by plotting the sorted L2-norms. For the purpose of illustration, we include the scree plot in Figure 7.1. The plot suggests three important variables, and the three correct variables, 1, 2, and 3, are selected by checking the names of sortedNorm.

> Y = data[,1]

> X = rep(1,length(Y))

> Candid = as.matrix(data[,-1])

> results = cosso(Candid,Y,family = "Gaussian")

> colnames(results$tune$L2norm) = seq(1, ncol(Candid))

> norms = results$tune$L2norm[(ncol(Candid)+2),]

> names(norms) = colnames(results$tune$L2norm)

> sortedNorm = sort(norms,decreasing = TRUE)

> plot(sortedNorm,pch = 19,cex = 0.5,ylab = "Sorted L2-Norm")

> lines(sortedNorm)

> names(sortedNorm[1:3])

[1] "2" "1" "3"

The two approaches in the package FWDselect give different results. The programs are similar to those in the real data application and thus are not included here. With selection paired with test, variables 1 and 3 are selected, while with qselection, the model including variables 1 to 3 plus two additional variables, 8 and 11, gives the smallest deviance. Together with the real data application, this suggests that the methods in FWDselect do not perform well in general.

Figure 7.1 The scree plot of L2-norms for the simulated data.

We next apply the backward selection method in npvarselec. Again, we omit the program here to avoid duplication. With the backward selection method, variables 3 and 8 are selected. The other two approaches, forward and forward2, give the same results. The true variables 1 and 2 are not selected by any of the three approaches. It is unclear whether this is due to the Nadaraya-Watson kernel regression estimation of the unknown functions in the model of (7.4). Finally, we apply the Zhang et al. method to the same data,

> Y = data[,1]

> X = rep(1,length(Y))

> Candid = as.matrix(data[,-1])

> VarSelResults = VarSel(Y, X, Candid, rho = 1, numCov = 0,

+ numItr = 500, numChain = 1, a0 = 0.001, b0 = 0.001, sigB2 = 10)

> VarSelResults$VarSelect

[1] 1 2 3

>

> VarSelResults$orderProp

Variable Index Post. Prob

[1,] 1 1.0000

[2,] 2 1.0000

[3,] 3 1.0000

[4,] 11 0.0848

[5,] 12 0.0848

The first three variables, which are the truly important variables, are selected by the Zhang et al. method with probability 1. For this simulated data, the method in Zhang et al. [160] outperforms the three approaches in the package npvarselec.

We further apply the Zhang et al. method to a situation where the number of variables is larger than the sample size. The simulation scenario is similar, except that the number of candidate variables is increased to 100 with a sample size of 100, and the candidate variables are generated from uniform distributions with lower and upper bounds specified as follows,

> for (i in 1:numX)

+ {

+ x[,i] = runif(n,0.0001, numX/(20*i))

+ }

Including the newly simulated data in the data matrix as done earlier and running the function VarSel with similar settings as before,

> VarSelResults = VarSel(Y, X, Candid, rho = 1, numCov = 0,

+ numItr = 100, numChain = 1, a0 = 0.001, b0 = 0.001, sigB2 = 10)

The three truly important variables are all correctly selected with a posterior selection probability of 1 and are the only variables selected, although the posterior probabilities of 30 other variables are all above 0.5.

> VarSelResults$VarSelect

[1] 1 2 3

> VarSelResults$orderProp

Variable Index Post. Prob

[1,] 1 1.00000

[2,] 2 1.00000

[3,] 3 1.00000

[4,] 50 0.68750

[5,] 63 0.65625

[6,] 92 0.65625

[7,] 48 0.62500

[8,] 100 0.59375

[9,] 28 0.56250

[15,] 99 0.56250

[16,] 29 0.53125

[21,] 69 0.53125

[22,] 34 0.50000

[33,] 98 0.50000

[34,] 21 0.46875

[98,] 5 0.03125

[99,] 4 0.00000

[100,] 6 0.00000

Among the four approaches discussed in this chapter, in the DNA methylation example, the findings from the method by Zhang et al. [160] agree with most of the other three approaches; in the example based on simulated data, the methods in cosso and Zhang et al. [160] perform the best. The advantage of Zhang et al., as well as Terry et al. [137], lies in their ability to select variables when the number of candidate variables is larger than the sample size.