Methods to select genetic and epigenetic factors based on linear associations
In genetic and epigenetic studies, among a large number of candidate genes possibly associated with a health condition under study, it is desirable to identify important genes critical to disease risk. This type of effort is in the scope of variable selection problems. In this area, variable selection in linear or generalized linear regression models is often conducted. In addition to the methods built on Akaike information criterion (AIC) and Bayesian information criterion (BIC), more advanced variable selection methods have been proposed. In the following two subsections, we discuss methods under the frequentist framework and those built upon Bayesian inference. In all the variable selection methods, we consider the following classical linear regression model that has p candidate variables,
where /3 = (/3, • • • , fip)T and е* ~ Лг(0, a2).
In the following sections, we introduce frequentist and Bayesian approaches, and demonstrate the methods using simulated data. In the last section, we apply the methods to DNA methylation data.
We focus on three methods: elastic net, adaptive LASSO, and SCAD. Elastic net proposed by Zou and Hastie [1C6] is a regularization technique applied in variable selection and is a hybrid of LASSO  and ridge regression . This technique is capable to appropriately select variables when predictors are correlated. The adaptive LASSO in  is an extension of LASSO and aimed to reduce the estimation bias in the LASSO developed by Tibshirani  to achieve the oracle properties, meaning that the method will correctly select the model as if the correct submodel were known. Another variable selection method also enjoying the oracle properties is the nonconcave penalized likelihood method developed by Fan and Li , where a smoothly clipped absolute deviation (SCAD) penalty is introduced. To estimate regression coefficients on important variables, these methods aim to solve the following penalized objective function,
where Wi denotes weights and reflects the contribution of each observation, and l(yi,po + f3TXi) is the negative log-likelihood of each observation. When the distribution of у is Gaussian, /(•) is equivalent to a squared error of each observation. In the last term, p(/3, Л, •) is a penalty and is a function of /3 and other to-be-defined parameters represented by In all these techniques, a regularization parameter Л needs to be specified, usually determined through a grid search via cross-validation.
The penalty in elastic net is defined as
where a is a mixing parameter. When a = 1, p(/3.1) becomes the L penalty in LASSO and when a = 0, it is the /_2 penalty in ridge regressions. Since the penalty in LASSO penalizes the sum of absolute values of coefficients, when Л is large, many coefficients are shrunk to zeros, which never happens when the L-2 penalty as in ridge regressions is applied. Elastic net penalty, on the other hand, is a hybrid of these two types of techniques and it has been suggested that this approach can outperform LASSO with correlated candidate variables .
We have two parameters to tune in elastic net, a and Л, both of which can be tuned via re-sampling techniques. The R function glmnet in the package glmnet tunes Л for a fixed a, but it does not tune a. The R package caret is able to tune both a and Л through various re-sampling approaches such as cross-validation. It tests a range of possible a and Л values, and selects the best values for these two parameters which optimizes a statistic, e.g., prediction error (root mean square error) or deviance. To demonstrate the elastic net technique in variable selection, we first simulate a data set that has 10 candidate variables.
> sampsize = 200
> numVar = 10
> meanX = c(rep(2,3),rep(10,3),5,0.1,2, 0.8)
> varX = diag(c(rep(10,3),rep(5,3),5,2,1,2),length(meanX),
> x = mvrnorm(sampsize,meanX,varX)
> # the second and third variables are important variables
> beta = c(0, 0.5, l,rep(0,7))
> sig = diag(l,sampsize,sampsize)
> у = mvrnorm(l,x°/t*°/tbeta,sig)
> data = cbind(y,x)
Of the 10 candidate variables, two variables are associated with the response variable with regression coefficients 0.5 and 1. Next, we use the caret package to tune the two parameters, a and Л, in elastic net and select important variables (i.e., variables with non-zero regression coefficients).
> # Define training control
> train_Control = trainControl(method = "cv",number = 10,
+ search = "random")
The function trainControl is to set up the control parameters to be used in the training process to tune a and Л. The argument method is to specify a re-sampling approach. A number of approaches are available and commonly used approaches are cv for cross-validation and repeadedcv for Monte-Carlo cross-validation or repeated random subsampling validation. Two options are available for search, grid and random, to specify the parameter space for the purpose of identifying the best values of a and Л. Compared to grid, the option random covers the parameter space to a lesser extent. In this example, the re-sampling method is a 10-fold cross-validation indicated by method = "cv" and number = 10. Next, based on the selected methods and parameters, we start the training process using the function train.
> TuneParms = train(y~., data = data, method = "glmnet",
+ trControl = train_Control, tuneLength = 60)
> # Best tuning parameters
alpha lambda 56 0.8972444 0.1311761
The method argument specifies that elastic net is used in this tuning process. In our example, random search instead of grid search is used for candidate values of or and Л, as defined in trainControl, the setting tuneLength=60 in this case defines the maximum number of tuning parameter combinations (which is GO) that will be generated by the random search. On the other hand, if grid search is used, then GO is the number of levels for each tuning parameter to be generated. In our example, the best tuning parameters were reaching at the 56-th combination, which are 0.90 and 0.13.
Data of the predictors could also be pre-processed by use of the preProcess argument in train. For instance, to standardize the predictors to eliminate unit-dependence, we set preProcess=c("center", "scale").
The coefficients of the selected variables based on the tuned a and Л are extracted by coef. In total, two variables are selected, consistent with the underlying true model.
> coef(TuneParms$finalModel, TuneParms$bestTune$lambda)
- 11 x 1 sparse Matrix of class "dgCMatrix"
- (Intercept) 0.05154811 XI
Note that this process does not provide a statistical significance of each selected variable. To infer the statistical significance, a commonly chosen approach is to select the variables using a subset of data (assuming the sample size is larger than the number of candidate variables), and then use the remaining data to fit a linear regression model, e.g., using lm. to infer the statistical significance of each variable.
The approach of adaptive LASSO was proposed by Zou  and it was evolved from LASSO. Compared to LASSO, adaptive LASSO has the oracle properties under a proper choice of Л. The penalty in adaptive LASSO is defined as
where is the estimate of regression coefficients based on ordinary least squares or from the best ridge regression in the situation of collinearity. We use the same simulated data to demonstrate this approach, starting from inferring со) based on the best ridge regression via a 10-fold cross- validation.
> # Use ridge regression to construct the weights
> best_ridge<- cv.glmnet(x = x, у = y, type.measure = "mse",
+ nfold = 10, alpha = 0)
> lambda_tuned = best_ridge$lambda.min
> lambda_tuned  0.3551413
> best_ridge_coef = as.numeric(coef(best_ridge,
s = lambda_tuned)) [-1]
> # the weights, omega, calculated based on gamma=l
> omega = 1 / abs(best_ridge_coef)
 9978.806663 2.206854 1.056506 24.349330 23.465679
 21.591722 108.207493 71.379271 18.952099 15.469180
The parameter Л is tuned using 10-fold cross-validation in ridge regression using the R function cv.glmnet. The statistics mean squared error (mse) assesses the loss and is used to choose the best Л, and the value of Л corresponding to the smallest mse is 0.36. Next, w is estimated as 1/1/3, | setting 7 = 1 by first extracting the estimated coefficients (excluding the intercept) using coef and then calculating oj as omega = 1 / abs(best_ridge_coef). Based on the findings in , it seems taking 7 = 1 overall works reasonably well.
Finally, we apply the weights and utilize glmnet to select informative variables and infer the coefficients of those variables.
> # use glmnet with the calculated adaptive LASSO weights, omega,
> # to perform variable selection.
> adaLASSO = cv.glmnetCx = x, у = у,type.measure = "mse",nfold = 10,
+ alpha = 1,penalty.factor = omega)
> plot(adaLASSO,xlab = expression(paste("log(".lambda,")")))
> adaLASSOSlambda.min  10.78594
> coef(adaLASSO, adaLASSOSlambda.min)
- 11 x 1 sparse Matrix of class "dgCMatrix"
- (Intercept) -0.09148639 VI
In the above, a 10-fold cross-validation is used under the adaptive LASSO penalty specified by alpha=l and penalty.factor=omega. The tuned Л by minimizing mse using cross-validation is 10.79 (Figure 6.1). The selected variables using adaptive LASSO are consistent with the truth and the estimated coefficients are closer to the true values, compared to the results from elastic net.
Smoothly clipped absolute deviation (SCAD)
The smoothly clipped absolute deviation (SCAD) penalty is proposed by Fan and Li  and its derivative is given by
Figure 6.1 Mean square errors vs. logA in cross-validation to tune A in adaptive LASSO.
which leads to
j = 1, ■ • • ,p, where Fan and Li  suggested to set a = 3.7. The SCAD penalty is continuously differentiable from —oo to +oc except at 0, resulting in small coefficients being shrunk to zero with large coefficients un-penalized. The SCAD penalty, as the adaptive LASSO, also enjoys the oracle property . To apply the SCAD penalty in variable selection, the R package ncvreg can be applied. Again, we use the simulated data discussed earlier to demonstrate this approach using ncvreg.
> SCAD = cv.ncvreg(x, y, penalty = "SCAD", family = "gaussian",
+ nfolds = 10, alpha = 1, seed = 12345)
> SCAD$lambda.min  0.3551413
- (Intercept) VI V2 V3 V4
- -0.1134016 0.0000000 0.4984394 1.0312255 0.0000000
V5 V6 V7 V8 V9
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
As in the other methods, a 10-fold cross-validation is applied to tune A. Argument alpha is set at 1 to indicate the penalty is Ly penalty. The best A is selected as 0.36 by minimizing deviance (Figure G.2), based on which variables X2 and X3 are selected with estimated parameters close to the truth.