# SURE INDEPENDENCE SCREENING

This approach treats the screening factor as the dependent variable, and all variables to be screened are included in a regression model as independent variables. In this case, the screening is built upon variable selection. The method, proposed by Fan and Lv [38], is composed of two steps. The first step uses the concept of sure independence screening, also known as correlation learning [45], to filter out the variables that have weak correlations with the screening factor. A property of sure screening is that all the important variables are kept after variable screening with probability approaching 1. Sure independence screening is then followed by the implementation of variable selection techniques, such as the SCAD method proposed by Fan and Li [37], the adaptive LASSO by Zou [165], the Dantzig selector by Candes and Tao [16], or other Bayesian methods [52, 159].

## Correlation learning

The sure independence screening method is built upon correlation learning, an old and computationally simple approach to variable selection. It starts from a set of simple linear regressions with the predictors (variables) standardized and the response variable (screening factor) centered. Following the notation of Fan and Lv [38], the estimates of the regression coefficients in these simple linear regressions are defined as *β̂* = *X*^T*y*/*n*, where *X* is the *n* × *p* data matrix of variables, with *n* the sample size and *p* the number of variables, and *y* = (*Y*₁, …, *Y_n*)^T is the centered screening factor. It is not difficult to see that *X*^T*y* is a vector of sample correlations between each of the variables and the screening factor, re-scaled by the standard deviation of *y*.

Let *M*∗ denote the true model, with size *s* = |*M*∗|, and let *M*_γ denote a submodel with size [γ*n*], where γ is a tuning parameter that needs to be pre-specified and [γ*n*] denotes the integer part of γ*n*. Applying sure independence screening, the variables in the submodel *M*_γ are chosen based on the magnitudes |*ω_i*|, where *ω* = *n β̂* = *X*^T*y*, such that the top [γ*n*] variables are selected and included in the next step of screening. Fan and Lv showed that the procedure has the sure screening property,

P(*M*∗ ⊂ *M*_γ) → 1 as *n* → ∞,

for some given γ. Under some regularity conditions, it is suggested to take [γ*n*] = *n* − 1 or [γ*n*] = *n*/log(*n*).
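To make the ranking step concrete, the correlation learning step can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the toy data, the sample sizes, and the helper `standardize` are all invented for demonstration.

```python
import math
import random

random.seed(1)
n, p, d = 50, 8, 3  # sample size, number of variables, submodel size [gamma*n]

# Toy data: the screening factor depends on the first two variables only.
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y_raw = [2 * row[0] - 1.5 * row[1] + random.gauss(0, 0.5) for row in X]

def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    m = sum(col) / n
    s = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
    return [(v - m) / s for v in col]

cols = [standardize([row[j] for row in X]) for j in range(p)]
ybar = sum(y_raw) / n
y = [v - ybar for v in y_raw]

# omega = X^T y: each entry is a sample correlation with y, re-scaled by sd(y).
omega = [sum(cols[j][i] * y[i] for i in range(n)) for j in range(p)]

# Sure independence screening: keep the d variables with the largest |omega_j|.
selected = sorted(range(p), key=lambda j: abs(omega[j]), reverse=True)[:d]
print(sorted(selected))
```

With the two truly important variables dominating the correlations, indices 0 and 1 survive the screen along with one spurious variable, which the subsequent variable selection step is expected to remove.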

With the variables to be screened entering the regressions as independent variables, there are some concerns about the selection of truly important variables using the sure independence screening approach. Unimportant variables highly correlated with the important ones may be selected. In general, collinearity makes it difficult to correctly select the truly important variables. Also, a variable that is marginally uncorrelated but jointly correlated with the screening factor will not be selected. To overcome these problems, Fan and Lv proposed two solutions. One approach applies the sure independence screening approach iteratively, and the other groups and transforms the variables, aiming to improve the chance of selecting the correct variables and to reduce collinearity.

*Iterative correlation learning* Iterative correlation learning applies the correlation learning approach iteratively to variable selection. This approach first selects a subset of variables using sure independence screening, followed by a variable selection technique such as SCAD or LASSO based on the joint information of the [*n*/log(*n*)] variables chosen by the correlation learning. Denote this set of selected variables as *A*₁. Next, the residuals are calculated by regressing the screening variable *y* on the selected variables. The residuals are then treated as new responses and are used in the next step of sure independence screening followed by variable selection over the set of variables not in *A*₁. The resulting set of selected variables is denoted as *A*₂, which is mutually exclusive with *A*₁. The residuals are uncorrelated with the variables in *A*₁, and thus the selection in this step is less influenced by high correlations between unimportant and important variables. The use of residuals also serves the goal of selecting important variables that are marginally uncorrelated but jointly correlated with the response. However, if a variable is highly correlated with a variable already in *A*₁, it may still deserve selection; using residuals as the response may exclude such variables.

Nevertheless, the process of sure independence screening followed by variable selection continues, and each time we obtain a set of selected variables disjoint from the previously selected sets. The final set of variables that pass the screening is the union of all the disjoint sets,

*A* = *A*₁ ∪ *A*₂ ∪ ⋯ ∪ *A_l*,

where *l* can be chosen such that |*A*| = *d* < *n*, as suggested by Fan and Lv [38].
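The iteration described above can be sketched in Python. This is an illustration only: the toy data are invented, and a crude sequence of univariate residual fits stands in for the joint SCAD/LASSO selection step of the actual method.

```python
import random

random.seed(2)
n, p = 60, 10

# Toy data: variables 0 and 3 drive the screening factor; the rest are noise.
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1.5 * r[0] - 2.0 * r[3] + random.gauss(0, 0.5) for r in X]

def top_by_correlation(resp, candidates, k):
    """Correlation learning: rank candidate columns by |sum_i X_ij * resp_i|."""
    score = {j: abs(sum(X[i][j] * resp[i] for i in range(n))) for j in candidates}
    return sorted(candidates, key=lambda j: -score[j])[:k]

def residualize(resp, cols):
    """Remove the effect of the selected columns by successive univariate fits
    (a crude stand-in for a joint fit followed by SCAD/LASSO selection)."""
    r = list(resp)
    for j in cols:
        b = sum(X[i][j] * r[i] for i in range(n)) / sum(X[i][j] ** 2 for i in range(n))
        r = [r[i] - b * X[i][j] for i in range(n)]
    return r

selected, remaining, resp = [], set(range(p)), y
for step in range(3):                             # a few screening iterations
    A_k = top_by_correlation(resp, remaining, 1)  # disjoint set chosen this round
    selected += A_k
    remaining -= set(A_k)
    resp = residualize(resp, A_k)                 # residuals become the new response
print(selected)
```

Because each round screens against residuals, the sets chosen in successive rounds are disjoint, mirroring the construction of *A*₁, *A*₂, … above.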

*Variable grouping and transformation* The goal of grouping variables is to utilize the joint information from a group of variables. The sure independence screening approach is applied to the groups, which are then included in the subsequent variable selections. However, this approach may not resolve the concerns about correlation and collinearity. Transforming the variables is likely to be more meaningful. Fan and Lv [38] suggested two approaches: subject-related transformation and statistical transformation. Subject-related transformations are comparable to the transformations commonly used in statistical analyses, e.g., centering variables or taking differences between variables to reduce correlations or collinearity. Statistical transformations utilize principal component analyses to identify weakly correlated new variables, to which sure independence screening is applied to select variables. One disadvantage of this approach lies in the interpretation of the selected variables. In genetic and epigenetic studies, each component may cover different genes on different chromosomes, and it may be hard to summarize the functionality of the identified components and the subsequent selection results.

To implement the approach of sure independence screening coupled with variable selections, Saldana and Feng [123] developed an R package built upon the work of Fan and Lv [38]. The SIS function in the package implements the iterative sure independence screening approach, and then uses the R packages ncvreg and glmnet for variable selections using SCAD, minimax concave penalty (MCP), or LASSO.

> model = SIS(x, y, family = "gaussian", penalty = "lasso", tune = "bic")

> model$ix

In the above, x provides the design matrix of the *p* variables, with *n* rows and *p* columns, and y is the response vector (the screening variable) of length *n*. The argument family refers to the assumed family of distributions of the data. In addition to gaussian, other choices of family are binomial for a binary response variable and cox for time-to-event data. If family = "cox", then y should be an object created by Surv() in the package survival. A penalty is utilized in the variable selections to exclude variables that do not contribute to explaining the variation in y. The default penalty is SCAD. A tuning parameter is critical in variable selections. Besides bic, other tuning parameter selectors include "ebic" (extended Bayesian information criteria, [19]), "aic", and "cv" (cross-validation). The indices of the selected variables are contained in the vector ix.

# NON- AND SEMI-PARAMETRIC SCREENING TECHNIQUES

In addition to the parametric methods, non-parametric and semi-parametric approaches have been proposed to screen variables. In this section, we discuss two types of such approaches: random forest-based methods and support vector machines.

## Random forest

Random forest [12] is a machine learning approach for classification and regression. It constructs a large number of decision trees, or classifiers, and aggregates the results from the trees to reach a final conclusion for classification and regression. The approach is built upon a successful machine learning technique, bagging [11]. Each tree in bagging is independently constructed using bootstrap samples, and the final decision is reached by a majority vote.

Compared to bagging, random forest adds another layer of randomness to the construction of the trees. For each bootstrap sample, instead of building a tree using all the variables, each tree in a random forest is built upon a subset of variables. That is, at each node of a tree, the split is determined based on a subset of randomly selected variables instead of all the variables. An immediate advantage is computational efficiency. More importantly, this strategy is robust against overfitting and performs well compared to other machine learning approaches, such as the support vector machines discussed later.

An R package randomForest is available to implement the algorithm proposed by Breiman [12]. Liaw et al. [98] discuss the usage and features of the package, and provide insightful notes for practical application of the package. In the following, we outline the algorithm, followed by examples of using randomForest.

*The algorithm in randomForest* The algorithm starts from drawing a random bootstrap sample from the original data.

- 1. Draw a random subset of data from the complete data.
- 2. At each node, randomly select a subset of variables, and determine the best split based on those variables. Do this for all the *l* nodes, where *l* is the maximum number of nodes allowed.
- 3. Repeat steps 1. and 2. for a number of bootstrap samples to build a forest.

- 4. Use the rules of each tree in the forest to predict the outcome of the remaining data. The remaining data are not used in the construction of the forest, and thus are called out-of-bag (OOB) data.
- 5. A prediction for the OOB data is made by aggregating across the forest: a majority vote across all the trees for classification, and an average across all tree outcomes for regression. An OOB prediction error rate is then calculated. For classification, the percentage of misclassifications is the prediction error; for regression, mean squared errors are calculated by comparing the observed and the predicted values.
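The five steps above can be sketched with a deliberately tiny forest of one-split "stumps." This is a simplified illustration in Python, not the randomForest algorithm itself: the data, the zero-threshold stumps, and the subset size are all invented for demonstration.

```python
import random
from collections import Counter

random.seed(3)
n, p, ntree = 200, 5, 50

# Toy classification data: the class label depends mostly on variable 0.
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
labels = [1 if row[0] + random.gauss(0, 0.5) > 0 else 0 for row in X]

def fit_stump(rows, ys, f):
    """A one-split 'tree': threshold feature f at 0, vote by side majority."""
    left = [y for r, y in zip(rows, ys) if r[f] <= 0]
    right = [y for r, y in zip(rows, ys) if r[f] > 0]
    vote = lambda side: Counter(side).most_common(1)[0][0] if side else 0
    return (f, vote(left), vote(right))

def stump_errors(stump, rows, ys):
    f, lv, rv = stump
    return sum((lv if r[f] <= 0 else rv) != y for r, y in zip(rows, ys))

forest, oob_sets = [], []
for _ in range(ntree):
    idx = [random.randrange(n) for _ in range(n)]   # step 1: bootstrap sample
    rows = [X[i] for i in idx]
    ys = [labels[i] for i in idx]
    feats = random.sample(range(p), 3)              # step 2: random variable subset
    stump = min((fit_stump(rows, ys, f) for f in feats),
                key=lambda s: stump_errors(s, rows, ys))
    forest.append(stump)                            # step 3: grow the forest
    oob_sets.append(set(range(n)) - set(idx))       # step 4: out-of-bag indices

# Step 5: aggregate OOB predictions by majority vote and report the error rate.
errors = counted = 0
for i in range(n):
    votes = [(lv if X[i][f] <= 0 else rv)
             for (f, lv, rv), oob in zip(forest, oob_sets) if i in oob]
    if votes:
        counted += 1
        errors += Counter(votes).most_common(1)[0][0] != labels[i]
print(round(errors / counted, 3))  # OOB misclassification rate
```

Each observation is predicted only by the trees that did not see it during training, which is what makes the OOB error rate an honest estimate of the prediction error.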

The randomForest package provides an important statistic, named “variable importance,” that can be used to select important variables. Variable importance is evaluated based on a variable’s contribution to reducing the OOB prediction error rate, and is estimated by permuting the values of the variable of interest. The above algorithm is implemented in the function randomForest of the package. We use the Forensic Glass data set in Ripley [119], also analyzed in Liaw et al. [98], to demonstrate this function. The data contain 9 measured physical characteristics for each of 214 glass fragments, and these fragments are classified into 6 categories (denoted as type in the code below).

> library(randomForest)

> library(MASS)

> data(fgl)

> head(fgl)

> set.seed(17)

> RFobject = randomForest(type ~ ., data = fgl,

+ mtry = 2, importance = TRUE,

+ do.trace = 100)

The data format is as the following,

> head(fgl)

RI Na Mg Al Si K Ca Ba Fe type

- 1 3.01 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 WinF
- 2 -0.39 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 WinF
- 3 -1.82 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 WinF
- 4 -0.34 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 WinF
- 5 -0.58 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 WinF
- 6 -2.04 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 WinF

When building the forest, mtry = 2 variables are selected from the 9 available ones and used to determine the split at each node. The default value of mtry is √*p* for classification and *p*/3 for regression. By setting importance = TRUE, the importance of each predictor is estimated, and by setting do.trace = 100, OOB errors are printed for every 100 trees. The importance values help select important variables for subsequent analyses, as demonstrated below.

> GiniSorted = sort(RFobject$importance[,8], dec = TRUE)

> Labels = names(GiniSorted)

> plot(GiniSorted, type = "h", xlab = "Physical characteristics", xaxt = "n", main = "Decrease in mean Gini indices")

> axis(side = 1, at = seq(1, 9), labels = Labels)

In the above code, RFobject$importance is a matrix composed of the importance values of each physical characteristic across the 6 categories, as well as averaged importance values. The 8th column of the matrix holds the average decreases in the Gini index, which can be used to determine which characteristics could be removed from further analyses. To visualize the pattern, we sort the average Gini indices using the function sort, and the last two lines above generate the graph shown in Figure 4.2. Using the concept of the scree plot introduced in Chapter 2, *Ba* and *Fe* are likely the least important variables.

*Recursive random forest.* Applications of random forest in variable screening have been discussed in various studies [29, 47]. Measures of variable importance from the random forest output give users some idea about the importance of each variable. However, they can be used more effectively. Granitto et al. [61] and Diaz-Uriarte and De Andres [26] independently utilize variable importance, along with OOB error rates, in an iterative process of screening important variables. The advantage of iteratively implementing random forest is the potential to improve the stability of the variables that pass the screening, and consequently to benefit the biological interpretability of the results obtained. The approach of Granitto et al. [61] and that of Diaz-Uriarte and De Andres [26] are comparable. In the following, we focus on the latter [26].

Figure 4.2 Sorted mean Gini indices of each physical characteristic from randomForest (the Forensic Glass data).

The procedure of recursive random forest starts by running one random forest with all the candidate variables. We then extract the variable importance measures from the random forest output. After sorting the importance measures, the variables with importance measures at the lower end are dropped. The percentage of variables to be dropped is a subjective choice, controlled by the aggressiveness of the user. The random forest procedure is then run again on the retained subset of variables. The process continues until the OOB error rate meets a pre-specified cut-off; the variables from the prior step then become the selected variables to be used in subsequent analyses (Figure 4.3). Two criteria have been proposed to specify the cut-off that stops the iterations [26]. The first is to choose the smallest number of variables whose error rate is within one standard error of the minimum error rate over all forests; the second is to select the set of variables that gives the smallest error rate. Based on the findings of Diaz-Uriarte and De Andres [26], the one-standard-error rule tends to choose a smaller number of variables. In Figure 4.3, the minimum error rate rule is employed.
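The recursive loop can be sketched as follows. For a self-contained illustration, absolute correlation with y stands in for the random forest permutation importance, and a simple holdout mean squared error stands in for the OOB error rate; both stand-ins, and the toy data, are invented and are not the method of [26].

```python
import random

random.seed(4)
n, p = 80, 12

# Toy data: only variables 0, 1, and 2 influence the response.
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [r[0] + r[1] - r[2] + random.gauss(0, 0.3) for r in X]

def importance(active):
    """Stand-in for variable importance: |sum_i X_ij * y_i| per active column."""
    return {j: abs(sum(X[i][j] * y[i] for i in range(n))) for j in active}

def holdout_error(active):
    """Stand-in for the OOB error rate: fit univariate coefficients sequentially
    on the first half of the data, return mean squared error on the second half."""
    half = n // 2
    r = [y[i] for i in range(half)]
    coefs = {}
    for j in active:
        b = sum(X[i][j] * r[i] for i in range(half)) / sum(X[i][j] ** 2 for i in range(half))
        coefs[j] = b
        r = [r[i] - b * X[i][j] for i in range(half)]
    preds = [sum(coefs[j] * X[i][j] for j in active) for i in range(half, n)]
    return sum((y[i + half] - preds[i]) ** 2 for i in range(half)) / half

active = list(range(p))
history = [(list(active), holdout_error(active))]
while len(active) > 2:
    imp = importance(active)
    active.sort(key=lambda j: imp[j])
    active = active[max(1, len(active) // 5):]   # drop the lowest ~20% each round
    history.append((list(active), holdout_error(active)))

# Minimum-error rule: keep the set with the smallest recorded error.
best = min(history, key=lambda h: h[1])[0]
print(sorted(best))
```

Swapping the stand-ins for true random forest importances and OOB errors, and replacing the minimum-error rule with the one-standard-error rule, recovers the variants discussed above.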

## Support vector machine

Figure 4.3 The flow chart of the recursive random forest screening algorithm [26].

The support vector machine (SVM) is a classification algorithm that finds a separating hyperplane with the maximal margin between two classes of data, determined by the support vectors [143]. SVM was originally designed as a classifier for binary outcomes, and has been extended to handle multiclass problems. Some studies have compared random forest (RF) and SVM and suggested that RF can outperform SVM for specific types of data [101], but in bioinformatics studies SVM has tended to be recommended over RF [136]. Gaussian kernels are commonly applied in SVM to classify objects flexibly. In addition to serving as a classifier, the weights of an SVM classifier can be used to estimate importance values for each variable. The fit function in the package rminer is able to estimate such importance values, based on which we can screen variables.

> library(rminer)

> M = fit(type ~ ., data = fgl, model = "svm", kpar = list(sigma = 0.10))

> svmImp = Importance(M, data = fgl, method = "1D-SA")

> Importance = svmImp$imp[1:9]

> names(Importance) = colnames(fgl)[1:9]

> sortImp = sort(Importance, dec = TRUE)

> Labels = names(sortImp)

> plot(sortImp, type = "h", xlab = "Physical characteristics", xaxt = "n",

+ main = "Sorted importance values")

> axis(side = 1, at = seq(1, 9), labels = Labels)

We first run the SVM classifier using the function fit. Many other choices of model are available, e.g., bagging, and lm for regular linear regressions. A Gaussian kernel is used, and the tuning parameter of the kernel is specified by kpar, here sigma = 0.10. The importance values are then determined using sensitivity analyses in Importance. Several sensitivity analysis methods are available; the default approach is 1D-SA, a one-dimensional sensitivity analysis that does not consider interactions. Once the importance values are extracted (svmImp$imp), we plot them (Figure 4.4) and can use the scree plot introduced in Chapter 2 to determine the potentially important variables.

Figure 4.4 Sorted importance values of each physical characteristic from SVM in fit (the Forensic Glass data).