# Non- and semi-parametric methods to select genetic and epigenetic factors

The methods discussed in Chapter 6 focus on selecting variables in parametric linear models. Genetic and epigenetic effects on the outcome, however, are not necessarily linear and can take any unknown form. Methods for feature selection in non-linear models have been developed [18, 116, 99, 126]. These methods are generally built upon splines or Taylor series expansions and may have difficulty accommodating a large number of predictors and describing complex interaction effects. In genetic or epigenetic studies, the number of candidate predictors is relatively large due to biological uncertainty. Genes or epigenes (genes associated with epigenetic variants) do not necessarily function individually; rather, they work in concert with others to manifest a disease condition. In the area of machine learning, methods of variable selection in semi-parametric models constructed from reproducing kernels have been discussed [121, 160, 153, 154], which are able to deal with complex interactions. Terry et al. [137] proposed a unified variable selection approach that has the ability to select variables from a mixture of continuous (e.g., DNA methylation) and ordinal variables (e.g., the number of alleles in a genotype, 0, 1, or 2). In this chapter, we will discuss two methods based on splines, Lin and Zhang [99] and Sestelo et al. [126], one approach based on Nadaraya-Watson kernel regression estimation proposed by Zambom and Akritas [153], and the methods by Zhang et al. [160] and Terry et al. [137], which utilize reproducing kernels.

In the following sections, we introduce each of the four approaches, followed by demonstrations of these methods in Section 7.4 using the epigenetic data discussed in Chapter 6 as well as simulated data.

## VARIABLE SELECTION BASED ON SPLINES

Lin and Zhang [99] proposed an approach able to deal with non-linear models, named the COmponent Selection and Smoothing Operator (COSSO). Unlike other penalty-based variable selection approaches, the penalty function in COSSO is the sum of component norms rather than the sum of squared norms used in traditional smoothing spline methods. The penalized objective function in COSSO is defined as

$$\min_{f \in \mathcal{F}}\ \frac{1}{n}\sum_{i=1}^{n}\{y_i - f(x_i)\}^2 + \tau_n^2 J(f), \tag{7.1}$$

where $p$ represents the number of orthogonal subspaces in a function space $\mathcal{F}$, and in the case of a main-effects space, $p$ is the number of candidate variables. The parameter $\tau_n$ is a smoothing parameter to be tuned. The penalty $J(f)$ is the sum of reproducing kernel Hilbert space norms. It is a convex function and a pseudo-norm; that is, for any $f, g \in \mathcal{F}$, $J(f) \geq 0$, $J(cf) = |c|J(f)$, $J(f + g) \leq J(f) + J(g)$, and for any nonconstant $f \in \mathcal{F}$, $J(f) > 0$. We have

$$J(f) = \sum_{j=1}^{p} \|P^j f\|,$$

where $P^j f$ denotes the orthogonal projection of $f$ onto the $j$-th subspace.
Lin and Zhang [99] show that a COSSO estimate exists due to the convexity of the penalized objective function. An equivalent form of the penalized objective function (7.1) is given as [99]

$$\min_{f \in \mathcal{F},\,\theta}\ \frac{1}{n}\sum_{i=1}^{n}\{y_i - f(x_i)\}^2 + \lambda_0 \sum_{j=1}^{p} \theta_j^{-1}\|P^j f\|^2 + \lambda \sum_{j=1}^{p} \theta_j, \qquad \theta_j \geq 0,$$

where $\lambda_0$ can be any fixed positive value and $\lambda$ is a smoothing parameter that needs to be tuned. To tune $\lambda$, Lin and Zhang [99] propose using 5-fold cross-validation or generalized cross-validation to minimize the estimated risk.

COSSO is implemented in R, and the R function cosso can be applied to several types of regression models, including mean (least-squares) regression, logistic regression, quantile regression, and Cox models.

Also using splines to describe non-linear associations, Sestelo et al. [126] developed an R package FWDselect for variable selection in an additive regression model,

$$Y = \beta_0 + \sum_{j=1}^{p} m_j(X_j) + \varepsilon,$$

where $m_j$, $j = 1, \cdots, p$, are smooth and unknown functions, $\varepsilon$ represents random errors with mean zero, and $E(m_j(X_j)) = 0$, implying $E(Y) = \beta_0$. In Sestelo et al. [126], penalized regression splines are applied to estimate $m_j$.

The variable selection procedure has two parts. One part is to select the best combination of $q$ variables for a given $q$. This is achieved by optimizing an information criterion such as AIC or deviance, with the selection of variables proceeding through a forward stepwise-based procedure. The other part is to determine the optimal number of covariates to be included in the model through hypothesis testing,

$$H_0(q): \sum_{j=1}^{p} I\{m_j \neq 0\} \leq q,$$

where $I$ is an indicator function. The hypothesis test is based on the contribution of the additional $p - q$ variables to explaining the variation in $Y$. An empirical p-value is estimated from wild bootstrap samples for the purpose of making a decision on the hypothesis test. To select variables, two functions in the package FWDselect are needed, selection and test. Function test is used to determine the best $q$, and function selection is then applied to select the best combination of $q$ variables. Alternatively, the function qselection can be applied: for each sub-model, this function calculates the information criterion, e.g., AIC or deviance, based on which the best model can then be determined.
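The forward stepwise part of this procedure can be sketched in a few lines. The sketch below is illustrative only: it uses ordinary least squares and a Gaussian AIC in place of FWDselect's penalized regression splines and bootstrap test, and the function names are our own.

```python
import numpy as np

def aic_linear(X, y):
    """AIC (up to a constant) of an OLS fit with an intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Xd.shape[1]

def forward_select(X, y, q):
    """Greedily add the variable that most reduces AIC until q are chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(q):
        best_aic, best_j = min(
            (aic_linear(X[:, selected + [j]], y), j) for j in remaining
        )
        selected.append(best_j)
        remaining.remove(best_j)
    return sorted(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.5, size=200)
print(forward_select(X, y, 2))  # variables 1 and 4 should be picked
```

In FWDselect, choosing $q$ itself is handled by the bootstrap test rather than by the criterion alone.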

## OVERVIEW OF THE ANOVA-BASED APPROACH

The approach proposed by Zambom and Akritas [153], denoted as the ANOVA-based approach throughout this section, applies kernel regression to estimate any non-linear associations.

Define a regression model with the following structure,

$$Y_i = m_1(X_{1i}) + m_2(X_{2i}) + \varepsilon_i,$$

where $m_1(X_{1i})$ represents an unknown function describing the association between $Y_i$ and the $(p-1)$-dimensional vector $X_{1i}$, $m_2(X_{2i})$ is the other component of the regression, with $m_2(\cdot)$ unknown and $X_2$ a univariate variable, and $\varepsilon_i$ is the random error such that $\varepsilon_i \sim N(0, \sigma^2)$. The null hypothesis is

$$H_0: m_2(x) = 0 \ \text{for all } x,$$

that is, $E(Y \mid X_1, X_2) = m_1(X_1)$ under $H_0$.

To deal with high dimensional data, Zambom and Akritas [153] adopted the idea of Akritas and Papadatos [2], which performs hypothesis testing in high-dimensional one-way analysis of variance (ANOVA). For regressions, however, $m_1(X_{1i})$ is unknown, and for each specific value of a variable in $X_1$ (noted as a “cell” in Zambom and Akritas [153]), in general there will be only one observation of $Y$. To utilize the technique developed for high-dimensional one-way ANOVA, Zambom and Akritas [153] estimated $m_1(X_{1i})$ using the Nadaraya-Watson kernel regression estimator,

$$\hat{m}_1(x) = \frac{\sum_{i=1}^{n} K_{H_n}(x - X_{1i})\, Y_i}{\sum_{i=1}^{n} K_{H_n}(x - X_{1i})},$$

where $K_{H_n}(x) = |H_n|^{-1} K(H_n^{-1} x)$, with $K(\cdot)$ being a kernel with bounded support and $H_n$ a positive definite symmetric matrix of dimension $(p-1) \times (p-1)$, referred to as a bandwidth matrix. To solve the single-observation-per-“cell” problem, Zambom and Akritas [153] proposed to augment the data by artificially grouping the $Y_i$'s; the augmentation is on $X_{1i}$ due to the construction of $H_0$. With these two problems solved, variable selection is carried out by testing $H_0$ with a test statistic of the form $\sqrt{n}\,(\mathrm{MST} - \mathrm{MSE})$, where MST denotes the mean squares for treatment and MSE the mean squares for error. Clearly, this approach cannot select variables if the number of candidate variables is larger than the sample size.
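For intuition, the Nadaraya-Watson estimate is simply a locally weighted average of the responses. Below is a minimal univariate sketch with a scalar bandwidth $h$ standing in for the bandwidth matrix $H_n$; the function name is ours, not the package's.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """Nadaraya-Watson estimate of m(x0) using a Gaussian kernel with bandwidth h."""
    w = np.exp(-0.5 * ((x0 - X) / h) ** 2)  # kernel weights for each observation
    return np.sum(w * Y) / np.sum(w)        # weighted average of the responses

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=500)
Y = np.sin(X) + rng.normal(scale=0.1, size=500)
est = nadaraya_watson(1.0, X, Y, h=0.2)  # should be close to sin(1)
```

The bandwidth plays the same bias-variance role here as the matrix $H_n$ does in the multivariate estimator above.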

Based on p-values inferred from the test statistic, Zambom and Akritas proposed backward and forward approaches to select variables associated with the dependent variable $Y$, which are available in the R package NonpModelCheck [154].

**The R package**

The R function npvarselec in the package NonpModelCheck [154] implements the methods proposed by Zambom and Akritas [153]. In total, three selection algorithms are available: backward, forward, and forward2. The backward selection procedure is introduced in Zambom and Akritas [153]. It starts from a model with all the variables included. Then at each step, the algorithm eliminates the least significant variable if its p-value is larger than a cutoff value after correcting for false discovery rate (FDR) for dependent tests [8]. Correcting for multiple testing under dependence reflects the potentially joint activities among genes. The forward selection approach, on the other hand, starts from a null model. At each step, it temporarily adds the variable with the smallest p-value, taking into account the variables already in the model. This candidate variable is retained if each variable in the new model is statistically significant after controlling for FDR. The forward2 approach is a slight extension of forward: a variable is added to the model if it results in the most significant new model, even if the variable itself is not the most significant. For the purpose of comparing the different methods, we defer our demonstration of the package to Section 7.4.
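The backward procedure can be sketched schematically as follows. This is only an illustration of the control flow: it plugs in an arbitrary p-value function (npvarselec computes p-values from the ANOVA-type test) and uses the plain Benjamini-Hochberg step-up rule rather than the dependent-test correction of [8]; all names are our own.

```python
import numpy as np

def bh_threshold(pvals, alpha=0.05):
    """Benjamini-Hochberg cutoff: largest p-value declared significant (0 if none)."""
    p = np.sort(pvals)
    m = len(p)
    ok = p <= alpha * np.arange(1, m + 1) / m
    return p[ok].max() if ok.any() else 0.0

def backward_select(pval_fn, n_vars, alpha=0.05):
    """Repeatedly drop the least significant variable while it fails the FDR cutoff."""
    current = list(range(n_vars))
    while current:
        pvals = np.array([pval_fn(j, current) for j in current])
        cut = bh_threshold(pvals, alpha)
        worst = int(np.argmax(pvals))
        if pvals[worst] > cut:   # least significant variable not significant: drop it
            current.pop(worst)
        else:                    # every remaining variable survives the FDR correction
            return current
    return current

# toy p-value oracle: variables 0 and 1 are the truly associated ones (hypothetical)
pv = lambda j, current: 0.001 if j < 2 else 0.8
print(backward_select(pv, 5))  # → [0, 1]
```

The forward and forward2 variants differ only in how a candidate enters the working set, not in the FDR-based stopping rule.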

## VARIABLE SELECTION BUILT UPON REPRODUCING KERNELS

Zhang et al. [160] proposed a variable selection approach that targets variables associated with a dependent variable, where the association can take a complex form. Their approach allows a set of variables, $Z$, to be exempt from selection; those variables are kept in the final model. Another set of variables, $X$, are candidate variables, and any variables in $X$ not associated with the response variable $Y$ will be excluded. To model the association of $Y$ with $Z$ and $X$, we consider the following setting to evaluate their effects,

$$Y = Z\beta + h(X) + \varepsilon,$$

where $Y$ is a vector of length $N$, and linear associations are assumed between $Y$ and $Z$. The association of $Y$ with $X$ is described by the function $h(\cdot)$, which is possibly complex and non-linear. The assumption on the random error $\varepsilon$ is the same as before, i.e., $\varepsilon \sim N(0, \sigma^2 I)$, with $I$ being an identity matrix. All variables in $Z$ will be kept in the model, and $\beta$ describes additive linear effects of $Z$. Assume there are in total $p$ variables in $X$ and that not all of them are important; unimportant ones need to be identified and excluded. The function $h(\cdot)$ is described through a kernel function $K(\cdot,\cdot)$. By Mercer's theorem [109, 23], under some regularity conditions, the kernel function $K(\cdot,\cdot)$ specifies a unique function space $\mathcal{H}$ spanned by a particular set of orthogonal basis functions, where orthogonality is defined with respect to the $L_2$ norm. Following Mercer's theorem, any function $h(\cdot)$ in the function space $\mathcal{H}$ can be represented as a linear combination of reproducing kernels [23, 55], $h(X_i) = \sum_{l=1}^{N} K(X_i, X_l)\alpha_l = K_i'\alpha$, where $\alpha = (\alpha_l,\ l = 1, \cdots, N)'$ is a vector of unknown parameters and $K_i'$ is the $i$-th row of the kernel matrix $K$. This definition of $h(\cdot)$ introduces the potential to capture complex interactions between variables in $X$ via the kernel function $K$ and has the ability to handle a large number of variables. In Zhang et al. [160], a Gaussian kernel, $K(\rho)$, is suggested, with entry $k_{il}$ defined as

$$k_{il} = \exp\left\{-\sum_{j=1}^{p} \frac{(X_{ij} - X_{lj})^2}{\rho}\right\},$$

with $i, l = 1, \cdots, N$ and $j = 1, \cdots, p$, where $X_{ij}$ is the measure of variable $j$ on subject $i$ and $\rho$ is a regularization parameter.
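Computing such a kernel matrix is straightforward. The sketch below also accepts the inclusion indicators $\delta_j$ introduced next in the text; the function name and vectorized formulation are our own.

```python
import numpy as np

def gaussian_kernel(X, rho, delta=None):
    """Kernel matrix with entries k_il = exp(-sum_j delta_j (X_ij - X_lj)^2 / rho)."""
    if delta is None:
        delta = np.ones(X.shape[1])  # all variables included
    # pairwise weighted squared distances between subjects (rows of X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 * delta).sum(axis=2)
    return np.exp(-d2 / rho)

X = np.array([[0.0, 1.0],
              [1.0, 1.0]])
print(gaussian_kernel(X, rho=1.0))                          # off-diagonal = exp(-1)
print(gaussian_kernel(X, rho=1.0, delta=np.array([0., 1.])))  # variable 1 excluded: all ones
```

Excluding a variable (setting its $\delta_j$ to 0) removes its contribution from every pairwise distance, so it no longer influences the likelihood.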

To select variables from $X$, indicator variables $\delta = \{\delta_j,\ j = 1, \cdots, p\}$ are further included in the kernel matrix, with $\delta_j = 1$ indicating the inclusion of variable $j$ and 0 otherwise. With the involvement of $\delta_j$, we denote the kernel matrix as $K(\rho, \delta)$, with its $(i, l)$-th entry defined as

$$k_{il} = \exp\left\{-\sum_{j=1}^{p} \frac{\delta_j (X_{ij} - X_{lj})^2}{\rho}\right\}.$$

If variable $j$ is excluded, then it does not appear in any entry of the kernel matrix. Under this setting, Zhang et al. [160] developed a fully Bayesian approach to infer the parameters $\beta$, $\sigma^2$, and $\delta$. As for the regularization parameter $\rho$, different values of $\rho$ combined with different sets of selected variables can result in the same $K(\rho)$ and consequently the same likelihood. It is therefore suggested to fix $\rho$ at $\rho = \rho_0$, with $\rho_0$ being the value estimated under the full model with all variables included; the rationale is that unimportant variables do not substantially contribute to the joint effect of $X$. In addition, based on sensitivity analyses of the variable selection results with respect to different choices of $\rho_0$, Zhang et al. [160] suggest taking $\rho_0 = 1$.

Each candidate variable is selected based on its posterior probability of being included in the model. A posterior probability of a variable is calculated as the percentage of times that the variable is selected among a certain number of uncorrelated Markov Chain Monte Carlo samples. To make a decision, the concept of a scree plot is applied to the posterior probabilities. Scree plots are often used in principal component analysis to determine the number of components, where a sharp decrease in eigenvalues indicates less importance for the rest of the components. Analogously, a sharp decrease in sorted posterior probabilities indicates less importance of the remaining variables. Variables identified by this rule are treated as the most important variables.
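The scree-style decision can be rendered numerically by locating the sharpest drop in the sorted posterior probabilities. This is only a sketch of the idea (in practice the rule may be applied by inspecting the plot); the function name and cutoff logic are our own.

```python
import numpy as np

def scree_select(post_probs):
    """Keep the variables ranked before the sharpest drop in sorted posterior probabilities."""
    p = np.asarray(post_probs, dtype=float)
    order = np.argsort(p)[::-1]            # variable indices, most probable first
    drops = p[order][:-1] - p[order][1:]   # successive decreases along the sorted curve
    k = int(np.argmax(drops)) + 1          # cut just before the largest decrease
    return sorted(order[:k].tolist())

# hypothetical posterior inclusion probabilities for six candidate variables
probs = [0.95, 0.10, 0.88, 0.12, 0.91, 0.08]
print(scree_select(probs))  # → [0, 2, 4]
```

Here the drop from 0.88 to 0.12 is the "elbow", so the three high-probability variables are retained.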

This method is not available as an R package, but an R program (VarSelKernelFunctionsContinuous.R) that implements the method can be downloaded from https://www.memphis.edu/sph/contact/faculty_profiles/zhang.php.

**Dealing with a mixture of continuous and ordinal variables**

The approaches discussed so far are designed for continuous variables. In genetic and epigenetic studies of various health conditions, potentially joint activities of single nucleotide polymorphisms (SNPs) and DNA methylation in manifesting a disease status have drawn investigators' attention [161, 133, 86]. Methods with the ability to select informative SNPs and DNA methylation sites while accounting for their possibly joint effects are thus of great interest. To fill this need, extending the method by Zhang et al. [160], Terry et al. [137] proposed an approach to select variables from a mixture of ordinal and continuous variables. They standardize both the ordinal and continuous variables to diminish the scale difference between the two types of variables,

$$g_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$

where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$.

The Gaussian kernel is then applied to the standardized variables as in Zhang et al. [160] to describe complex associations between the set of candidate variables and the outcome of interest.

By standardizing, all variables are measured on the same scale, and the functionality of the Gaussian kernel is invariant to this transformation in that it still evaluates distances between subjects. Standardizing ordinal variables that take only a few distinct values, say (0, 1, 2), has some drawbacks, but based on previous studies such transformations do not seem to bias inferences [108, 81].
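The standardization step itself takes one line; below is a sketch with simulated stand-ins for methylation proportions and allele counts.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit sample variance (continuous or ordinal)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

rng = np.random.default_rng(2)
methylation = rng.uniform(0, 1, size=(100, 3))   # continuous measures in [0, 1]
genotypes = rng.integers(0, 3, size=(100, 2))    # ordinal allele counts 0/1/2
G = standardize(np.column_stack([methylation, genotypes]).astype(float))
print(np.allclose(G.var(axis=0, ddof=1), 1.0))   # every column now has unit variance
```

After this step, the two types of variables enter the Gaussian kernel on a common scale.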

For the selection of $\rho_0$, note that $Var(g_j) \to 1$ as $n \to \infty$. This leads to $Var(g_{1j} - g_{2j}) \to 2$, assuming no correlations between the predictors. This stabilization of the predictors due to standardization suggests setting $\rho_0 = 2$. The effectiveness and robustness of this choice are supported by various simulations discussed in Terry et al. [137].

To apply the method in [137], the data are standardized first, and then the same program as in Zhang et al. [160] can be applied with the regularization parameter set at 2, or the approaches in the R package NonpModelCheck can be used. For variables that are a mixture of nominal and continuous variables, it seems that no specific non-parametric variable selection approaches are available to deal with this type of data.