# An Intelligence-Based Health Biomarker Identification System Using Microarray Analysis

**Bibhuprasad Sahu, J. Chandrakanta Badajena, Amrutanshu Panigrahi, Chinmayee Rout, and Srinivas Sethi**

## Introduction

In biomedical experiments, researchers must deal with large-scale databases in which the curse of dimensionality is a serious concern. During the experimental stage we deal with microarray datasets in which thousands of genes are monitored simultaneously. Many tools have been adopted by researchers, but microarray technology is the one most preferred for genome expression monitoring. The literature clearly indicates that microarray classification techniques do a noteworthy job of identifying cancer [1-4]. Microarray datasets contain high-dimensional data but very few samples, so the existing methodologies implemented to obtain test results often fall short of the mark. We therefore need reliable mining approaches to identify the redundant features in raw databases, which calls for feature selection at the preprocessing stage. Feature selection techniques can extract an optimal set of features. Two basic families are primarily used: filter and wrapper. In the filter approach, features are selected according to different constraints, such as the behavior or content of a feature, based on the intrinsic characteristics of the training set, whereas a wrapper uses the performance of a classifier. Because it uses the target classifier, a wrapper feature selection algorithm provides better performance, but it is computationally expensive. Various wrapper approaches have been used by many researchers, such as Tabu Search (TS), Harmony Search (HS) [8], Differential Evolution (DE) [6], Particle Swarm Optimization (PSO) [7], Genetic Algorithm (GA) [5], Artificial Bee Colony (ABC) [9], and Ant Colony Optimization (ACO) [10].
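The filter/wrapper contrast above can be made concrete with a short sketch. Below, a filter step scores each gene independently with a t-like statistic, while a wrapper step scores candidate gene subsets by the leave-one-out accuracy of a target classifier (a hand-written 1-NN stands in for the classifier so the example is self-contained). The data, scoring rule, and classifier are illustrative choices, not the chapter's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic two-class "expression" matrix: 20 samples x 6 genes.
# Genes 0 and 1 are informative (shifted mean in class 1); the rest are noise.
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 6))
X[y == 1, 0] += 2.0
X[y == 1, 1] += 2.0

def filter_scores(X, y):
    """Filter approach: score each gene independently (t-like statistic)."""
    a, b = X[y == 0], X[y == 1]
    num = np.abs(a.mean(axis=0) - b.mean(axis=0))
    den = np.sqrt(a.var(axis=0) + b.var(axis=0)) + 1e-12
    return num / den

def loo_1nn_accuracy(X, y):
    """Wrapper criterion: leave-one-out 1-NN classification accuracy."""
    hits = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the held-out sample
        hits += y[d.argmin()] == y[i]
    return hits / len(y)

# Filter: keep the top-2 genes by univariate score.
top2 = np.argsort(filter_scores(X, y))[::-1][:2]

# Wrapper: compare classifier accuracy on candidate subsets.
acc_informative = loo_1nn_accuracy(X[:, top2], y)
acc_noise = loo_1nn_accuracy(X[:, [4, 5]], y)
print(sorted(top2.tolist()), acc_informative > acc_noise)
```

The filter ranking is cheap (one pass over the genes) but blind to feature interactions; the wrapper criterion retrains/evaluates the classifier per subset, which is what makes wrappers accurate yet computationally expensive.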

In this research, we suggest a binary shuffled frog-leaping algorithm with a meta-heuristic approach (PSO) for feature selection on high-dimensional microarray datasets. In the first (preprocessing) stage we extract the top 250 genes from the original data, then identify the relevant gene subsets using the binary shuffled frog-leaping algorithm combined with PSO. Twenty-five gene subsets are extracted, with sizes ranging from 10 to 250 in steps of 10. The K-nearest neighbors (KNN) classifier is used to evaluate classification accuracy, which is further checked using ANN and SVM.
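The preprocessing stage described above can be sketched as follows: Min-Max normalization (used in Step-1 of the algorithm later in the chapter), a pre-filter keeping the top 250 genes, and the grid of candidate subset sizes 10, 20, ..., 250. The variance-based ranking criterion here is an assumption for illustration; the chapter does not state which univariate filter produces the 250 pre-selected genes.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each gene (column) to [0, 1], as in the preprocessing stage."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def top_k_by_variance(X, k):
    """Illustrative pre-filter: keep the k most variable genes.
    (The chapter keeps the top 250; the ranking criterion is assumed here.)"""
    order = np.argsort(X.var(axis=0))[::-1]
    return order[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 1000))          # 30 samples x 1000 genes (toy scale)
Xn = min_max_normalize(X)
keep = top_k_by_variance(Xn, 250)

# Candidate subset sizes evaluated downstream: 10, 20, ..., 250.
subset_sizes = list(range(10, 251, 10))
print(len(keep), len(subset_sizes))
```

Each of the 25 subset sizes would then be filled by BSFLA-PSO search and scored with the KNN classifier.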

The remainder of the chapter is organized as follows. Section 7.2 provides a literature survey of the different feature selection methods with a meta-heuristic approach. Section 7.3 describes the proposed approach using binary shuffled frog-leaping algorithms with a meta-heuristic approach for feature selection. Section 7.4 presents the detailed proposed algorithms. Section 7.5 presents the detailed experimental setup along with a simulating environment and result analysis of the proposed models. Section 7.6 concludes the work with suggested future research steps.

## Existing Knowledge

Many researchers have used various feature selection models to deal with huge datasets. The most preferred techniques include TS, HS [8], DE [6], PSO [7], GA [5], ABC [9], and ACO [10]. These models are used in various application areas, but in some cases premature convergence causes them to stall at local optima. For these techniques, the balance between exploitation and exploration plays a major role in implementation, and a poor balance may not provide better accuracy. To avoid this limitation, a binary shuffled frog-leaping algorithm with a meta-heuristic approach (SFLA + META) is used for feature selection.

A modified discrete SFLA, used for the 0/1 knapsack problem, is demonstrated in [11]. The investigation covered various experimental studies. For small and medium-size knapsack problems, the discrete SFLA produces remarkably better results, and it is perhaps an alternative solution for large knapsack problems.

For a tri-objective gray environment, the author of [12] suggested a simulated SFLA for project selection scheduling. A modified gray SFLA is proposed to handle a time limit, a budget constraint, and multiple objectives. The performance of the proposed algorithm is compared with NSGA-II and a multi-objective PSO for solving NP-hard problems.

To solve feature selection problems, a binary "feature selection algorithm based on bare-bones particle swarm optimization" was presented in [13]. The quantitative results reveal that the proposed approach achieves the best average classification accuracy, of 87.5%.

A new hybrid Ant Colony Optimization Algorithm for Feature Selection (ACOFS) was presented in [14]. It is claimed, as per the result analysis, that this approach maintains an effective balance between exploration and exploitation of ants during searching and strengthens the global search capability of ant colony optimization, realizing high-quality solutions to feature selection problems. Reported experimental results show that ACOFS has a remarkable ability to generate reduced-size subsets of salient features while improving classification accuracy.

A hybrid approach based on Particle Swarm Optimization and Support Vector Machines (PSO-SVM), which performs feature subset selection jointly with kernel parameter optimization, was proposed by Huang et al. [15]. To minimize the computational cost, a data mining system was implemented using web service technology. The experimental results reveal that the proposed approach identifies features with high classification accuracy.

To improve the global search in the SFL algorithm, Dash et al. [16] proposed an opposition-based learning method. This algorithm improves local search as well as increasing diversity. The proposed algorithm was tested on ten different benchmark optimization functions.

In [17] the authors used SFLA as a meta-heuristic approach applied to many combinatorial problems. But this method is unable to reach the global optimum and falls into local optima, so a new Lévy-flight-based SFL algorithm was proposed and tested on 30 benchmark functions. The result analysis claims that it performs better.

This literature survey inspires the implementation of a binary SFLA with a meta-heuristic approach (BSFLA-META) for dimensionality reduction of a microarray dataset. We propose a novel hybrid optimization method called Binary SFLA-PSO, which introduces PSO into BSFLA by combining the fast convergence speed of PSO with the global search strategy of Binary SFLA. Frog leaping in the Binary SFLA does not impose a limit on step length, which helps the frog escape local optima.

After the shuffling process, the frog with the worst fitness of the whole population is substituted by the best solution searched by PSO, which aids the population to evolve more efficiently.

## Classification Model

FIGURE 7.1 Proposed model.

## Approaches for Feature Selection

In this section, we implement BSFLA-PSO. Before running BSFLA-PSO, a pre-feature-selection step identifies the 250 pre-selected genes.

### Shuffled Frog-Leaping Algorithm and Particle Swarm Optimization (SFLA-PSO)

The Shuffled Frog-Leaping Algorithm (SFLA) is a recent evolutionary algorithm (EA), or meta-heuristic, proposed by Eusuff and Lansey in 2003. It combines the social behavior of PSO with a genetics-based memetic algorithm. A population of NP frogs represents the initial solutions. An individual solution X_i is represented by a Z-dimensional vector X_i = (x_{i1}, x_{i2}, ..., x_{iZ}), where i ∈ {1, 2, ..., NP}.

The objective function of each frog is evaluated, and the frogs are arranged in decreasing order of fitness value. The whole population is divided into m memeplexes in a circular fashion: the first frog goes to the first memeplex, the second frog to the second memeplex, ..., the mth frog to the mth memeplex, and the (m + 1)th frog back to the first memeplex [18]. While searching for food, communication between frogs occurs in two ways, either inter-cluster or intra-cluster. Intra-cluster communication is also called local exploration, and inter-cluster communication is called global communication [19]. The frog with the best fitness is denoted x_b and the worst x_w, according to their fitness values. During the evolution process the frog with the worst fitness updates its position so that it moves towards an optimal solution.
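The sort-and-deal partitioning into memeplexes can be sketched directly. The round-robin rule means the best frog lands in the first memeplex, the second best in the second, and so on; the frog ids and fitness values below are toy data.

```python
import random

def partition_into_memeplexes(frogs, fitness, m):
    """Sort frogs by fitness (decreasing) and deal them into m memeplexes
    round-robin: 1st frog -> memeplex 0, 2nd -> memeplex 1, ...,
    (m+1)th frog -> memeplex 0 again."""
    order = sorted(range(len(frogs)), key=lambda i: fitness[i], reverse=True)
    memeplexes = [[] for _ in range(m)]
    for rank, i in enumerate(order):
        memeplexes[rank % m].append(frogs[i])
    return memeplexes

random.seed(0)
frogs = list(range(25))                       # 25 frogs; id doubles as index
fitness = [random.random() for _ in frogs]    # toy fitness values
memeplexes = partition_into_memeplexes(frogs, fitness, 5)
print([len(mx) for mx in memeplexes])         # 25 frogs over 5 memeplexes
```

With 25 frogs and 5 memeplexes (the parameter values used later in the chapter), each memeplex receives exactly 5 frogs, and the global best frog sits at the head of the first memeplex.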

The position update of the worst frog can be written as

x_w(new) = x_w(old) + r · (x_b − x_w(old)),  with the step bounded by ±Z_max.

Here r represents a random number that varies between 0 and 1, Z_max is the maximum step length allowed for the frog's move, and x_w(new) is an update with respect to the old value x_w(old). If there is no improvement, x_w is replaced by a randomly generated solution. Each memeplex performs L_max iterations before the shuffling process is carried out.

Kennedy and Eberhart [20] proposed PSO, a population-based optimization algorithm, in 1995. Each population consists of NP particles, called a swarm. The position and velocity of the ith particle are represented in Z dimensions, which determine the search space. The position is represented as x_i = (x_{i1}, x_{i2}, ..., x_{iZ}) and the velocity as v_i = (v_{i1}, v_{i2}, ..., v_{iZ}). The objective function is the main criterion for evaluating the performance of each particle. The particle's best previously visited position is represented as p_i = (p_{i1}, p_{i2}, ..., p_{iZ}), and the best previous position of the whole swarm is represented as p_g = (p_{g1}, p_{g2}, ..., p_{gZ}). Let "itr" be the iteration number. The swarm velocity and position can then be represented as:

v_i(itr + 1) = w · v_i(itr) + c_1 r_1 (p_i − x_i(itr)) + c_2 r_2 (p_g − x_i(itr))   (7.3)

x_i(itr + 1) = x_i(itr) + v_i(itr + 1)   (7.4)

where i = 1, 2, 3, ..., NP; w is the inertia weight within the range (0, 1); c_1, c_2 are constants (learning factors); and r_1, r_2 are random numbers varying in (0, 1). Equation (7.3) computes the new velocity from the previous velocity and the distances between the best positions and the current position of the particle within the population.

Equation (7.4) gives the new position when the particle moves within the population. The value of w (which controls local exploration) can be computed as:

where itr represents the current iteration number and itr_max represents the maximum number of iterations allowed.
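A single PSO step per Equations (7.3) and (7.4) can be sketched as below; the particular values of w, c_1, c_2, and the toy vectors are illustrative, not the chapter's tuned settings.

```python
import random

def pso_step(x, v, p_best, g_best, w, c1, c2):
    """Eqs (7.3)-(7.4): new velocity = inertia + cognitive pull (toward the
    particle's own best) + social pull (toward the swarm best); then the
    new position = old position + new velocity."""
    r1, r2 = random.random(), random.random()
    v_new = [w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
             for xi, vi, pb, gb in zip(x, v, p_best, g_best)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new

random.seed(7)
x, v = [0.0, 0.0], [0.1, -0.1]
p_best, g_best = [1.0, 1.0], [2.0, 2.0]
x_new, v_new = pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0)
print(x_new, v_new)
```

Note that r_1 and r_2 are drawn once per step, so both attraction terms are stochastically scaled, which is what gives PSO its fast but noisy convergence.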

### The Advantage of SFLA

SFLA has a very simple algorithmic structure because it has few controlling parameters, but it suffers from limitations: the population's searching ability decreases over time; the worst solution is updated rather than the best; and the convergence speed of the algorithm is very slow. To avoid these limitations of basic SFLA, the frog's maximum step size is improved and a search learning coefficient (S) is used to change the movement of the frog during a local search. The rule of the frog's leap is:

Stp represents the step size of the frog at instant t relative to its previous worst position, and cp represents the inertia required to balance the search process. The value of cp decreases steadily from larger to smaller.

Binary shuffled frog-leaping uses 0/1 encoding for the presence and absence of a gene, respectively. Each frog has two basic components, velocity and position. The position of a frog is updated with respect to its own position and the positions of other frogs. The updated frog position is converted between continuous and binary values, and the frogs are sorted by their fitness values. Accordingly, the whole population is divided into memeplexes, the frogs are updated through inter-cluster communication across the population, and finally the best frog is identified.
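The continuous-to-binary conversion can be sketched with a sigmoid transfer function, a common choice in binary swarm variants; the chapter does not specify its exact transfer function, so this mapping is an assumption.

```python
import math
import random

def to_binary(x_continuous, rng):
    """Map a continuous frog position to a 0/1 gene mask.
    Each component is switched on with probability sigmoid(x_i);
    the sigmoid transfer function is assumed, not quoted from the chapter."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-xi)) else 0
            for xi in x_continuous]

rng = random.Random(3)
# Strongly positive positions should almost always switch a gene on,
# strongly negative positions almost always off.
ones = sum(to_binary([6.0], rng)[0] for _ in range(1000))
zeros = sum(to_binary([-6.0], rng)[0] for _ in range(1000))
print(ones, zeros)
```

Because the mapping is probabilistic, two frogs at the same continuous position can still propose slightly different gene masks, which adds diversity to the binary search.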

### Algorithm for BSFLA-PSO

For this algorithm, we considered the parameters of population size (25), the number of memeplexes (5), intra-updates of memeplexes (8), and the number of improvisations (100).

Step-1: Select the gene expression dataset. Normalize the dataset using Min-Max normalization. Select the best 250 genes.

Step-2: Initialize variables such as the number of memeplexes (m), the population size (n), the number of intra-memeplex improvisations (nim), and the number of improvisations (Ni).

Step-3: Randomly generate NP frogs as the initial population, with each index set to either 0 or 1. Initialize the iteration counter to 0.

Step-4: Using the KNN classifier, calculate the fitness function F_i. The frog with the highest fitness goes to the first memeplex, and so on.

Step-5: Set i = 1 and start a local search within each memeplex.

Step-5.1: Set j = 1 and start on each memeplex. Find the best and worst fitness of the whole population, represented as F_gb and F_gw.

Step-5.2: Compare the frog's new and current fitness values. If the new value is greater than the current one, go to Step-8; otherwise identify the position of the worst frog.

Step-5.3: Apply wPSO with the iteration number set as 20 × L_max and find the best fitness (X_gPSO); the worst frog in the population is replaced by X_gPSO.

Step-5.4: Increment j by 1 and repeat until j = nim. Continue this until i = m.

Step-6: Check whether the convergence criteria are satisfied for the population.

Step-7: Calculate the values of parameters S and cp, as defined in the frog-leaping rule above.

Step-8: Repeat Steps 4-7 until the iteration count equals Ni.

Step-9: Identify the best frog.
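The fitness evaluation in Step-4 can be sketched as leave-one-out KNN accuracy over the genes selected by a frog's 0/1 mask. A small hand-written k-NN keeps the sketch dependency-free; the chapter's exact fitness definition (train/test protocol, value of k) is not stated, so these are assumptions.

```python
import numpy as np

def knn_fitness(mask, X, y, k=3):
    """Fitness of a frog (0/1 gene mask): leave-one-out k-NN accuracy
    computed only on the selected genes. An empty mask scores 0."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    Xs = X[:, cols]
    hits = 0
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf                          # leave sample i out
        nn = np.argsort(d)[:k]                 # k nearest neighbours
        votes = np.bincount(y[nn])
        hits += votes.argmax() == y[i]
    return hits / len(y)

rng = np.random.default_rng(5)
y = np.array([0] * 15 + [1] * 15)
X = rng.normal(size=(30, 40))
X[y == 1, :3] += 3.0                           # three informative genes

good = np.zeros(40, int); good[:3] = 1         # mask hitting informative genes
bad = np.zeros(40, int); bad[10:13] = 1        # mask hitting noise genes
print(knn_fitness(good, X, y) > knn_fitness(bad, X, y))
```

In the full algorithm this function would be called for every frog in every improvisation, which is why the pre-selection of 250 genes in Step-1 matters: it keeps each distance computation small.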