Cluster Analysis of Breast Cancer Data Using Modified BP-RBFN



According to the World Health Organization, breast cancer has become a common cancer type among women. It has been reported that in 2018, 627,000 women died from breast cancer, which is approximately 15% of all cancer deaths among women. There are different modes of treatment for cancer, such as surgery, chemotherapy and radiation therapy [1]. Although treatments are available, early diagnosis helps toward a successful prognosis in a cancer case [2]. The treatment modalities differ for each stage of cancer. Availability of treatment and early diagnosis have increased the survival rate of cancer patients. Data-driven models for analysing the survivability of patients will help cancer management across the globe. Analysing the patterns that emerge from a group of data helps in knowledge discovery related to the diagnosis. Clustering is one such technique: it divides the data into meaningful groups and aids in analysing the patterns that emerge from them. Given a set of data points, clustering groups together those that are similar to one another. The ultimate aim of clustering is to group unlabelled data according to its inherent structure. Clustering falls under unsupervised learning and helps researchers gain valuable insights from the given data.
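To make the idea of grouping similar, unlabelled points concrete, the following is a minimal sketch of k-means, one of the clustering algorithms cited later in this chapter. It is only an illustration and not the model proposed here; the function name `kmeans`, the six two-dimensional toy points and the choice of starting centroids are assumptions made for this example and are not taken from the breast cancer dataset.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid (Euclidean distance).
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy, hypothetical data: two visibly separated groups and no labels.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The labels are recovered purely from the distances between points, which is what is meant by grouping unlabelled data.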

This chapter proposes a data-driven model for breast cancer data analysis. Initially, a dataset from the UCI machine learning repository, the Wisconsin Breast Cancer (Diagnostic) Dataset [3], is passed to the feature selection module. The dataset may contain features that do not contribute to the end analysis. In such cases, the dataset undergoes feature selection to eliminate the non-essential features and retain only the essential ones. There are many feature selection methods in the literature. In clustering, they are classified as filter methods [4-7], wrapper methods [8, 9], hybrid methods [10-12] and embedded methods [13]. Each technique differs from the others with regard to its search strategy and selection criteria. Filter methods do not have criteria for selection. They are heuristic in nature and problem-specific. Euclidean distance [7] is used for their search strategy. Wrapper methods use induction algorithms based on data mining techniques for their selection. Hybrid methods combine both wrapper and filter methods. They wrap the data using induction techniques and filter them using searching. Finally, once the essential features are selected, they are passed on to a clustering model. There are many clustering algorithms in the literature. In this chapter, a back propagation-based radial basis function neural network (BP-RBF) is used to group the unlabelled data based on the selected important features. The network is trained using six training algorithms to determine the best one. Clustering results obtained using all the training algorithms are analysed using performance measures such as MSE, time taken, best epoch and regression fit. The performance of BP-RBF under the different training algorithms is analysed using the regression plot and the MSE plot.
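The network's equations are not given at this point, so the following is only a minimal sketch of the general idea, assuming Gaussian hidden units with fixed centres and output weights fitted by gradient descent on the mean squared error. The function names, the toy data and targets, the width, the learning rate and the epoch count are all illustrative assumptions, not the configuration or the training algorithms used in this chapter.

```python
import math

def rbf_features(x, centres, width):
    """Gaussian activation of each hidden unit for one input vector."""
    return [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * width ** 2))
        for c in centres
    ]

def train_rbf(X, y, centres, width=1.0, lr=0.3, epochs=200):
    """Fit output weights and bias by gradient descent on MSE (assumed setup)."""
    H = [rbf_features(x, centres, width) for x in X]
    w = [0.0] * len(centres)
    b = 0.0
    n = len(X)
    history = []  # MSE recorded at the start of every epoch
    for _ in range(epochs):
        preds = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in H]
        errors = [p - t for p, t in zip(preds, y)]
        history.append(sum(e * e for e in errors) / n)
        # Gradient of the MSE with respect to each output weight and the bias.
        for j in range(len(w)):
            w[j] -= lr * 2 * sum(e * h[j] for e, h in zip(errors, H)) / n
        b -= lr * 2 * sum(errors) / n
    return w, b, history

def predict(x, centres, width, w, b):
    """Network output for a single input vector."""
    h = rbf_features(x, centres, width)
    return sum(wi * hi for wi, hi in zip(w, h)) + b

# Toy, hypothetical data: two groups with targets 0 and 1.
X = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]
y = [0.0, 0.0, 1.0, 1.0]
centres = [[0.0, 0.0], [3.0, 3.0]]

w, b, history = train_rbf(X, y, centres)
low = predict([0.0, 0.0], centres, 1.0, w, b)
high = predict([3.0, 3.0], centres, 1.0, w, b)
print(history[0], history[-1], low, high)
```

The recorded `history` is the kind of per-epoch MSE curve that an MSE plot would display when comparing training algorithms.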


Many works on cluster analysis of medical data are found in the literature. Dubey et al. [14] analysed data from the Wisconsin Breast Cancer Dataset using the k-means algorithm with computation measures like distance, centroid, epoch, split method, attribute and iteration to identify the metric that gives the highest accuracy when used for diagnosis. An analytical model for predicting the survival rate of breast cancer patients was developed by Shukla et al. [15] by creating cohort clusters using unsupervised data mining methods. The cohort clusters were formed using methods like SOM (Self-Organising Map) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). An MLP (multilayer perceptron) was then trained using these cohort clusters to analyse and predict the survivability of patients. Montazeri et al. [16] developed a rule-based classification method for predicting different types of breast cancer survival. Patterns found among datasets were used to determine the outcome of a disease. Popular machine learning techniques like naive Bayes, nearest neighbour and support vector machine (SVM), with 10-fold cross validation, were utilised along with the proposed model for prediction. Vivek Kumar et al. [17] analysed different classification techniques with conventional machine learning techniques. From the analysis, a prediction model with the best machine learning technique may be built for accurate prediction. Ahmed Iqbal Pritom et al. [18] proposed an efficient feature selection and classification method for breast cancer prediction. Ranker method-based feature selection was used to rank the features. Using the N best-selected features, naive Bayes, C4.5 decision tree and SVM were used to proceed with classification. Qiqige Wuniri et al. [19] proposed a generic-driven Bayesian classifier for breast cancer classification. The method was proposed to handle both discrete and continuous types of features. A genetic algorithm (GA) was used to determine optimal feature subsets with a good area under curve metric.

A GA-based feature selection was proposed by Fadzil Ahmad et al. [20] for diagnosing breast cancer. The GA-selected features were classified using a parameter-optimised artificial neural network (ANN) for further diagnosis. The network was trained with different training algorithms for performance analysis. Sirage Zenyu et al. [21] proposed a method for prediction of chronic kidney disease using ensemble methods. Info gain subset evaluator and wrapper subset evaluator were used to perform feature selection. Diagnosis was performed using k-nearest neighbour, J48, ANN, naive Bayes and SVM. A breast cancer prediction system was proposed by Alickovic et al. [22] using GA-based feature selection and a random forest-based classifier. An accuracy of 98% was obtained after feature selection. F-score method-based feature selection for breast cancer diagnosis was proposed by Akay et al. [23]. SVM was used for further classification of breast cancer data. In order to balance the training data, an undersampling technique was proposed by Liu et al. [24] for breast cancer prediction. Prediction was performed using a decision tree with greater accuracy. A particle swarm optimisation-based diagnosis system was designed by Sheikhpor et al. [25] for breast cancer prediction. A nonparametric kernel density estimation was integrated with particle swarm optimisation to achieve better results. Kahkashan Kouser et al. [26] proposed a genetic algorithm-based feature selection method for clustering high-dimensional data. Initially the traditional k-means clustering algorithm was applied on the full dimension space and compared with the GA-based method. Zhiwen Yu et al. [27] proposed a genetic-based k-means algorithm for feature selection. The parameters of the algorithm were initialised using a weighting function for better performance.

Many algorithms exist for feature selection, and all of them pick the best features out of the existing feature set. Existing methods are broadly classified as filter and wrapper methods. A third class, the hybrid method, combines filter and wrapper methods to get the best results. Many hybrid methods have been proposed in the literature; Table 12.1 lists some of the popular ones.

TABLE 12.1
Brief Survey of Popular Feature Selection Methods

| Method | Category | Dataset |
|---|---|---|
| Modified bat algorithm | | Wisconsin Diagnosis Breast Cancer Dataset |
| GA-based feature selection | | Wisconsin Breast Cancer Dataset, Wisconsin Diagnosis Breast Cancer Dataset, Wisconsin Prognosis Breast Cancer Dataset |
| Graph-based feature selection | | Wisconsin Diagnosis Breast Cancer Dataset |
| | | Wisconsin Breast Cancer Dataset |
| Wrapper-based gene selection with Markov blanket | | Colon, SRBCT, Leukemia, DLBCL, Bladder, Prostate, Tox, Blastomi |
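The difference between the filter, wrapper and hybrid families can be sketched in a few lines. This is a hedged illustration rather than any of the surveyed methods: the filter criterion (feature variance), the greedy forward search, the function names, the toy rows and the stand-in `evaluate` score are all assumptions made for the example. In practice `evaluate` would be the quality of a model or clustering built on the candidate subset.

```python
from statistics import pvariance

def filter_rank(rows, top_m):
    """Filter step: rank features by variance alone, with no learner involved."""
    n_features = len(rows[0])
    scores = [pvariance([r[j] for r in rows]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:top_m]

def wrapper_forward(candidates, evaluate, max_features):
    """Wrapper step: greedy forward selection driven by a caller-supplied score."""
    selected = []
    best = float("-inf")
    while len(selected) < max_features:
        trials = [(evaluate(selected + [j]), j) for j in candidates if j not in selected]
        if not trials:
            break
        score, j = max(trials)
        if score <= best:
            break  # no remaining feature improves the score
        selected.append(j)
        best = score
    return selected

def hybrid_select(rows, evaluate, top_m, max_features):
    """Hybrid: filter down to top_m features, then run the wrapper on the survivors."""
    return wrapper_forward(filter_rank(rows, top_m), evaluate, max_features)

# Toy, hypothetical data: four features, the last one constant.
rows = [
    [1.0, 5.0, 2.0, 7.0],
    [4.0, 5.1, 0.5, 7.0],
    [2.5, 4.9, 3.5, 7.0],
    [6.0, 5.0, 1.0, 7.0],
]

# Stand-in score (an assumption): fixed usefulness per feature minus a size penalty.
usefulness = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.0}

def evaluate(subset):
    return sum(usefulness[j] for j in subset) - 0.3 * len(subset)

chosen = hybrid_select(rows, evaluate, top_m=3, max_features=3)
print(chosen)
```

The filter step is cheap because it never consults `evaluate`, whereas the wrapper step calls it once per candidate subset; running the filter first reduces the number of subsets the wrapper has to score.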
