Application Areas of Naive Bayes Algorithms

  • Identifying and Solving Real-World Problems: Naive Bayes is a fast learning classifier that can be used for real-world prediction.
  • Multiclass Analysis: This method is well suited to multiclass analysis or prediction. It predicts the probability of each class of the target variable.
  • Text Classification/Spam Filtering/Sentiment Analysis: Naive Bayes classifiers are commonly used in text classification, where they perform well on multiclass problems and often achieve a higher success rate than other popular algorithms. The algorithm is widely used in spam filtering to identify spam e-mail, and in sentiment analysis of social media data to identify positive and negative customer sentiment.
  • Recommendation System: Recommendation systems are developed using approaches such as the naive Bayes algorithm and collaborative filtering, in which machine learning methods and data mining techniques are applied to extract unseen information and determine whether an end user would accept a given resource.

Building Basic Models Using Naive Bayes

The Python library scikit-learn (sklearn) is used in the research process to build a naive Bayes model. The library includes three different forms of the naive Bayes model:

• Bernoulli: The Bernoulli (binomial) model is effective when feature vectors are binary (i.e., zeros and ones). A simple real-world example is text classification with a bag-of-words model, where the 1s and 0s mean “word occurs in the document” and “word does not occur in the document,” respectively.
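As a minimal sketch of the Bernoulli model on binary word-occurrence features, the example below trains scikit-learn's `BernoulliNB` on a tiny hand-made dataset. The vocabulary, the documents, and the spam labels are hypothetical and exist only to illustrate the 1/0 encoding described above.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: each row is a document, each column a vocabulary word.
# 1 = the word occurs in the document, 0 = it does not.
vocabulary = ["free", "offer", "meeting", "report"]
X = np.array([
    [1, 1, 0, 0],  # spam
    [1, 0, 0, 0],  # spam
    [0, 1, 0, 0],  # spam
    [0, 0, 1, 1],  # not spam
    [0, 0, 1, 0],  # not spam
    [0, 0, 0, 1],  # not spam
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = not spam

# Default settings apply Laplace smoothing (alpha=1.0).
model = BernoulliNB()
model.fit(X, y)

new_docs = np.array([
    [1, 1, 0, 0],  # contains "free" and "offer"
    [0, 0, 1, 1],  # contains "meeting" and "report"
])
predictions = model.predict(new_docs)
print(predictions)
```

With real text, the binary matrix would normally be produced by a vectorizer rather than typed by hand.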

System Workflow Architecture

Exploratory Data Analysis

Jupyter Notebook is used to work on the dataset: the necessary libraries are imported first, and then the dataset is loaded into the notebook.

Splitting the Dataset

The data is divided into training data and test data. The training set contains known outcomes, from which the program learns to make predictions on new data. The test data (or subset) is used to check the predictions made by the program. The split is done with the train_test_split function from the sklearn library in Python.

Data Wrangling

This step loads in the data, checks for cleanliness, and then trims and cleans the given dataset for analysis.
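A minimal sketch of this wrangling step with pandas is shown below. The column names and values are hypothetical stand-ins for patient records, and the data is built inline rather than loaded from a file so that the example is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records; the column names are assumptions for illustration.
raw = pd.DataFrame({
    "age":   [45, 52, np.nan, 61, 52],
    "grade": [2, 3, 1, np.nan, 3],
    "stage": ["II", "III", "I", "II", "III"],
})

# Check for cleanliness: count missing values per column and duplicated rows.
missing_per_column = raw.isna().sum()
duplicate_rows = raw.duplicated().sum()

# Trim and clean: drop exact duplicates, then drop rows with missing values.
clean = raw.drop_duplicates().dropna().reset_index(drop=True)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(clean)
```

Dropping incomplete rows is only one possible policy; imputing the missing values is a common alternative when records are scarce.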

Data Gathering

The data gathered for predicting patient situations is divided into a training set and a test set. Broadly, a 7:3 ratio is used to split the data into the training dataset and the test dataset. The model built with the naive Bayes algorithm is trained on the training set and, depending on how well it performs, is then applied to the test dataset for further predictions.


The collected records are likely to contain missing values, which make the data inconsistent. To achieve an accurate result, the data needs to be preprocessed: errors have to be removed and the data corrected. Based on the correlation among the attributes, the most significant individual attributes were observed to be TNM, stage, grade, and age, which is the strongest of all.

Building the Classification Model

For predicting breast cancer, a decision tree algorithm is appropriate because it can be expected to give good classification results, for the following reasons:

  • It is efficient at handling preprocessing errors in the data, irrelevant variables, and a combination of continuous, categorical, and discrete variables.
  • It results in a huge amount of expected mistakes; those that are proved to be improper through many tests.

Training the Data

  • In step 1, import the iris dataset, which is available ready to use in the sklearn module. Raw data usually contains several kinds of variables.
  • Next, import the required methods, such as train_test_split from sklearn, along with the NumPy module.
  • Assign the result of the load_data() method to the data_dataset variable, then apply the train_test_split method to divide the data into training and test data. Variables with the N prefix hold the feature values, and variables with the Y prefix hold the target values.
  • As explained earlier, the raw data is divided into training and test data, typically in a ratio of 67:33 or 70:30. This data is then passed to the model.
  • Finally, the training dataset is used to train the model for further analysis and to produce the best result.
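The training steps above can be sketched as follows. This is a minimal illustration, not the chapter's exact program: it uses scikit-learn's `load_iris` loader as a stand-in for the load_data() method named in the text, and the choice of `GaussianNB`, the 67:33 split, and `random_state=42` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Step 1: load the iris dataset bundled with scikit-learn.
iris = load_iris()
N = iris.data    # feature values
Y = iris.target  # target values

# Step 2: split into training and test data (67:33 here).
# random_state is fixed only to make the split repeatable.
N_train, N_test, Y_train, Y_test = train_test_split(
    N, Y, test_size=0.33, random_state=42
)

# Step 3: train the model on the training data only.
model = GaussianNB()
model.fit(N_train, Y_train)

print("training rows:", N_train.shape[0], "test rows:", N_test.shape[0])
```

Keeping the test rows out of `fit` is what allows them to serve later as an honest check of the model.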

Testing the Dataset

  • In this phase, the feature values of a new observation are stored in a NumPy array, represented by the variable 'n'. The predict method takes this NumPy array as input and returns the predicted target value as a result.
  • Here, the predicted target value is 0. To analyze the result further, compute the score as the ratio of the number of correct predictions to the total number of predictions. Finally, determine an accuracy score that compares the real values with the predicted values.
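The testing phase can be sketched in the same way. The measurements in `n` are hypothetical values chosen to resemble a typical iris setosa flower, and the model setup (a `GaussianNB` classifier, a 67:33 split, `random_state=42`) is an assumption made so that the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
N_train, N_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42
)
model = GaussianNB().fit(N_train, Y_train)

# A single new observation as a NumPy array 'n' (hypothetical measurements:
# sepal length, sepal width, petal length, petal width, all in cm).
n = np.array([[5.0, 3.4, 1.5, 0.2]])
predicted_target = model.predict(n)

# Score = number of correct predictions / total number of predictions.
Y_pred = model.predict(N_test)
manual_score = np.mean(Y_pred == Y_test)

# accuracy_score compares the real values with the predicted values.
score = accuracy_score(Y_test, Y_pred)

print("predicted target:", predicted_target[0])
print("accuracy:", score)
```

The manual ratio and `accuracy_score` compute the same quantity, which makes the definition of the score explicit.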

System Workflow

According to this diagram, when the user enters the data, errors in the data are first removed, and then the algorithm is applied to determine the TNM levels and cell grade, which in turn decide the patient's stage (Figure 11.7).

FIGURE 11.7 System workflow diagram.

Implementation of Naive Bayes Algorithm Using Anaconda Software

The algorithm is applied to the dataset to determine the stage of breast cancer in the affected patient, treating the patient attributes as the important aspects for analysis. For the given dataset, Bayes’ theorem can be applied in the following way:

$$P(Y \mid N) = \frac{P(N \mid Y)\, P(Y)}{P(N)}$$

where Y is the class variable and N is the dependent feature vector of size n, such that:

$$N = (n_1, n_2, n_3, \ldots, n_n)$$

In this study, P(Y | N) represents the probability of a class Y given the observed patient attributes N, such as TNM, stage, and age.

Gaussian naive Bayes is one type of naive Bayes method. It is used in particular when the feature values are continuous. It works on the assumption that the values of each feature follow a Gaussian distribution, which is also called the normal distribution.
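To make the Gaussian assumption concrete, the sketch below evaluates the normal density of one feature value under two classes. The class names, means, and variances are hypothetical; in practice they would be estimated from the training data for each class.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Normal density of feature value x for a class with the given mean and variance."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Hypothetical per-class statistics for one continuous feature (for example, age).
class_stats = {
    "class_a": {"mean": 45.0, "variance": 25.0},
    "class_b": {"mean": 60.0, "variance": 36.0},
}

x = 50.0
likelihoods = {
    name: gaussian_likelihood(x, stats["mean"], stats["variance"])
    for name, stats in class_stats.items()
}
print(likelihoods)
```

A Gaussian naive Bayes classifier multiplies such per-feature likelihoods together with the class prior and picks the class with the largest product.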

For example, applying the Gaussian naive Bayes classifier using sklearn:

# Load the iris dataset
import pandas as p

iris = p.read_csv("dataset.csv")

# Store the feature matrix (N) and response vector (Y).
# The column selection is missing from the source; the two lines below
# ASSUME that the last column of dataset.csv holds the class label.
N = iris.iloc[:, :-1]
Y = iris.iloc[:, -1]

# Divide N and Y into training and testing datasets
from sklearn.model_selection import train_test_split
N_train, N_test, Y_train, Y_test = train_test_split(N, Y, test_size=0.4, random_state=1)

# Train the model on the training dataset
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(N_train, Y_train)

# Make predictions on the testing data
Y_pred = gnb.predict(N_test)

# Compare the actual response values (Y_test) with the predicted response values (Y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy (in %):", metrics.accuracy_score(Y_test, Y_pred) * 100)

Evaluating the Accuracy of Software Modules

Confusion Matrix

A confusion matrix is a table commonly used to evaluate the performance of a classification model (or “classifier”) on test data for which the true values are already known. The confusion matrix itself is not complex, but the related terminology can be a bit confusing.

Basic Terms

True Positives (TP): A person who will not pay is predicted as a defaulter. These are positive values that are predicted correctly; both the actual class and the predicted class are positive.

True Negatives (TN): A person who will pay is predicted as a payer. These are negative values that are predicted correctly; both the actual class and the predicted class are negative.

False Positives (FP): A person who will pay is predicted as a defaulter. In this case the actual class is negative and the predicted class is positive.

False Negatives (FN): A person who will default is predicted as a payer. In this case the actual class is positive but the predicted class is negative.
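As a small illustration of these four terms, the sketch below counts them with scikit-learn's `confusion_matrix`. The label vectors are hypothetical, with 1 standing for a defaulter (the positive class) and 0 for a payer (the negative class).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = defaulter (positive class), 0 = payer (negative class).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels ordered [0, 1], rows are actual classes and columns are
# predicted classes, so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Passing `labels=[0, 1]` explicitly fixes the row and column order, which avoids misreading the cells when one class happens to be absent from a sample.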

Comparing the Algorithm with Prediction in the Form of Best Accuracy Result

Because many machine learning algorithms are available, it is necessary to compare the resulting models consistently. An automated test harness can be built to compare different machine learning algorithms in Python with scikit-learn, and the same harness can be reused with many different kinds of algorithms. Performance varies from model to model. Resampling methods such as cross-validation estimate how accurately each model is likely to perform on unseen data. Just as it helps to look at data from different angles when selecting it, the same idea applies to model selection: several accuracy measures are used to determine the best choice.

This section shows how to carry out such a comparison in Python with sklearn. The basis of a fair comparison of machine learning algorithms is to ensure that each algorithm is evaluated in the same way on the same data, which can be achieved by running every algorithm through the same automated test harness.
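A minimal sketch of such a test harness is given below. The three candidate models, the iris dataset, the 10 folds, and the seed are illustrative assumptions rather than the chapter's exact setup; the point is that every model is scored on identical folds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models: illustrative choices, not the chapter's exact set.
models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
}

# One shared, seeded splitter so that every model is evaluated on identical folds.
folds = KFold(n_splits=10, shuffle=True, random_state=7)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean helps show whether a difference between two models is larger than the fold-to-fold variation.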

Prediction Result by Accuracy

Logistic regression computes a value using a linear equation with independent predictors. The predicted value can lie anywhere between negative infinity and positive infinity [5], whereas the required output of the algorithm is a categorical (class) variable. The model with the higher prediction accuracy is identified by comparing the best accuracy obtained with the logistic regression model.
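The step from an unbounded linear value to a class label can be sketched with the standard logistic (sigmoid) function. The intercept, weights, and input values below are hypothetical and are not fitted to any data, and the 0.5 threshold is the usual default rather than something specified in the chapter.

```python
import math

def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two predictors; not fitted to any data.
intercept = -1.0
weights = [0.8, -0.5]
x = [2.0, 1.0]

# The linear equation can produce any real number.
linear_score = intercept + sum(w * xi for w, xi in zip(weights, x))

# The sigmoid squeezes it into (0, 1); a threshold turns it into a class.
probability = sigmoid(linear_score)
predicted_class = 1 if probability >= 0.5 else 0

print(linear_score, probability, predicted_class)
```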

Accuracy: Accuracy is defined as the proportion of the total number of predictions that are correct; in other words, how often the model correctly predicts defaulters and non-defaulters overall.

Accuracy Calculation

Accuracy is the most intuitive performance measure: it is the ratio of correctly predicted observations to the total number of observations. It is commonly assumed that a model with high accuracy is the best model. This assumption holds only for symmetric datasets, in which the numbers of false positives and false negatives are almost the same.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: Precision is the proportion of positive predictions that are actually correct. (When the model predicts default, how often is it correct?)

Precision is calculated by dividing the number of correctly predicted positive observations by the total number of predicted positive observations.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: Recall (sensitivity) is the proportion of positive observed values correctly predicted.

Recall is calculated by dividing the number of correctly predicted positive observations by the number of all observations whose actual class is positive.

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1 score is calculated as the weighted average (harmonic mean) of precision and recall. Therefore, it takes both false positives and false negatives into account. It is not as easy to understand intuitively as accuracy, but F1 is usually more useful than accuracy, especially when the class distribution is uneven. Accuracy works best when false positives and false negatives have a similar cost. If their costs differ considerably, it is better to look at both precision and recall.

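General Formula:

$$F\text{-measure} = \frac{2\,TP}{2\,TP + FP + FN}$$

F1-Score Formula:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$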
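As a minimal sketch, the four measures described in this section can be computed directly from confusion-matrix counts. The function name and the counts are hypothetical, and returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts chosen for illustration.
scores = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

For these counts, the F1 value obtained from precision and recall agrees with the count-based form 2TP / (2TP + FP + FN).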

Results of Testing Using Confusion Matrix

Results are analyzed using a confusion matrix to measure the performance of the machine learning algorithm. Here, each row represents the actual class and each column represents the predicted class. The fields in the confusion matrix represent true positives, true negatives, false positives, and false negatives (Table 11.1).

TABLE 11.1 Confusion Matrix

Breast cancer is a disease in which abnormal cells in the breast grow out of control. There are various types of breast cancer; the type is determined by which cells in the breast turn into cancer. Cancer can affect different areas of the breast, and most breast cancers start in the ducts or lobules. Breast cancer can spread outside the breast through blood vessels and lymph vessels. When breast cancer spreads to other organs of the body, it is said to have metastasized.

These cells usually form a tumor that can be seen on an X-ray or felt as a lump. The tumor is malignant (cancer) if the cells can grow into (invade) surrounding tissues or spread (metastasize) to other organs of the body. Breast cancer is most common among women, but in rare cases men can also get breast cancer.

Machine learning is a technology in the field of AI that encompasses a wide variety of statistical, probabilistic, and optimization techniques that enable computer models to learn from previous examples. It is especially well suited to medical applications, particularly those that depend on complex proteomic and genomic measurements. As a result, machine learning is commonly used in cancer diagnosis and detection, and it is now also used in cancer prognosis and prediction.

In this chapter, the naive Bayes machine learning classifier is implemented to determine the stage of a breast cancer patient and to grade the cell size. The TNM system is used to determine the stage. The performance of the naive Bayes algorithm is evaluated by calculating the accuracy, specificity, sensitivity, and F1 score, as shown in Table 11.2 [5].

TABLE 11.2 Performance of Test Data

[Table body not recovered; surviving column header: F1 Score]

  1. Ganesh N. Sharma, Rahul Dave, Jyotsana Sanadya, Piush Sharma, and K. K. Sharma, “Various Types and Management of Breast Cancer: An Overview,” Journal of Advanced Pharmaceutical Technology & Research, vol. 1(2), Apr-Jun 2010, PMCID: PMC3255438, PMID: 22247839.
  2. Osvaldo Simeone, “A Very Brief Introduction to Machine Learning With Applications to Communication Systems,” Institute of Electrical and Electronics Engineers, vol. 4(4), Nov 2018, doi: 10.1109/TCCN.2018.2881442.
  3. Yi-Sheng Sun, Zhao Zhao, Zhang-Nv Yang, Fang Xu, Hang-Jing Lu, Zhi-Yong Zhu, Wen Shi, Jianmin Jiang, Ping-Ping Yao, and Han-Ping Zhu, “Risk Factors and Preventions of Breast Cancer,” International Journal of Biological Science, vol. 13(11), Nov 2017, doi: 10.7150/ijbs.21635, PMCID: PMC5715522, PMID: 29209143.
  4. Jieun Koh and Min Jung Kim, “Introduction of a New Staging System of Breast Cancer for Radiologists: An Emphasis on the Prognostic Stage,” Korean Journal of Radiology, vol. 20(1), 27 Dec 2019, doi: 10.3348/kjr.2018.0231, PMCID: PMC6315072, PMID: 30627023.
  5. D. R. Umesh and C. R. Thilak, “Predicting Breast Cancer Survivability Using Naive Baysien and C5.0 Algorithm,” International Journal of Computer Science and Information Technology Research, vol. 3(2), 802-807, April-June 2015, ISSN 2348-1196 (print), ISSN 2348-120X (online). Available at: