Naive Bayes Classification Method for Breast Cancer Diagnosis

A Naive Bayes classifier has been developed to support the diagnosis of breast cancer, using data from breast cancer patients in Wisconsin, USA. Its results are compared with those of an academic article that used the same dataset. The classifier reaches an accuracy of about 95% on the test set and is expected to contribute to early diagnosis.

Published: March 19, 2021

Last updated: March 30, 2024

Abstract

To contribute to the diagnosis of breast cancer, a classifier has been developed using the Naive Bayes method, based on data from breast cancer patients in the state of Wisconsin, USA. The proposed method is compared with a previously published academic article that used the same dataset, and the results of both are presented. Progress is being made every day in the early diagnosis of breast cancer. With an accuracy of about 95% on the test set, the developed classifier is expected to contribute to early diagnosis.

1. Introduction

According to research, breast cancer affects 2.1 million women worldwide every year, including 20,000 women in our country. One in eight women is at risk of developing breast cancer during her lifetime, and one in 38 women is at risk of dying from it. $^{[1]}$

The human body consists of millions of cells, each with its own specific function. Healthy cells have the ability to divide. They divide more rapidly in the early years of life, and this rate slows down in adulthood. This ability is limited, however: cells cannot divide indefinitely, and each cell has a fixed number of divisions over its lifetime. A healthy cell knows how many times it will divide and when to die if necessary. Normally, the body needs cells to grow, divide, and produce more cells in order to function smoothly. Sometimes this process goes awry, and cells continue to divide even though no new cells are needed. This uncontrolled division and growth forms an abnormal mass of tissue called a tumor, and it is the cause of cancer. Not every tumor is cancerous, but tumors can disrupt the digestive, nervous, and circulatory systems, and with them the normal functioning of the body. $^{[2]}$

Tumors can be benign or malignant. Benign tumors are not cancerous. They are often removed and rarely recur. Cells in benign tumors do not spread to other parts of the body. Most importantly, benign tumors rarely threaten life. Malignant tumors are cancerous. These tumors can compress, infiltrate, or destroy normal tissues. If cancer cells separate from the tumor they originated from, they can travel to other areas of the body through the blood or lymphatic circulation. In the places they go, they form tumor colonies and continue to grow. The spreading of cancer to other parts of the body in this way is called metastasis. $^{[3]}$

Breast cancer is a type of cancer that starts in breast cells. After lung cancer, it is the most common cancer worldwide. It also occurs in men, but it is about 100 times more common in women. The incidence of breast cancer has increased since the 1970s, which is attributed to modern, Western lifestyles. The incidence in North America and European countries is higher than in other parts of the world.

If breast cancer is detected before spreading, the patient has a 96% chance of survival. Every year, one in 44,000 women dies from breast cancer. The best protective measure against breast cancer is early diagnosis.

The only way to determine whether a lump in the breast is benign or malignant is by biopsy and microscopic examination. However, some features can give the examining physician a rough idea of the nature of the mass. $^{[4]}$

2. Methodology

A. Bayes' Theorem

Bayes' theorem is a fundamental result in probability theory. It describes the relationship between the conditional and marginal probabilities of random events. It is also referred to as Bayes' Rule or Bayes' Law.

According to Bayes' Theorem, the probability of event A given that event B has occurred is described by Equation 2.1.

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

Equation 2.1 Bayes' Theorem

Where P(A) and P(B) are the marginal probabilities of events A and B, respectively.

Here, the prior probability adds subjectivity to Bayes' theorem. For example, P(A) is the prior: the information available about event A before any data is collected. P(B|A) is the likelihood: the probability of event B occurring given that event A has occurred. P(A|B) is the posterior probability, because it gives the probability of event A after the data on event B has been collected. $^{[5]}$
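As a quick numeric illustration of Equation 2.1, the sketch below computes a posterior from a prior and two likelihoods. The probabilities are hypothetical values chosen only for the example, and the function name `posterior` is not from the original study. P(B) is expanded with the law of total probability.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) from Bayes' theorem, expanding P(B) by total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: A = "tumor is malignant", B = "a screening test is positive".
p = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(f"P(A|B) = {p:.3f}")
```

Even with a fairly reliable test, the posterior stays low when the prior is small, which is why the prior matters in Equation 2.1.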

B. Naive Bayes Classifier

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. They are not a single algorithm but a family of algorithms that share a common principle: every feature is assumed to be independent of every other feature, given the class.

Consider each feature and the class label as random variables. Given a set of features $X_1, X_2, \ldots, X_d$, the goal is to predict the class Y based on the available data.

The objective function derived from Bayes' Theorem in 2.A is expressed by Equation 2.2.

$$P(Y|X_1, X_2, \ldots, X_d) = \frac{P(X_1, X_2, \ldots, X_d|Y) \times P(Y)}{P(X_1, X_2, \ldots, X_d)}$$

Equation 2.2 Objective function

Here, it is assumed that the features $X_i$ are independent of each other given the class (Equation 2.3).

$$P(X_1, X_2, \ldots, X_d|Y_j) = P(X_1|Y_j) \times P(X_2|Y_j) \times \ldots \times P(X_d|Y_j)$$

Equation 2.3 Bayes Expansion
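The denominator of Equation 2.2 is the same for every class, so it is commonly dropped when classes are compared. Combining Equations 2.2 and 2.3 under this standard simplification gives the decision rule below. The symbol $\hat{Y}$ for the predicted class is notation introduced here for illustration, not taken from the original study.

$$\hat{Y} = \arg\max_{j} \; P(Y_j) \times \prod_{i=1}^{d} P(X_i|Y_j)$$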

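The prior probability of each class (the classification column) is calculated using Equation 2.4, where $N_k$ is the number of records in class $Y_k$ and $N$ is the total number of records.

$$P(Y_k) = \frac{N_k}{N}$$

Equation 2.4 Probability calculation formula for classification column

The probability value for categorical attributes is calculated using Equation 2.5, where $N_{ik}$ is the number of records in class $Y_k$ that take the attribute value $X_i$.

$$P(X_i|Y_k) = \frac{N_{ik}}{N_k}$$

Equation 2.5 Probability calculation formula for categorical attributes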
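The counting estimates in Equations 2.4 and 2.5 can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the study: the toy records, the attribute meanings, and the helper names `fit_counts` and `predict` are all assumptions made for the example, and no smoothing is applied to zero counts.

```python
from collections import Counter, defaultdict


def fit_counts(rows, labels):
    """Estimate class priors (Eq. 2.4) and attribute conditionals (Eq. 2.5) by counting."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {y: count / n for y, count in class_counts.items()}
    # value_counts[(attribute index, class)][value] = number of matching records
    value_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, y)][value] += 1

    def conditional(i, value, y):
        return value_counts[(i, y)][value] / class_counts[y]

    return priors, conditional


def predict(row, priors, conditional):
    """Score each class as prior times the product of conditionals; return the best class."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            score *= conditional(i, value, y)
        scores[y] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy records: (size, shape) with labels B (benign) and M (malignant).
rows = [("small", "round"), ("small", "round"), ("large", "irregular"),
        ("large", "round"), ("small", "irregular"), ("large", "irregular")]
labels = ["B", "B", "M", "M", "B", "M"]

priors, conditional = fit_counts(rows, labels)
label, scores = predict(("large", "irregular"), priors, conditional)
print(label, scores)
```

Because no "large" record has the label B in the toy data, the B score collapses to zero. Practical implementations usually add a smoothing term to avoid this.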

C. Breast Cancer Dataset

In this project, the dataset published on UCI (University of California, Irvine) for breast cancer patients diagnosed in the state of Wisconsin, USA, was used.

The dataset is available online at the UCI Machine Learning Repository. Ten real-valued characteristics are measured for each sample: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each of them, the mean, the standard error, and the "worst" (largest) value are recorded. $^{[6]}$

Attribute names and meanings are presented in Table 2.1.

Attribute Name	Attribute Description
id	Unique identifier
diagnosis	Tumor diagnosis (M = Malignant, B = Benign)
radius_mean	Mean of distances from center to points on the perimeter
texture_mean	Standard deviation of gray-scale values
perimeter_mean	Mean size of the core tumor
area_mean	Mean of area that the tumor covers
smoothness_mean	Mean of local variation in radius lengths
compactness_mean	Mean of compactness ratios based on perimeter and area
concavity_mean	Mean severity of concave portions of the contour
concave_points_mean	Mean number of concave portions of the contour
symmetry_mean	Mean of symmetry
fractal_dimension_mean	Mean for 'coastline approximation' - 1
radius_se	Standard error for the mean of distances from center to points on the perimeter
texture_se	Standard error for standard deviation of gray-scale values
perimeter_se	Standard error for perimeter
area_se	Standard error for area
smoothness_se	Standard error for local variation in radius lengths
compactness_se	Standard error for compactness ratios based on perimeter and area
concavity_se	Standard error for severity of concave portions of the contour
concave_points_se	Standard error for number of concave portions of the contour
symmetry_se	Standard error for symmetry
fractal_dimension_se	Standard error for 'coastline approximation' - 1
radius_worst	"Worst" or largest mean of distances from center to points on the perimeter
texture_worst	"Worst" or largest standard deviation of gray-scale values
perimeter_worst	"Worst" or largest mean of perimeter
area_worst	"Worst" or largest mean of area
smoothness_worst	"Worst" or largest mean of local variation in radius lengths
compactness_worst	"Worst" or largest mean of compactness ratios based on perimeter and area
concavity_worst	"Worst" or largest mean of severity of concave portions of the contour
concave_points_worst	"Worst" or largest mean of number of concave portions of the contour
symmetry_worst	"Worst" or largest mean of symmetry
fractal_dimension_worst	"Worst" or largest mean for 'coastline approximation' - 1

Table 2.1 Attribute names and meanings

Dataset information:

The dataset is multivariate.
Attributes consist of real numbers.
There are 569 records (357 benign, 212 malignant).
There are 32 columns: the id, the diagnosis, and 30 real-valued attributes.
There is no missing data in the records.
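Equation 2.5 covers categorical attributes, while the attributes in this dataset are real numbers, and the post does not state how they were handled. One common option, assumed here purely for illustration, is Gaussian Naive Bayes, which models each attribute within a class as a normal distribution. The sketch below uses hypothetical toy values for two attributes; the helper names are not from the original study.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels):
    """For each class, store its prior and the per-attribute mean and variance."""
    model = {}
    for y in set(labels):
        class_rows = [row for row, label in zip(rows, labels) if label == y]
        columns = list(zip(*class_rows))
        model[y] = {
            "prior": len(class_rows) / len(rows),
            "means": [mean(column) for column in columns],
            # A small constant avoids division by zero for constant columns.
            "vars": [pvariance(column) + 1e-9 for column in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log of the normal density with mean mu and variance var, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    def log_posterior(y):
        params = model[y]
        return math.log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["vars"])
        )
    return max(model, key=log_posterior)


# Hypothetical toy values for (radius_mean, texture_mean).
rows = [(12.0, 15.0), (11.5, 14.0), (13.0, 16.0),
        (20.0, 25.0), (19.0, 23.0), (21.0, 26.0)]
labels = ["B", "B", "B", "M", "M", "M"]

model = fit_gaussian_nb(rows, labels)
print(predict(model, (12.5, 15.5)), predict(model, (19.5, 24.0)))
```

Working with log probabilities instead of raw products avoids numeric underflow when many attributes are multiplied together, as would be the case with the 30 attributes here.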

The algorithm used for data classification is presented with the flowchart shown in Figure 2.1.

Figure 2.1 Flowchart of the Naive Bayes classification method

All operations on the dataset were performed on a machine with 4 GB of RAM, an AMD Phenom II X2 570 processor, and the Ubuntu 20.04 LTS operating system.

3. Results

The developed model diagnoses benign (B) or malignant (M) tumors based on the results obtained from Bayes' theorem. There are correlations between the attributes in this dataset, and these correlations were ignored in the classifier used. 70% of all data were separated as the training set, and 30% as the test set.
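A 70/30 split of the 569 records can be sketched as follows. The shuffling method and the seed are assumptions for the example, since the post does not say how the records were assigned to each set.

```python
import random


def train_test_split_indices(n_records, train_ratio=0.7, seed=0):
    """Shuffle record indices and cut them into a training part and a test part."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = int(n_records * train_ratio)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(569)
print(len(train_idx), len(test_idx))
```

With 569 records this yields 398 training and 171 test records. The test size agrees with the total of the confusion matrix reported in this section (60 + 5 + 3 + 103 = 171).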

After training the model with the training set, the accuracy obtained on the test set was 95.321%.

The model was also trained with training-set ratios of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9, and the accuracy obtained for each ratio was used to create a learning curve. The curve is presented in Figure 3.1.

Figure 3.1 Learning curve of the Naive Bayes classification method

10-fold cross-validation was applied to the dataset, and the values obtained were:

Accuracy (mean): 93.717%
Standard Deviation: 3.403%
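The mechanics of 10-fold cross-validation can be sketched as below: the records are cut into 10 folds, each fold serves once as the test set, and the per-fold accuracies are summarized by their mean and standard deviation. The per-fold accuracies in the example are hypothetical placeholders, not the study's values, and the use of the population standard deviation is an assumption. In practice the records would usually be shuffled before the folds are cut.

```python
from statistics import mean, pstdev


def k_fold_indices(n_records, k=10):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, test
        start += size


folds = list(k_fold_indices(569, k=10))
print([len(test) for _, test in folds])

# Hypothetical per-fold accuracies, used only to show the summary step.
fold_accuracies = [0.93, 0.95, 0.91, 0.96, 0.94, 0.92, 0.97, 0.90, 0.95, 0.94]
print(f"mean={mean(fold_accuracies):.3f} std={pstdev(fold_accuracies):.3f}")
```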

ROC curves were drawn for each of the 10 folds, and the areas under the curves (AUC) were found to be: [0.997, 0.919, 0.984, 0.962, 0.989, 0.988, 0.959, 0.996, 0.992, 1].

The mean value and standard deviation of the areas under the curve:

AUC (mean): 97.496%
Standard Deviation: 2.388%
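For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The labels and scores are hypothetical, and this pairwise method is an illustration rather than the procedure used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = malignant) and predicted malignancy scores.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.1, 0.7, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

In the toy data, only one of the 16 positive/negative pairs is ranked in the wrong order, so the AUC is 15/16.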

The AUC values plotted on the graph are presented in Figure 3.2.

Figure 3.2 AUC values of the Naive Bayes classification method

The confusion matrix calculated for the test data of the model is:

True Positive (TP): 60
False Negative (FN): 5
False Positive (FP): 3
True Negative (TN): 103
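The metrics reported in Table 3.1 for the current model can be recomputed from these four counts using the standard definitions. The variable names below are chosen for the example.

```python
tp, fn, fp, tn = 60, 5, 3, 103

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("Specificity", specificity), ("F1", f1)]:
    print(f"{name}: {value:.3%}")
```

The results agree with the "Current Paper" column of Table 3.1 up to rounding in the last digit.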

Comparison of the results with a study conducted in 2019 using the same dataset and modeling method is presented in Table 3.1. $^{[7]}$

Metric	Current Paper	Other Academic Paper
Accuracy	95.321%	94.751%
10-Fold Acc.	93.677%	93.435%
10-Fold Std	2.844%	5.031%
AUC Mean	97.496%	98.7%
AUC Std	2.388%	1.4%
Precision	95.238%	93%
Recall	92.307%	98%
F1	93.75%	95%
Sensitivity	92.307%	98%
Specificity	97.169%	87%

Table 3.1: Comparison of Data (Specificity and sensitivity values are not specifically mentioned in the paper but calculated from the confusion matrix. Bolded values imply higher reliability.)

4. References

[1] NTV News - Breast cancer affects 2.1 million women worldwide every year
[2] Diagnosis of Breast Cancer using Decision Tree Data Mining Technique
[3] Ministry of Health - What is Cancer?
[4] Wikipedia - Breast Cancer
[5] Wikipedia - Bayes' theorem
[6] A New Score Fusion Approach for Breast Cancer Diagnosis
[7] Breast Cancer diagnosis using machine learning classification methods using Hadoop

DISCLAIMER

This document was translated from Turkish to English using ChatGPT. The original document was written in Turkish. The translation may contain errors and inaccuracies.