Abstract
To support the diagnosis of breast cancer, a classifier was developed using the Naive Bayes method on data from breast cancer patients in the state of Wisconsin, USA. The proposed method is compared with a previously published academic study that used the same dataset, and the results of both are presented. Progress is being made every day in the early diagnosis of breast cancer. With a test accuracy of about 95% (95.321%), the developed classifier is expected to contribute to early diagnosis.
1. Introduction
According to research, breast cancer affects 2.1 million women worldwide every year, including 20,000 women in our country. One in eight women is at risk of developing breast cancer during her lifetime, and one in 38 women is at risk of dying from it.
The human body consists of millions of cells, each with its own specific function. Healthy cells can divide: they divide rapidly in the early years of life, and the rate slows in adulthood. This ability is limited, however; a cell cannot divide indefinitely, and each cell undergoes only a certain number of divisions during its life. A healthy cell regulates how many times it divides and when it dies. For the body to function properly, cells must grow, divide, and produce new cells. Sometimes this process goes awry and cells continue to divide even though no new cells are needed. These cells divide and grow uncontrollably, forming an abnormal mass of tissue called a tumor. Not every tumor is cancerous, but cancerous tumors can invade the digestive, nervous, and circulatory systems and disrupt the normal functioning of the body.
Tumors can be benign or malignant. Benign tumors are not cancerous. They are often removed and rarely recur. Cells in benign tumors do not spread to other parts of the body. Most importantly, benign tumors rarely threaten life. Malignant tumors are cancerous. These tumors can compress, infiltrate, or destroy normal tissues. If cancer cells separate from the tumor they originated from, they can travel to other areas of the body through the blood or lymphatic circulation. In the places they go, they form tumor colonies and continue to grow. The spreading of cancer to other parts of the body in this way is called metastasis.
Breast cancer is a type of cancer that starts in breast cells. It is the second most common cancer worldwide, after lung cancer. It also occurs in men, but it is about 100 times more common in women. The incidence of breast cancer has increased since the 1970s, which is attributed to modern, Western lifestyles. The incidence in North America and European countries is higher than in other parts of the world.
If breast cancer is detected before spreading, the patient has a 96% chance of survival. Every year, one in 44,000 women dies from breast cancer. The best protective measure against breast cancer is early diagnosis.
The only way to determine whether a lump in the breast is benign or malignant is by biopsy and microscopic examination. However, some features of the mass can give the examining physician a rough idea of its nature.
2. Methodology
A. Bayes' Theorem
Bayes' theorem is an important topic in probability theory. It describes the relationship between the conditional and marginal probabilities of random events. It is also referred to as Bayes' rule or Bayes' law.
In Bayes' theorem, the probability of event A given that event B has occurred is described by Equation 2.1:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{2.1}$$

where P(A) and P(B) are the marginal probabilities of events A and B, respectively, and P(B) is nonzero.
Here, the prior probability introduces subjectivity into Bayes' theorem: P(A) represents the information available about event A before any data is collected. P(B|A) is the likelihood, the probability of observing B given that A has occurred. P(A|B) is the posterior probability, because it gives the probability of A after the data B has been observed.
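As a small numeric illustration of how a prior is updated into a posterior, the sketch below applies Bayes' theorem with the marginal P(B) expanded by the law of total probability. The function name and all numbers are hypothetical and are not taken from the dataset.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # Marginal probability of B via the law of total probability.
    evidence_b = (likelihood_b_given_a * prior_a
                  + likelihood_b_given_not_a * (1.0 - prior_a))
    # Bayes' theorem: posterior = likelihood * prior / evidence.
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical values chosen only for illustration.
posterior = bayes_posterior(
    prior_a=0.01,
    likelihood_b_given_a=0.9,
    likelihood_b_given_not_a=0.05,
)
print(f"P(A|B) = {posterior:.4f}")
```

Even with a high likelihood P(B|A), a small prior keeps the posterior modest, which is why the prior matters in the classifier.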
B. Naive Bayes Classifier
Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. They are not a single algorithm but share a common principle: every feature is assumed to be independent of every other feature, given the class.
Consider each feature and the class label as random variables. Given a set of features X1, X2, ..., Xn, the goal is to predict the class Y from the available data.
The objective function derived from Bayes' Theorem in 2.A is expressed by Equation 2.2.
Here, it is assumed that there are no dependencies between the features Xi (Equation 2.3).
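Equations 2.2 and 2.3 are not reproduced in this text. Assuming they follow the standard naive Bayes formulation, with Y the class and X_1, ..., X_n the features, they would read roughly as below. The third line, the resulting decision rule, is added here for illustration and is not numbered in the source.

```latex
\begin{align}
P(Y \mid X_1, \dots, X_n) &= \frac{P(X_1, \dots, X_n \mid Y)\, P(Y)}{P(X_1, \dots, X_n)} \tag{2.2} \\
P(X_1, \dots, X_n \mid Y) &= \prod_{i=1}^{n} P(X_i \mid Y) \tag{2.3} \\
\hat{y} &= \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y)
\end{align}
```

The denominator in (2.2) is the same for every class, so it can be dropped when only the most probable class is needed.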
The probability value for the classification column is calculated using Equation 2.4.
The probability value for categorical attributes is calculated using Equation 2.5.
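Equations 2.4 and 2.5 are likewise not reproduced here. Assuming they are the usual relative-frequency estimates (class count divided by total count, and value count within a class divided by class count), a minimal sketch could look as follows. The function names and the toy rows are hypothetical, and no smoothing is applied, so unseen values get probability zero.

```python
from collections import Counter, defaultdict


def fit_categorical_nb(rows, labels):
    """Estimate P(Y=c) and P(X_i=v | Y=c) by relative frequencies."""
    n_samples = len(labels)
    class_counts = Counter(labels)
    # Class prior: fraction of records that belong to each class.
    priors = {c: count / n_samples for c, count in class_counts.items()}

    # value_counts[(feature_index, class)][value] = number of matching records
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1

    def conditional(feature_index, value, c):
        # Fraction of class-c records whose feature takes this value.
        return value_counts[(feature_index, c)][value] / class_counts[c]

    return priors, conditional


# Hypothetical categorical toy data: (size, surface) with labels M / B.
rows = [("large", "rough"), ("large", "smooth"), ("small", "smooth"),
        ("small", "smooth"), ("large", "rough")]
labels = ["M", "M", "B", "B", "B"]
priors, conditional = fit_categorical_nb(rows, labels)
print(priors["M"], conditional(0, "large", "M"), conditional(1, "smooth", "B"))
```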
C. Breast Cancer Dataset
In this project, the breast cancer dataset published in the Machine Learning Repository of the University of California, Irvine (UCI) was used. It contains data for breast cancer patients diagnosed in the state of Wisconsin, USA, and is available online. Ten real-valued attributes are measured: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each of them the mean, the standard error, and the worst value are recorded, giving 30 features.
Attribute names and meanings are presented in Table 2.1.
Attribute Name | Attribute Description |
---|---|
id | Unique identifier |
diagnosis | Tumor diagnosis (M = Malignant, B = Benign) |
radius_mean | Mean of distances from center to points on the perimeter |
texture_mean | Standard deviation of gray-scale values |
perimeter_mean | Mean of the contour perimeter |
area_mean | Mean of area that the tumor covers |
smoothness_mean | Mean of local variation in radius lengths |
compactness_mean | Mean of compactness ratios based on perimeter and area |
concavity_mean | Mean severity of concave portions of the contour |
concave_points_mean | Mean number of concave portions of the contour |
symmetry_mean | Mean of symmetry |
fractal_dimension_mean | Mean for 'coastline approximation' - 1 |
radius_se | Standard error for the mean of distances from center to points on the perimeter |
texture_se | Standard error for standard deviation of gray-scale values |
perimeter_se | Standard error for the perimeter |
area_se | Standard error for the area |
smoothness_se | Standard error for local variation in radius lengths |
compactness_se | Standard error for compactness ratios based on perimeter and area |
concavity_se | Standard error for severity of concave portions of the contour |
concave_points_se | Standard error for number of concave portions of the contour |
symmetry_se | Standard error for symmetry |
fractal_dimension_se | Standard error for 'coastline approximation' - 1 |
radius_worst | "Worst" or largest mean of distances from center to points on the perimeter |
texture_worst | "Worst" or largest standard deviation of gray-scale values |
perimeter_worst | "Worst" or largest mean of the perimeter |
area_worst | "Worst" or largest mean of the area |
smoothness_worst | "Worst" or largest mean of local variation in radius lengths |
compactness_worst | "Worst" or largest mean of compactness ratios based on perimeter and area |
concavity_worst | "Worst" or largest mean of severity of concave portions of the contour |
concave_points_worst | "Worst" or largest mean of number of concave portions of the contour |
symmetry_worst | "Worst" or largest mean of symmetry |
fractal_dimension_worst | "Worst" or largest mean for 'coastline approximation' - 1 |
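The column layout in Table 2.1 follows a regular pattern: two bookkeeping columns followed by ten base measurements, each with three statistics. A short sketch that rebuilds the column names from that pattern is given below; the constant and function names are illustrative, and the column order is assumed to match the table.

```python
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Statistics recorded for every base measurement, in table order.
SUFFIXES = ["mean", "se", "worst"]


def build_column_names():
    """Return the column names: id, diagnosis, then 10 x 3 features."""
    feature_columns = [
        f"{base}_{suffix}" for suffix in SUFFIXES for base in BASE_FEATURES
    ]
    return ["id", "diagnosis"] + feature_columns


columns = build_column_names()
print(len(columns), columns[:4])
```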
Data set information:
- The dataset is multivariate.
- Feature values are real numbers.
- There are 569 records (357 benign, 212 malignant).
- There are 32 columns: the record ID, the diagnosis label, and 30 real-valued features.
- There are no missing data in the records.
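Because the features are real-valued, the per-feature probabilities have to be modeled in some way. The text does not state which variant was used; one common choice, assumed here purely for illustration, is Gaussian naive Bayes, where each feature is modeled per class by a normal distribution. The sketch below is a minimal pure-Python version on hypothetical toy measurements, not the authors' implementation.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Per class, store the prior and each feature's mean and variance."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(y), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        # Independence assumption: per-feature log densities are summed.
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy measurements (two features), not real patient data.
X = [[10.0, 15.0], [11.0, 14.0], [12.0, 16.0],
     [20.0, 25.0], [21.0, 24.0], [19.0, 26.0]]
y = ["B", "B", "B", "M", "M", "M"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [11.5, 15.5]), predict_gaussian_nb(model, [20.5, 25.0]))
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied, which matters with 30 features.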
The algorithm used for data classification is presented with the flowchart shown in Figure 2.1.
All operations on the dataset were performed on a machine with 4 GB of RAM, an AMD Phenom II X2 570 processor, and the Ubuntu 20.04 LTS operating system.
3. Results
The developed model classifies tumors as benign (B) or malignant (M) based on the probabilities obtained from Bayes' theorem. The attributes in this dataset are correlated, but the classifier ignores these correlations, in line with the naive independence assumption. 70% of the data were used as the training set and 30% as the test set.
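The 70/30 split can be sketched as a shuffled partition of record indices. The function name, the seed, and the use of rounding for the test size are assumptions; the source does not say how the split was drawn or whether it was stratified by class.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.3, seed=0):
    """Shuffle record indices and split them into train and test parts."""
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle reproducible (seed value is arbitrary).
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(569)
print(len(train_idx), len(test_idx))
```

With 569 records this gives 398 training and 171 test records, and 171 is also the total of the four confusion-matrix counts reported for the test data.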
After training the model on the training set, the accuracy obtained on the test set was 95.321%.
The model was also trained on training fractions of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9, and the accuracy at each fraction was recorded to form a learning curve. The curve is presented in Figure 3.1.
10-fold cross-validation was applied to the dataset, and the values obtained were:
- Accuracy (mean): 93.717%
- Standard Deviation: 3.403%
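The fold construction behind 10-fold cross-validation can be sketched as follows. This is a minimal version with contiguous, unshuffled folds; whether the original experiment shuffled or stratified the folds is not stated, and the helper names are hypothetical. The standard deviation shown is the population form, which is also an assumption.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional record.
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def mean_and_std(values):
    """Mean and population standard deviation of per-fold scores."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


folds = list(k_fold_indices(569, k=10))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)
```

Each record is used for validation exactly once, and the per-fold accuracies would then be summarized with `mean_and_std`.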
ROC curves were drawn for each of the 10 cross-validation folds, and the areas under the curves (AUC) were found to be: [0.997, 0.919, 0.984, 0.962, 0.989, 0.988, 0.959, 0.996, 0.992, 1].
The mean value and standard deviation of the areas under the curve:
- AUC (mean): 97.496%
- Standard Deviation: 2.388%
The AUC values plotted on the graph are presented in Figure 3.2.
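For readers unfamiliar with the area under the ROC curve, it equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name and the toy scores are hypothetical, and this is not necessarily how the reported values were computed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: label 1 = positive class, higher score = more positive.
auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```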
The confusion matrix calculated for the test data of the model is:
- True Positive (TP): 60
- False Negative (FN): 5
- False Positive (FP): 3
- True Negative (TN): 103
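The summary metrics reported for the test set follow directly from these four counts. The sketch below recomputes them, assuming the malignant class is treated as the positive class; the function name is illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = classification_metrics(tp=60, fn=5, fp=3, tn=103)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.3f}%")
```

Recall and sensitivity are the same quantity, which is why they carry identical values in the comparison table.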
Comparison of the results with a study conducted in 2019 using the same dataset and modeling method is presented in Table 3.1.
Metric | Current Paper | Other Academic Paper |
---|---|---|
Accuracy | 95.321% | 94.751% |
10-Fold Acc. | 93.677% | 93.435% |
10-Fold Std | 2.844% | 5.031% |
AUC Mean | 97.496% | 98.7% |
AUC Std | 2.388% | 1.4% |
Precision | 95.238% | 93% |
Recall | 92.307% | 98% |
F1 | 93.75% | 95% |
Sensitivity | 92.307% | 98% |
Specificity | 97.169% | 87% |
4. References
[1] NTV News - Breast cancer affects 2.1 million women worldwide every year
[2] Diagnosis of Breast Cancer using Decision Tree Data Mining Technique
[3] Ministry of Health - What is Cancer?
[4] Wikipedia - Breast Cancer
[5] Wikipedia - Bayes' theorem
[6] A New Score Fusion Approach for Breast Cancer Diagnosis
[7] Breast Cancer diagnosis using machine learning classification methods using Hadoop
DISCLAIMER
This document was translated from Turkish to English using ChatGPT. The original document was written in Turkish. The translation may contain errors and inaccuracies.