Breast cancer prediction model with decision tree and adaptive boosting

Received Dec 25, 2019 Revised Oct 10, 2020 Accepted Jan 4, 2021 In this study, breast cancer prediction model is proposed with decision tree and adaptive boosting (Adboost). Furthermore, an extensive experimental evaluation of the predictive performance of the proposed model is conducted. The study is conducted on breast cancer dataset collected form the kaggle data repository. The dataset consists of 569 observations of which the 212 or 37.25% are benign or breast cancer negative and 62.74% are malignant or breast cancer positive. The class distribution shows that, the dataset is highly imbalanced and a learning algorithm such as decision tree is biased to the benign observation and results in poor performance on predicting the malignant observation. To improve the performance of the decision tree on the malignant observation, boosting algorithm namely, the adaptive boosting is employed. Finally, the predictive performance of the decision tree and adaptive boosting is analyzed. The analysis on predictive performance of the model on the kaggle breast cancer data repository shows that, adaptive boosting has 92.53% accuracy and the accuracy of decision tree is 88.80%, Overall, the adaboost algorithm performed better than decision tree.


INTRODUCTION
Breast cancer is caused by an abnormal growth and cell division in the breast tissues without control. The abnormal growth of the cells is called a tumor and results in either benign (non-cancerous) or malignant (cancerous). In recent years, breast cancer has become one of the deadliest and epidemic diseases in the world [1][2][3][4][5]. A literature review on the breast cancer shows that, breast cancer has become common in women [1] and cancer disease cases are expected to be 27 million by 2030 [2]. In the literature, different machine learning models are proposed as a solution in the reduction of death rate caused by breast cancer with computer assisted breast cancer diagnosis system.
Breast cancer is the second major cancer disease in women in the world [3]. The disease is common in developed countries in the past but is rapidly increasing in middle-income and low-income countries too. This shows that, the cancer disease cases are increasing rapidly and machine-learning algorithms are required for decision support to reduce the epidemic cases by predicting breast cancer as early as possible. The major problem in breast cancer prediction with machine learning is the imbalance between the benign and malignant observations in breast cancer dataset [4]. Breast cancer prediction involves a binary classification problem where an observation belongs to either malignant or benign class. However, the number of benign observations is always greater than the number of malignant observations in the dataset as the numbers of non-cancerous people are greater than the number of cancerous people in the real world. The imbalance of observation in the dataset creates a problem to machine learning algorithm which results in incorrect predictions on the class of interest which is the malignant (minority class). As machine learning algorithm more frequently learns the majority class, the model also predicts the benign (majority class) with better accuracy than the minority class. Hence, a standard machine-learning model makes biased prediction towards the majority class.
In this research, we have proposed breast cancer prediction model with adaptive boosting algorithm to optimize the prediction performance of decision tree algorithm due to biased prediction towards benign observation. Furthermore, this study, investigates the answers to the following research questions: 1. How to optimize predictive performance of decision tree for classification of imbalanced breast cancer? 2. What is the performance of decision tree and adaptive boosting algorithm for predicting breast cancer? 3. Which feature (s) in the breast cancer dataset has strong relationship to the class feature?

LITREATURE REVIEW
Many research works have been conducted on breast cancer classification. The research works applied different machine learning algorithms for developing predictive model for classification of breast cancer. Some of the previous research works on breast cancer classification  are discussed in this section. In [5], naïve bayes, RBF and J48 algorithms are applied to Wisconsin breast cancer dataset. The dataset consists of 699 observations and two classes (malignant and benign) and 9 features. The experimental result of the study shows that naïve bayes algorithm performed better than RBF and J48-decision tree algorithm.
In [6], deep neural network and support vector machine is applied to an online breast cancer data repository collected from broad GDAC firehouse available online at https://gdac.broadinstitute.org/. The algorithms are evaluated against their predictive accuracy and result shows that the highest accuracy achieved by the support vector machine is 69.8%. The deep neural network performed lower than the support vector machine. In [7], the authors applied support vector machine (SVM), naïve bayes (NB), decision tree (DT) and k-nearest neighbor (KNN) on Wisconsin breast cancer dataset and proposed a breast cancer prediction model with SVM, NB, DT and KNN. The data repository contains 699 observations of which 459 are benign and 241 are malignant. The comparative performance analysis on the efficiency of the prediction models shows that SVM has better accuracy than the other algorithms.
In another study [8], on breast cancer prediction model is proposed by employing three machinelearning algorithms namely, linear regression, decision tree and random forest. In the study, the authors applied these machine-learning algorithms on the Wisconsin breast cancer data repository. The predictive performance of the proposed model is analyzed and the result of analysis shows an accuracy of 84.14%. The regression algorithm is used to analyze the relationship between the attributes in the data repository. In [9], support vector machine algorithm is applied to 573 observations collected from medical repository. The authors compared the performance of linear and non-linear support vector machine. The result of performance analysis shows that linear support vector machine outperformed than the non-linear support vector machine.
In another study [10], NB and logistic regression is applied to the Wisconsin breast cancer data repository. The data repository contains 697 observations and 11 features. The authors compared the performance of the proposed model and the result of performance analysis shows that the naïve bayes algorithms outperformed than the logistic regression algorithm. In [11], breast cancer prediction model is proposed by employing the support vector machine algorithm Wisconsin data repository. The number of observations used in the dataset is 569 and the number of features is 10. The predictive performance of the proposed breast cancer prediction model is evaluated and the accuracy of the algorithm is 90.86%. The accuracy result shows that support vector machine performed well on the prediction of breast cancer.
In [12], a support vector machine and convolutional neural network (CNN) based breast cancer classification model is proposed. In the study, CNN is used for feature extraction and the support vector machine is employed for prediction of the breast cancer. In [13], KNN based breast cancer prediction model is proposed. The dataset consists of 209 observations collected manually by the authors. The predictive performance of the proposed model is accepTable with prediction accuracy of 93%. In another study [14], a decision tree algorithm is applied to Wisconsin breast cancer prognosis dataset and a breast cancer prediction model is proposed.
In [15], the authors compared the accuracy of naïve bayes algorithm with decision tree and support vector machine algorithm on breast cancer data collected from Wisconsin data repository. The dataset consists of 699 observations and among the observations, 458 are malignant and 248 are benign. The result of performance analysis shows that the support vector machine outperformed the KNN and naïve bayes algorithm having a better accuracy score on breast cancer prediction. In [16], SVM and KNN is applied to Wisconsin breast cancer and a predictive model is proposed using these algorithms. The dataset contains 699 observations and 11 features. The authors compared the performance of the algorithms and result shows the support vector machine as a better algorithm with higher accuracy than the KNN algorithm. Another study [17], employed the Wisconsin breast cancer data repository to analyze the predictive performance of KNN algorithm on prediction of breast cancer. The predictive performance of the proposed KNN based breast cancer prediction model has an average accuracy of 76%.

RESEARCH METHOD
In this research, breast cancer dataset collected from the kaggle repository is employed in training and testing the proposed model. In the implementation and experimental testing, Python programming language is employed. A statistical method that is Pearson's correlation analysis and data visualization as well as feature relationship measures are employed for identification and interpretation of breast cancer data repository to discover the relationship between the class and the features in observations. Decision tree and adaptive boosting algorithms are employed for developing the prediction model. The data repository consists of a list observations that belong to malignant (cancerous) and benign (non-cancerous) class. The percentage of the malignant and benign observations in the data repository is demonstrated in Figure 1.  Table 1. The dataset observations used in training is 75% and in testing 25% of the observations is used. The breast cancer dataset features are demonstrated in Figure 2. As demonstrated in Figure 2, the number of malignant observations is more than the benign observations.

Correlation analysis
We have employed Pearson's correlation analysis for visualization of the relationship between each feature. This helps to identify the feature that is strongly related to the class feature in the data repository. The Pearson's correlation matrix for each features of the breast cancer dataset is shown in Figure 3. As shown in Figure 3 the class is perfectly related to mean radius and mean perimeter features. This shows that breast cancer prediction is highly influenced by those features.

RESULTS AND DISCUSSION
In this section, the experimental test results on the proposed model is explained. The predictive performance of decision tree and adaptive boosting algorithm is analyzed by employing the performance metrics such as accuracy and confusion matrix along with learning curve of the algorithms.

Predictive accuracy analysis
The predictive performance of the proposed model is experimented on the training set. The predictive accuracy of the proposed model is shown in Figure 4. Moreover, the accuracy for decision tree and adaptive boosting for breast cancer classification on random test is given in Table 2.

Confusion matrix analysis
A confusion matrix is a measure the predictive performance of the proposed models in terms of the number of correct and incorrect predictions on the test set by the decision tree and adaptive boosting algorithm. The confusion matrix of the decision tree and adaptive boosting algorithm is shown in Figure 5(a) and Figure  5 The accuracy of the decision tree model is calculated as using the (1). Accuracy=(55+45)/(55+45)/(55+45+11+3)*100=87.71%, likewise, the accuracy of the adaptive boosting algorithm is calculated as, Accuracy=(59+43)/(59+43+3+9)*100=89.47%. This result shows that the adaptive boosting algorithm outperformed than the decision tree algorithm.

Learning curves
Learning curves of the proposed model shows the performance of the model on training set as demonstrated in Figure 6. As demonstrated in Figure 6, the learning curve for the proposed model's testing error is higher for the decision tree model than the adaptive boosting model. The testing error for decision tree model falls in the range 12.5% to 25%, which shows that the accuracy of the model falls in the interval 75% to 87.5%. The testing error for the adaptive boosting algorithm falls in the range 0.03% to 0.11% and this shows that the accuracy of the adaptive boosting algorithm falls in the range 89% to 97%.
(a) (b) Figure 6. The learning curve for Adaboost and decision tree, (a) Decision tree learning curve, (b) Adboost learning curve

CONCLUSION
In this research, we have proposed a breast cancer prediction model with adaptive boosting and decision tree algorithm on breast cancer dataset collected form kaggle data repository. The proposed model solves the problem of biased classification on imbalanced observation by non-ensemble algorithm through ensemble classifier namely the adaptive boosting. The predictive performance of the proposed model is evaluated by employing different performance metrics such as accuracy and confusion matrix on the test set. The result of performance analysis reveals that the adaptive boosting algorithm has better performance than the decision tree. Hence, the adaptive boosting algorithm is a better classifier for imbalanced dataset where the use of non-ensemble algorithm such as decision tree, results in biased prediction towards the majority class yielding better performance on prediction of the majority class and poor performance on the minority class.