Early prediction of chronic heart disease with recursive feature elimination and supervised learning techniques

ABSTRACT


INTRODUCTION
Chronic heart disease (CHD), commonly associated with heart attack, blocks the blood vessels in the heart and causes chest pain and stroke [1]. A heart attack leads to complications among patients in the cardiological intensive care unit, often resulting in poor prognosis and high mortality. Early prediction of CHD with machine learning techniques is vital for saving patients' lives and reducing the mortality rate. However, predicting CHD with machine learning techniques remains a critical challenge in clinical data analysis [2]. One of the challenges in developing machine learning techniques for early CHD prediction is determining the optimal features for discriminating patients from healthy samples.
Machine learning (ML) techniques have become an important topic in computer science for assisting clinical decision-making [3]. ML techniques discover patterns in a CHD dataset and try to generalize the relationship between the predicted outcome and the independent variables, or features. Thus, ML techniques make the risk of CHD predictable at an early stage. Moreover, these techniques assist practitioners in clinical decision-making for accurate identification of CHD. The authors of [4] developed a hybrid recurrent and gated neural network-based predictive model for heart dataset analysis and reported 98.69% accuracy in predicting CHD. Although the high overall accuracy is promising, the study does not report class-wise accuracy. Furthermore, the study does not examine the effect of each CHD feature on the time needed to fit the proposed model on the training dataset.
Masih et al. [5] discussed a multilayer perceptron-based deep neural network for predicting coronary heart disease. The study suggested that pre-processing with missing value removal and feature selection improved the performance of the proposed deep neural network by 3.36%, achieving an overall accuracy of 96.50% after pre-processing. Similarly, the sensitivity and specificity of the deep neural network improved after feature selection and pre-processing compared to the original dataset.
Correspondingly, Assegie et al. [6] showed that feature selection improves the effectiveness of the extreme gradient boosting (XGBoost) model in matching patterns between the predictor and predicted features in the CHD dataset. The results show that the XGBoost model achieves 99.6% accuracy in predicting the presence or absence of CHD. CHD predictors such as chest pain type, thallium scan, and history of CHD are the most significant features for discriminating between the absence and presence of CHD. However, blood sugar, anginal pain, and cholesterol contribute less to the generalization of the XGBoost model. Although feature selection was the strength of the study, the performance of other supervised ML techniques is not investigated.
Similarly, Houssein et al. [7] applied a deep learning model that detects CHD risk factors. The results indicated that the proposed deep learning model scores 93.66% precision in detecting the risk factors of CHD, which is promising. However, the study does not discuss the effect of individual features on the effectiveness of CHD detection.
Several studies [8]-[10] have discussed the efficacy of different ML techniques such as support vector machine (SVM), decision tree (DT), K-nearest neighbor (KNN), and artificial neural network (ANN) for predicting CHD. These studies suggested that the efficacy of the ML techniques varies from study to study and with the CHD dataset employed to develop the predictive models. Additionally, they suggest that pre-processing with feature selection improves the efficiency of ML models in detecting CHD.
Likewise, Sarra et al. [11] investigated a chi-square (χ²) statistical approach for improving the performance of ML techniques on CHD detection. The study employed chi-square statistics to select the most discriminative features of CHD. After selecting the discriminative features, the ML techniques were trained on the optimal feature set. The performance of the ML models was then compared on the original and feature-selected datasets. The results indicated that the performance of the SVM model improved by 5.26%, achieving an overall accuracy of 89.47% on the feature-selected dataset.
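The chi-square scoring idea can be illustrated with a minimal sketch. It is not the procedure of [11]: it assumes binary features and binary labels, uses Pearson's statistic without continuity correction, and the function name `chi_square_2x2` and the toy data are hypothetical.

```python
def chi_square_2x2(feature, label):
    """Pearson chi-square statistic for two binary (0/1) sequences."""
    # Observed counts of the 2x2 contingency table: observed[feature][label].
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    n = len(feature)
    row_totals = [sum(observed[0]), sum(observed[1])]
    col_totals = [observed[0][0] + observed[1][0],
                  observed[0][1] + observed[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of feature and label.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: 1 = feature present / CHD present.
labels      = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 0, 0, 0, 0, 1]  # mostly follows the label
noise       = [1, 0, 1, 0, 1, 0, 1, 0]  # unrelated to the label

scores = {
    "informative": chi_square_2x2(informative, labels),
    "noise": chi_square_2x2(noise, labels),
}
```

A feature with a larger statistic is more strongly associated with the class label, so ranking features by this score and keeping the top ones is one way to select a feature subset.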
Numerous studies [12], [13] have investigated stacked ensemble learning models for predicting CHD. The stacked ensemble model is developed by employing different ML techniques such as logistic regression (LR), random forest (RF), DT, and KNN as base classifiers. The experimental results show that the stacked model outperforms the base models, with an accuracy of 88.71% for CHD prediction. Although stacking different base models improves prediction performance, these studies do not present the contribution of feature selection to the improvement of model performance.
The performance of ML techniques should be improved to obtain accurate predictions. The diagnosis of CHD largely depends on identifying the most important features with discriminative power between CHD-positive and CHD-negative observations [14]. That study introduced a hybrid model for improving the performance of ML techniques. The decision tree and random forest-based hybrid model achieves 88.7% accuracy, outperforming the individual decision tree and random forest models.
The use of chi-square (χ²) statistics to eliminate irrelevant features from the dataset helps overcome the overfitting and underfitting problems of ML techniques employed for CHD diagnosis [15]. Furthermore, the performance of ML techniques improves when different models are combined into hybrid ML models. The study of neural network performance with an ensemble approach shows that the ensemble model achieves strong predictive performance for CHD.
From the literature survey presented in this section and several other studies [16]-[19], the researchers hypothesize that an important factor in CHD prediction is the set of risks associated with it. Recursive feature elimination (RFE) improves the predictive power of ML techniques by identifying the important features of CHD. This article evaluates the importance of feature selection in improving the performance of different supervised ML techniques. The study also evaluates the predictive power of the ML techniques on the original and feature-selected datasets. The contributions of this work are: i) to explore the efficacy of recursive feature selection for predicting CHD; ii) to apply RFE and identify the most discriminative features of CHD; iii) to improve the predictive accuracy by training the ML techniques on the features selected by RFE; and iv) to discuss the supervised ML techniques for predicting the presence or absence of CHD. The rest of the paper is structured as follows: i) section 2 describes the method, including the proposed approach to CHD prediction and the research procedure; ii) section 3 explains the simulation results; and iii) section 4 provides the conclusion and future scope.

METHOD
This section discusses the steps and research procedure followed in this study. Firstly, the CHD dataset is collected from the Cleveland data repository. The Cleveland CHD dataset has previously been employed [20]-[24] to evaluate the performance of machine learning in predicting CHD. Secondly, the dataset is pre-processed: at this step, the missing values are eliminated from the dataset, and the dataset is split into a training set (80%) and a testing set (20%). Thirdly, different ML models (KNN, SVM, DT, stochastic gradient descent (SGD), adaptive boosting (ADB), Naïve Bayes (NB), multilayer perceptron (MLP), and LR) are trained on the training set and evaluated on the original dataset using accuracy, the receiver operating characteristic, and the fitting time of each model. Fourthly, the significant features of CHD are selected using the RFE feature selection technique. Finally, the models are trained on the feature-selected dataset, and their performance is measured using accuracy, the receiver operating characteristic curve, the fitting time of the models, and their predictive power in identifying CHD. Figure 1 shows the flowchart of the study.
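The pre-processing step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that eliminating missing values means dropping incomplete rows, and the helper names, the seed, and the toy records are hypothetical.

```python
import random

def drop_missing(rows):
    """Keep only the rows that contain no missing (None) entries."""
    return [row for row in rows if all(value is not None for value in row)]

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (age, cholesterol, label); None marks a missing value.
records = [(40 + i, 200 + i, i % 2) for i in range(10)]
records += [(55, None, 1), (None, 230, 0)]

clean = drop_missing(records)          # the two incomplete rows are removed
train, test = train_test_split(clean)  # 80% training, 20% testing
```

Fixing the seed keeps the split reproducible, so the same testing rows are used when comparing models trained on the original and on the feature-selected columns.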

Recursive feature elimination (RFE)
Recursive feature elimination (RFE) can improve the performance of an ML model by reducing the number of features used for training. The elimination method removes redundant features, which mislead model fitting and pattern identification during learning [25], [26]. Additionally, research articles [27], [28] reported that RFE improves the performance of gradient boosting and KNN in predicting cardiovascular disease. Thus, we aimed to further investigate the effectiveness of RFE on other ML models.
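The elimination loop can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: the importance function `mean_gap_importance` is a hypothetical stand-in for the coefficients or feature importances of a fitted model, and the toy data are invented.

```python
def mean_gap_importance(X, y):
    """Stand-in scorer: |mean(class 1) - mean(class 0)| for each feature."""
    positives = [row for row, label in zip(X, y) if label == 1]
    negatives = [row for row, label in zip(X, y) if label == 0]
    return [
        abs(sum(r[j] for r in positives) / len(positives)
            - sum(r[j] for r in negatives) / len(negatives))
        for j in range(len(X[0]))
    ]

def recursive_feature_elimination(X, y, n_keep, importance=mean_gap_importance):
    """Return the column indices of the n_keep features that survive."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        # Re-score using only the features that are still in play.
        reduced = [[row[j] for j in remaining] for row in X]
        scores = importance(reduced, y)
        # Drop the single least important feature, then repeat.
        del remaining[scores.index(min(scores))]
    return remaining

# Hypothetical toy data: only column 0 separates the two classes.
X = [[0.9, 0.5, 0.1], [0.8, 0.4, 0.2], [0.1, 0.5, 0.3], [0.2, 0.4, 0.1]]
y = [1, 1, 0, 0]
selected = recursive_feature_elimination(X, y, n_keep=1)
```

With a real model the importances change after each refit because the remaining features share the predictive signal differently, which is why the loop re-scores at every step instead of ranking the features once.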
Int J Artif Intell ISSN: 2252-8938  Early prediction of chronic heart disease with recursive feature elimination … (Komal Kumar Napa) 733

RESULTS AND DISCUSSION
This section presents the predictive power of different supervised ML techniques for detecting the absence or presence of CHD. The comparative analysis uses accuracy, the area under the receiver operating characteristic curve (ROC-AUC), and fitting time to compare the models.

The performance of ML techniques
The performance of the ML techniques is measured using the accuracy metric on the testing dataset. Each ML model is evaluated on the original and the feature-selected dataset. Some models, such as LR and SVM, appear to improve with the feature-selected dataset. However, most of the models decrease in accuracy, as shown in Table 1.
Figure 2 shows the accuracy of each ML model on the original and feature-selected datasets. The MLP achieves the highest accuracy (93.69%) in predicting the presence of CHD on the original dataset. However, the accuracy of the MLP decreases to 87.39% on the feature-selected dataset. In contrast, the SVM and LR models score higher accuracy on the feature-selected dataset than on the original dataset. The DT and KNN models achieve the highest accuracy (89.91%) among all models on the feature-selected dataset, as indicated in Figure 2.
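The comparison of accuracy on the original and feature-selected columns can be sketched as follows. The sketch assumes a 1-nearest-neighbor classifier as a simple stand-in for the models in the study, and the toy data are hypothetical and deliberately constructed so that the second column misleads the classifier; as reported above, real results can go in either direction.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def predict_1nn(X_train, y_train, X_test):
    """Label each test row with the label of its nearest training row."""
    predictions = []
    for x in X_test:
        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X_train]
        predictions.append(y_train[distances.index(min(distances))])
    return predictions

def take_columns(X, columns):
    """Keep only the listed column indices of every row."""
    return [[row[j] for j in columns] for row in X]

# Hypothetical data: column 0 is informative, column 1 is misleading noise.
X_train, y_train = [[0.0, 5.0], [1.0, 0.0]], [0, 1]
X_test, y_test = [[0.1, 0.5], [0.9, 4.5]], [0, 1]
selected = [0]  # assumed output of a feature-selection step

acc_original = accuracy(y_test, predict_1nn(X_train, y_train, X_test))
acc_selected = accuracy(
    y_test,
    predict_1nn(take_columns(X_train, selected), y_train,
                take_columns(X_test, selected)),
)
```

The same column indices must be applied to the training and testing rows so that both accuracies are measured on identical test observations.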

RFE and the fitting time of ML techniques
The fitting times of the supervised ML models on the original and feature-selected datasets demonstrate that the X model has a faster fitting time than the other models. Figure 3 shows the fitting time of each supervised model on the original and feature-selected datasets. As shown in Figure 3, LR, SVM, NB, KNN, SGD, DT, and ADB have lower fitting times on the feature-selected dataset. In contrast, the MLP has a higher fitting time on the feature-selected dataset than on the original dataset.
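Fitting time can be measured with a wall-clock timer around the fitting call. The sketch below is an assumption about how such a measurement could be made, not the procedure of the study: `centroid_fit` is a hypothetical stand-in model, and the dataset sizes are arbitrary.

```python
import time

def time_fit(fit, X, y, repeats=5):
    """Return the best-of-N wall-clock time in seconds taken by fit(X, y)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best

def centroid_fit(X, y):
    """Hypothetical stand-in model: compute one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [row for row, target in zip(X, y) if target == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]
    return centroids

# Arbitrary synthetic data: 200 rows, 10 columns, then a 4-column subset.
X_full = [[float(i + j) for j in range(10)] for i in range(200)]
y = [i % 2 for i in range(200)]
X_selected = [row[:4] for row in X_full]

t_full = time_fit(centroid_fit, X_full, y)
t_selected = time_fit(centroid_fit, X_selected, y)
```

Taking the minimum over several repeats reduces the influence of background load on the measurement. Fewer columns do not guarantee a shorter fit for every model, which is consistent with the MLP behaviour noted above.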
In addition to the fitting time shown in Figure 3, the supervised ML models are evaluated using the area under the receiver operating characteristic curve (ROC-AUC). Figure 4 shows the ROC-AUC for each supervised ML model. As revealed in Figure 4, recursive feature elimination improved the area under the curve of NB and KNN in predicting CHD. In contrast, the area under the curve of LR, SVM, DT, and SGD remained roughly the same on the original and feature-selected datasets. The ADB and MLP models have a lower area under the curve on the feature-selected dataset than on the original dataset.
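The ROC-AUC can be computed directly from its rank interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following minimal sketch uses that definition; the function name and the toy scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities of CHD for four test cases.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
auc = roc_auc(y_true, scores)  # 3 of the 4 pairs are ranked correctly
```

Unlike accuracy, this measure does not depend on a single decision threshold, so a model can keep a similar ROC-AUC while its accuracy changes between the original and feature-selected datasets.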

CONCLUSION
This study investigated the efficiency of recursive feature elimination (RFE) in selecting CHD features for predicting the presence of the disease using different ML models. The results indicate that RFE is important for selecting relevant features and reducing training time. RFE is effective at selecting the features in a training dataset that are most relevant to predicting CHD. Based on the observations above, RFE preserves feature importance, and performance on the feature-selected data is broadly similar to that on the original data. With the feature-selected dataset, we obtained a predictive accuracy of 89.91% with the KNN and DT models. This implies that the ML models can be used in clinical decision-making and CHD risk analytics. The major contribution of this study is its investigation of how RFE influences the predictive power and the fitting time of ML models on datasets with different dimensions. However, a limitation of this study is that the ML models were not tested on multiple datasets. In future work, it is recommended to test RFE on different datasets to validate the findings of this study.

Figure 1. The flowchart of the study

Figure 3. The cross-validation fitting time of the ML model on the original and feature-selected dataset

Figure 4. The ROC-AUC score of the ML model on the original and feature-selected dataset

Table 1. The performance of supervised ML techniques