Analyzing the behavior of different classification algorithms in diabetes prediction

ABSTRACT


INTRODUCTION
Diabetes is a well-known chronic disease that places a significant burden on health systems worldwide. The pancreas releases the hormone insulin, which allows glucose to pass from the bloodstream into the body's cells [1]. Diabetes is caused by a lack of insulin and can result in serious harm such as coma, heart attack, weight loss, cardiovascular dysfunction, blindness, ulcers, and damage to the nervous system [2]. In 2020, 10% of the United States population had diabetes, according to the Centers for Disease Control and Prevention (CDC) [3]. Studies project that the proportion of people with diabetes will reach 25% by 2030 [4]. Accurate early prediction can significantly reduce the risk and severity of diabetes.
Over the last few years, a growing number of applications have used artificial intelligence to diagnose diseases such as breast cancer [5], [6], Alzheimer's disease [7], Parkinson's disease [8], coronary artery disease [9], skin disease [10], and anxiety and depression disorders [11]. The use of classification algorithms in computer-aided diagnosis systems has helped in detecting diseases at an early stage. The main goal of this study is to find which classification algorithm achieves the minimum error rate in predicting diabetic patients without human involvement. Early detection of diabetes can improve the treatment process and save the patient's life.
In this research, we evaluate the performance of different classification algorithms, namely artificial neural network, decision tree, support vector machine, random forest, and naïve Bayes, using a diabetes dataset.
In recent years, several data mining techniques have been used to predict diabetes. The use of such methods has helped in early detection, which has reduced the complications of diabetes. The literature review is summarized as follows.
[19]: dataset not specified; ensemble model combining the probabilities of different machine learning algorithms; accuracy 96%.
Paul [12] analyzed the performance of three classification algorithms, random forest, support vector machine, and logistic regression, in predicting diabetes. The results show that the random forest algorithm achieves a higher accuracy rate (84%) than support vector machine and logistic regression. Dey et al. [13] implemented a web application for diabetes prediction. The implemented model selected an artificial neural network from among several classification algorithms, including support vector machine, k-nearest neighbor, and naïve Bayes. The proposed model, an artificial neural network combined with a min-max scaler, achieves a higher accuracy rate (82%) than the other classification algorithms.
Swapna et al. [14] implemented a methodology to extract complex features from heart rate variability data using long short-term memory (LSTM) and convolutional neural network (CNN) models. These features are fed into a support vector machine for diabetes prediction. The proposed system achieved a 95% accuracy rate.
Zhu et al. [15] proposed a data mining framework to diagnose diabetes. The framework used principal component analysis to address correlation in the feature set, a k-means clustering algorithm to remove outliers from the data, and a logistic regression algorithm to classify the data. The proposed framework classified the data with a 97% accuracy rate.
Wu et al. [16] suggested a two-stage framework to diagnose diabetes. In the first stage, incorrectly clustered data are eliminated using an improved k-means algorithm; in the second stage, the resulting dataset is used as input to a logistic regression algorithm that classifies the data. The model attains a high accuracy rate of 95%.
Tripathi and Kumar [17] used several predictive analysis algorithms, namely support vector machine, linear discriminant analysis, random forest, and k-nearest neighbor. The study shows that the random forest algorithm gives a higher accuracy rate (87%) than the other algorithms. Li et al. [18] implemented a diabetes prediction model using a k-nearest neighbor algorithm. The proposed system used metaheuristic algorithms with k-means to select features. The model achieves a good accuracy rate of 91%. Husain and Khan [19] proposed a diabetes prediction model that merges the probabilities of different machine learning algorithms into an ensemble model.
In our research, the performance of different machine learning algorithms in diagnosing diabetes is evaluated. The research aims to find the algorithm that achieves the maximum accuracy in identifying diabetes at an early stage. The remainder of this study is organized as follows: Section 2 discusses the algorithms applied in this study and the structure of the dataset, Section 3 presents the results of the different classification algorithms, and Section 4 concludes the study.

RESEARCH METHOD

Data set description
In this paper, we used the diabetes dataset provided by Ukani [20] to assess the performance of different classification algorithms. The dataset contains medical information on 2000 female patients. It consists of 8 numeric features and one target class that indicates whether a patient has diabetes or not (0 or 1). In data preprocessing, we removed noise and outliers. The dataset was then divided into training and test sets, with 20% of the records held out for testing.
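The 80/20 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split`, the fixed seed, and the placeholder records are assumptions for the example and do not reproduce the preprocessing actually applied in this study.

```python
import random

def train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle the rows and hold out a fraction `test_size` for testing."""
    shuffled = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for reproducibility
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder records standing in for the 2000 patients (not real data).
records = list(range(2000))
train, test = train_test_split(records)
print(len(train), len(test))  # 1600 400
```

Shuffling before the split avoids any ordering in the source file leaking into the held-out set.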

Model design and implementation
In this paper, we analyzed the behavior of different classification algorithms in predicting diabetes. The procedure used in this study is illustrated in Figure 1. Several machine learning algorithms, namely artificial neural network, random forest, support vector machine, decision tree, and naïve Bayes, were implemented to find the algorithm that achieves the maximum accuracy rate in predicting diabetes.

Artificial neural network
An artificial neural network is one of the main algorithms in machine learning. It is designed to mimic the behavior of the human brain and consists of input, hidden, and output layers. The hidden layer is responsible for finding hidden information in the input data and translating it into a form that the output layer can use [21]. Figure 2 illustrates the architecture of the artificial neural network applied in this study.
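To make the flow from input layer to hidden layer to output layer concrete, the following sketch runs one forward pass through a network with a single hidden layer. The layer sizes, the sigmoid activation, and all weight values are hypothetical choices for illustration; they are not the architecture of Figure 2.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), one output per weight row."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input -> hidden layer -> single output unit (probability of class 1)."""
    hidden = dense(x, hidden_w, hidden_b)
    return dense(hidden, out_w, out_b)[0]

# Hypothetical example: 8 input features, 3 hidden units, 1 output unit.
x = [0.5] * 8
hidden_w = [[0.1] * 8, [-0.2] * 8, [0.05] * 8]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[0.3, -0.4, 0.2]]
out_b = [0.0]
probability = forward(x, hidden_w, hidden_b, out_w, out_b)
print(0.0 < probability < 1.0)  # True
```

In practice the weights would be learned from the training data rather than set by hand.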

Random forest
Random forest is a well-known algorithm used to solve both regression and classification problems. It aggregates the predictions of multiple decision tree classifiers, each of which works on part of the dataset instead of the entire dataset. The performance of the model generally increases as the number of classifiers increases [22]. We implemented a random forest model using Python libraries.
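The two ingredients described above, trees that each see only part of the data and a combined prediction, can be sketched as bootstrap sampling plus majority voting. The helper names and the example votes are hypothetical, and the sketch omits tree training itself; it is not the library implementation used in this study.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees part of the data."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine per-tree class predictions into one forest prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                 # placeholder row indices, not real data
sample = bootstrap_sample(rows, rng)   # training rows for one hypothetical tree

tree_votes = [1, 0, 1, 1, 0]           # hypothetical outputs of five trees
print(majority_vote(tree_votes))       # 1
```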

Support vector machine
Support vector machine is a powerful algorithm used in both regression and classification. It works by finding the hyperplane that best separates the classes of the data. The data are plotted in an m-dimensional space, where m is the number of features [23]. We implemented the support vector machine using Python libraries.
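Once a separating hyperplane has been found, classifying a point reduces to checking on which side of the hyperplane it lies. The sketch below shows this decision rule for a linear hyperplane w·x + b = 0 with m = 2; the weights `w`, the bias `b`, and the function names are hypothetical values chosen for illustration, not parameters learned in this study.

```python
def decision_function(x, w, b):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_predict(x, w, b):
    """Class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if decision_function(x, w, b) >= 0 else 0

# Hypothetical hyperplane in a 2-dimensional feature space (m = 2).
w = [1.0, -1.0]
b = -0.5
print(svm_predict([2.0, 0.5], w, b))  # 1
print(svm_predict([0.0, 1.0], w, b))  # 0
```

Training consists of choosing `w` and `b` so that the margin between the two classes is as large as possible, which this sketch does not cover.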

Decision tree
A decision tree is a supervised learning algorithm. It generates rules from the training data that are then used to predict the class. A decision tree consists of a root, branches, and leaf nodes. One of the most challenging points in building a decision tree is selecting the feature that best splits the data [24]. We implemented a decision tree algorithm using Python libraries.
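One common way to decide which split is best is to minimize the weighted Gini impurity of the two resulting groups. The sketch below applies this to a single numeric feature. The choice of Gini impurity is an assumption (the criterion used in this study is not stated), and the feature values and labels are made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:       # the largest value cannot split the data
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical values of one feature and the matching class labels.
values = [85, 90, 110, 150, 160, 170]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(values, labels))  # (110, 0.0)
```

A full tree repeats this search over every feature at every node and recurses on the two groups.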

Naïve Bayes
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' theorem. It assumes that all features are independent of one another. One of the main advantages of naïve Bayes is its ability to cope with data problems such as imbalanced datasets and missing values [25]. We implemented it using Python libraries.
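The independence assumption means the class score is a prior multiplied by one likelihood term per feature. The following sketch implements the Gaussian variant from scratch under that assumption; the function names `nb_fit` and `nb_predict`, the small variance floor, and the toy data are hypothetical and are not the library implementation used in this study.

```python
import math
from collections import defaultdict

def nb_fit(X, y):
    """Estimate the prior and the per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small floor keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def nb_predict(model, row):
    """Pick the class with the highest log posterior under feature independence."""
    def log_posterior(prior, stats):
        total = math.log(prior)
        for v, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy data with two made-up features per record (not from the study's dataset).
X = [[90, 22], [100, 25], [160, 33], [170, 35]]
y = [0, 0, 1, 1]
model = nb_fit(X, y)
print(nb_predict(model, [95, 23]), nb_predict(model, [165, 34]))  # 0 1
```

Working with log probabilities avoids numerical underflow when many per-feature terms are multiplied together.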

RESULTS AND DISCUSSION
The performance of classification algorithms is assessed using a variety of metrics such as precision, recall, F-measure, accuracy, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) [26], [27]. Five classification algorithms were evaluated in our study, namely artificial neural network, naïve Bayes, random forest, decision tree, and support vector machine. The results of the classification algorithms in terms of true positives, true negatives, false positives, and false negatives are shown in Table 2. Table 3 shows that random forest performs better than the other classifiers with regard to precision, recall, accuracy, and F-measure. On the other hand, support vector machine obtains the lowest accuracy rate, 73.50%. For a more qualitative demonstration, Figure 3 depicts the ROC curves of the classification algorithms. The ROC curve of random forest is closer to the top left corner than that of any other classifier. We also measure the performance of the classifiers using AUC to show which classifier best separates the classes. Figure 4 shows that random forest obtains a high AUC value of 0.88.
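For reference, the metrics named above can be computed directly from the confusion-matrix counts, and AUC can be computed as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. The sketch below uses made-up counts and scores purely to show the arithmetic; they are not the values of Table 2 or Figure 4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

def auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
print(metrics["accuracy"], metrics["precision"], metrics["recall"], metrics["f_measure"])
# 0.8 0.75 0.75 0.75
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.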

CONCLUSION
Diabetes is an incurable disease that affects 9% of the world's population. It can lead to serious complications if it is not properly treated. As a result, predicting diabetes at an early stage is critical to the treatment process. We have analyzed the performance of multiple machine learning models, namely artificial neural network, support vector machine, random forest, Gaussian naïve Bayes, and decision tree, using a diabetes dataset of 2000 female patients. We conclude that random forest surpasses the other machine learning models in diagnosing diabetes, with a 90.75% accuracy rate and a 0.88 AUC score.

Figure 1. Diagram of the proposed model

Figure 2. The architecture of the artificial neural network

Table 2. Confusion matrix of different classification algorithms