IAES International Journal of Artificial Intelligence (IJ-AI)

Received Oct 16, 2021 Revised May 20, 2022 Accepted Jun 18, 2022

Dengue is a dangerous disease that can lead to death if diagnosis and treatment are inappropriate. Common symptoms include headache, muscle aches, fever, and rash. Dengue is endemic in several countries in South Asia and Southeast Asia. There are three varieties of dengue: dengue fever (DF), dengue hemorrhagic fever (DHF), and dengue shock syndrome (DSS). The disease can currently be classified using a machine learning approach that takes the dengue symptoms as input data. This study aims to classify dengue into three classes (DF, DHF, and DSS) using five classification methods: C4.5, decision tree (DT), k-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM). The dataset consists of 21 attributes, which are the dengue symptoms, and was collected from 110 patients. The methods were evaluated using cross-validation with k-fold values of 3, 5, and 10 and three parameters: precision, recall, and accuracy. The best results were obtained using SVM with k-fold 3 and 10, with precision, recall, and accuracy each reaching 99.1%.


INTRODUCTION
Nowadays, computer technology is applied in various fields, including the medical field, in the form of expert systems [1]. Expert systems have been developed over the last few years and continue to evolve because they can be integrated into clinical decision-making to predict disease and assist physicians in diagnosis. An expert system is a computer program that contains knowledge from one or more human experts about a particular disease.
Expert systems help patients learn their diagnosis more efficiently based on the symptoms they experience. Moreover, they can be used at any time, which makes them more economical. Therefore, these systems help disseminate expert knowledge to a wider range of users. Expert systems can also give human experts a more accessible and helpful way to develop and test new theories, especially in healthcare. The data used in an expert system can vary, such as images [2]-[4], signals [5], or medical record data, which include the name, age, laboratory test results, and symptoms of the patient [6]-[8]. In the medical field, several expert systems have been developed, for example for estimating drug doses [6], [9], monitoring disease progress [1], [2], [10], and detecting several types of diseases such as diabetes mellitus [11], pancreatic cancer [12], breast cancer [13], glaucoma [14], and dengue fever (DF) [15]-[17].

Dengue fever is an arboviral disease caused by infection with one of the four dengue virus serotypes. It spreads through exposure to the dengue virus (DENV). According to the World Health Organization (WHO), the disease has an estimated global burden of 50 million illnesses annually, and about 2.5 billion people worldwide live in dengue-endemic areas [18]. Dengue fever, also known as breakbone fever, can present with various symptoms, such as headache, muscle aches, fever, and a measles-like rash [16].

Regarding national statistics, dengue outbreaks were reported in Saudi Arabia, especially in the western and southern provinces around Jeddah and Mecca: the first in 2011, when 2,569 cases were reported, and the second in 2013, when 4,411 cases, including 8 deaths, were reported. Dengue has also occurred in other areas of Saudi Arabia, including Medina (2009) and Aseer and Jizan (2013) [18]. Meanwhile, the Malaysian Ministry of Health reports that dengue fever has grown rapidly since 2012. In 2015, the ministry published a report recording 107,079 cases of dengue fever with 293 deaths, compared with 43,000 cases with 92 deaths in 2013 [19]. The dengue virus is spreading rapidly and becoming more dangerous, so addressing it should be treated as urgent. Additionally, the national incidence of dengue hemorrhagic fever (DHF) in Indonesia increased from 50.8 per 100,000 population in 2015 to 78.9 per 100,000 population in 2016 [20]. The clinical diagnosis can range from symptomatic dengue fever (DF) to a more severe form known as DHF and, most fatal of all, dengue shock syndrome (DSS) [18].
Dengue varieties have been classified using computer-based systems. The input data are symptoms suffered by patients, such as fever, headache, pain behind the eyeball, joint pain, muscle pain, and other symptoms. In addition, the patient's thrombocyte and hemoglobin values are indicative of dengue disease. A classification system is needed to assess the symptoms suffered by the patient immediately, without consulting an expert or doctor. It can be implemented using several methods, such as rule-based approaches [21]-[24] or machine learning [25]. Prior studies have implemented machine learning-based classification with the following methods: naive Bayes [6], logistic regression [12], random forest (RF) [8], [12], [16], k-nearest neighbor (KNN) [26], artificial neural network (ANN) [10], [13], and support vector machine (SVM) [8], [14].
One machine learning study used several approaches, including KNN, linear SVM, naive Bayes, J48, AdaBoost, bagging, and stacking, to classify autism spectrum disorders in adults. The best results were obtained with the bagging, linear SVM, and naive Bayes methods, which reached an accuracy of 100% in cross-validation tests with k-fold 3, 5, and 10 [6]. Hepatitis was classified using SVM variants (linear SVM, polynomial SVM, and Gaussian radial basis function (RBF) SVM) and RF, with a split of 90% training data and 10% test data. The proposed SVM and RF methods predicted the data correctly and achieved a best value of 0.995 [8]. SVM kernel selection was compared on a diabetes dataset using the linear, polynomial, and RBF kernels. The linear kernel gave the best results with an accuracy of 77.34%, while the RBF kernel gave the lowest with an accuracy of 65.10% [27].
In another study, the cup contour was extracted from retinal fundus images to detect glaucoma by applying the multi-layer perceptron (MLP), KNN, naive Bayes, and SVM methods. SVM achieved the best accuracy at 94.44%, while MLP gave the lowest at 72.22% [14]. Improving the quality of mammogram images based on the region of interest is needed to obtain optimal breast cancer classification results; one approach used hybrid optimum feature selection (HOFS) as the feature selection method and ANN as the classifier. Feature selection reduced the number of features and improved the classification results, with accuracy, sensitivity, and specificity of 99.7%, 99.5%, and 100%, respectively [13]. Machine learning-based antibiotic resistance detection classified samples into two classes: resistant and sensitive. The stack ensemble technique produced the best results, with area under the curve-weighted metrics of 0.822 and 0.850 on the original and balanced datasets, respectively. The dataset attributes included sex, age, sample type, Gram stain, 44 antimicrobial substances, and antibiotic susceptibility values [7].
This study aims to classify dengue into three varieties: DF, DHF, and DSS. The input data are the symptoms caused by the disease. Classification is done by applying five machine learning methods, namely C4.5, DT, KNN, RF, and SVM, and the evaluation is carried out using cross-validation. The rest of the paper is structured as follows: section 2 describes the dataset and methods used, section 3 presents the results and discussion for each classification method based on the performance evaluation, and section 4 concludes the paper.

MATERIALS AND METHODS
This section describes the dataset and the classification methods. It also explains how the performance of each classification method was evaluated. The dataset, provided by Dirgahayu Hospital, Samarinda, Indonesia, consists of 110 cases of dengue patients. It is divided into three classes, DF, DHF, and DSS, comprising 40, 61, and 9 records, respectively. The data were collected in the form of patient code (Pcode), age, and symptoms experienced, together with each patient's thrombocyte and hemoglobin values and the diagnosis obtained from the expert. Because the symptoms experienced by each patient may vary, the expert's diagnosis of dengue type varies as well. Examples of the data collected from the patients are shown in Table 1.
This study consists of two stages: training and testing. Both stages have two main processes: pre-processing and classification. Additionally, an evaluation process is required to measure each classifier's performance. The inputs to the evaluation are the diagnosis from the expert (actual class) and from the classification method (predicted class). An overview of the dengue classification method is depicted in Figure 1.

The data in Table 1 had to be converted into numerical data to be used as input to the classification process. The patient data included 18 dengue symptoms (S): fever (S1), headache (S2), joint pain (S3), muscle soreness (S4), maculopapular skin rash (S5), petechiae (S6), bruising (S7), shock (S8), anxiety (S9), vomiting (S10), constipation (S11), diarrhea (S12), heartburn (S13), red eyes (S14), lower jaw discomfort (S15), cough (S16), sore throat (S17), and nasal cavity inflammation (S18), as well as thrombocyte (T) and hemoglobin (H) values. Therefore, 20 attributes serve as input to the classification process. The symptom data in Table 1 are not numerical, so each symptom a patient experiences is given a value of 1, and each symptom the patient does not experience is given a value of 0. The thrombocyte and hemoglobin data do not need to be pre-processed. The pre-processed data are then ready to be used in the classification process and are shown in Table 2.
Table 2. The result of pre-processing
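The binary encoding used in pre-processing can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `encode_patient`, the use of codes S1 to S18 as keys, and the sample values are all assumptions made for the example.

```python
# Hypothetical symptom codes S1..S18, following the order listed in the text.
SYMPTOMS = [f"S{i}" for i in range(1, 19)]


def encode_patient(symptoms_present, thrombocyte, hemoglobin):
    """Return a 20-value feature vector: 18 binary symptom flags, then T and H."""
    present = set(symptoms_present)
    unknown = present - set(SYMPTOMS)
    if unknown:
        raise ValueError(f"unknown symptom codes: {sorted(unknown)}")
    # 1 if the patient experiences the symptom, 0 otherwise.
    flags = [1 if code in present else 0 for code in SYMPTOMS]
    # Thrombocyte and hemoglobin values are already numeric, so they pass through.
    return flags + [thrombocyte, hemoglobin]


# Illustrative patient (values invented for the example, not taken from the dataset).
vector = encode_patient(["S1", "S2", "S6"], thrombocyte=95000, hemoglobin=13.2)
print(len(vector), vector[:6], vector[-2:])
```

Keeping the symptom order fixed means every patient maps to a vector of the same length, which the classifiers require.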

Classification
The classification process is carried out using a machine learning approach. Machine learning is an area of artificial intelligence (AI) comprising techniques that allow computers to learn from empirical data, such as sensor data or databases [7]. Five classification methods are implemented in this study: C4.5, DT, KNN, RF, and SVM. These classification methods have been successfully implemented in several previous studies [6], [8], [16]. Classification is done using a cross-validation technique with k-fold 3, 5, and 10 to distribute training and testing data [6]. Each classification method is explained in the following sub-sections.

K-nearest neighbor (KNN)
KNN is a supervised machine learning algorithm that can address classification and regression problems [24]. A new sample is assigned the class held by the majority of its nearest neighbors. The KNN method must be run several times with various K values to find the K that minimizes errors while maintaining prediction accuracy. A brute-force search over the training data is implemented using the Euclidean distance function for the nearest neighbor search, as in (1), where $x_i$ and $y_i$ are the $i$-th values of the testing and training data, respectively, and the sum runs over the $n$ values.

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$
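The brute-force neighbor search with Euclidean distance can be sketched as follows. This is an illustrative sketch rather than the code used in the study; the function names, the choice k=3, and the toy data are assumptions.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Brute-force KNN: rank every training sample by distance, then majority vote."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (invented): two binary symptom flags per patient.
train_X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 1]]
train_y = ["DF", "DHF", "DF", "DF", "DHF"]
print(knn_predict(train_X, train_y, [1, 1], k=3))
```

Running the prediction for several values of k and comparing the error on held-out data is one way to carry out the search for a suitable K that the text describes.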

Random forest (RF)
RF is a flexible machine learning method that is simple to use. Even without hyperparameter tuning, it produces a good result most of the time [8]. RF is one of the most extensively used algorithms due to its simplicity and versatility. An RF is an ensemble of decision trees (DTs), each built from a set of random variables. A DT is a flowchart-like structure [12]. For dimension $p$, the predictor variables are represented by the random vector $X = (X_1, X_2, \ldots, X_p)$, while a random variable $y$ represents the real-valued response. Figure 2 illustrates the structure of an RF.
Most RF options depend on two data objects: the out-of-bag (OOB) data and the proximities. When the training set for the current tree is drawn by sampling with replacement, about one-third of the instances are left out of the sample. These OOB data are used to obtain an unbiased estimate of the classification error and of variable importance. After each tree is built, all of the data are run down the tree, and the proximity is computed for each pair of cases.

ISSN: 2252-8938 
Dengue classification method using support vector machines and cross-validation … (Hamdani Hamdani)

If two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used to detect outliers, replace missing data, and produce low-dimensional representations of the data [8].
Figure 2. The illustration of the structure of a RF
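The out-of-bag idea can be illustrated with a small simulation of bootstrap sampling. This sketch only shows how many instances a bootstrap sample leaves out; it does not train any trees, and the number of simulated trees (200) and the seed are arbitrary assumptions.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample (with replacement) and return (in_bag, out_of_bag) indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples = 110         # dataset size stated in the paper
oob_fractions = []
for _ in range(200):    # 200 hypothetical trees
    _, oob = bootstrap_split(n_samples, rng)
    oob_fractions.append(len(oob) / n_samples)

mean_oob = sum(oob_fractions) / len(oob_fractions)
print(f"mean out-of-bag fraction: {mean_oob:.3f}")
```

Because each tree never sees its own out-of-bag instances, predicting them with that tree gives an error estimate that does not need a separate test set.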

C4.5 decision tree (DT)
Many studies employ the C4.5 decision tree, a comprehensive machine learning method. C4.5 is typically used to generate a classification tree with a hierarchical structure, in which internal nodes represent attributes and leaf nodes represent the classification outcomes. The tree gives C4.5 a visual form of classification that is effective and efficient. C4.5 is, on the other hand, sensitive to noisy data. Commonly used DT techniques include classification and regression tree (CART), chi-square automatic interaction detector (CHAID), ID3, and C4.5. In this study, C4.5 was used as one of the methods for improving classification accuracy [6].

Decision tree (DT)
A DT is a hierarchical structure that resembles a flowchart and comprises three essential elements [28]. The first is the decision nodes, which correspond to attributes. The second is the edges, or branches, which correspond to the possible attribute values. The third is the leaves, which contain items that usually belong to the same class or are very similar. This representation makes it possible to define decision rules for classifying new instances. Each path from the root to a leaf corresponds to a conjunction of attribute tests, and the tree as a whole is a disjunction of these conjunctions. Working with DTs consists mainly of two procedures, building (induction) and classification (inference) [28]:
i) Build procedure: for a given training set, a DT is typically built by starting with an empty tree and using an attribute selection measure to select a suitable test attribute for each decision node. The principle is to pick the attribute that most reduces the mixture of classes within each training subset produced by the test, making it easier to determine the classes of the objects. The process is repeated for each sub-tree until leaves are reached and their classes are assigned.
ii) Classification procedure: a new instance, for which only the attribute values are known, is classified by starting at the root of the built tree and following the path that corresponds to the observed attribute value at each inner node. This is repeated until a leaf is reached. Finally, the label attached to that leaf gives the predicted class of the instance.
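The attribute selection step of the build procedure can be sketched with information gain. The text does not say which selection measure was used, so information gain (as in ID3) is an assumption here; C4.5 is generally described as using the related gain ratio. Function names and toy data are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one attribute index."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data (invented): columns are two binary symptom flags.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["DHF", "DHF", "DF", "DF"]
best = max(range(2), key=lambda a: information_gain(rows, labels, a))
print(best)
```

In this toy data the first attribute separates the two classes completely while the second leaves each subset evenly mixed, so the first would be chosen for the root node.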

Support vector machine (SVM)
SVM is a supervised learning method for classification and regression. SVM classifies data by mapping input vectors into a high-dimensional feature space. Linear SVM seeks to maximize the margin, that is, the distance between the decision hyperplane and the closest data points [27]. This study used the linear SVM defined in (2), where { and j} is the dataset.
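The notion of a margin can be made concrete with a short sketch of a linear decision function. This is not a trained SVM and does not reproduce equation (2): the weight vector `w`, the bias `b`, and the points are hypothetical, and only the binary case is shown.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, points):
    """Smallest distance from any point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)


# Hypothetical hyperplane and toy points (not fitted to the dengue data).
w, b = [1.0, 1.0], -3.0
points = [[1.0, 1.0], [3.0, 2.0], [0.0, 1.0]]
labels = [1 if decision_value(w, b, x) >= 0 else -1 for x in points]
print(labels, geometric_margin(w, b, points))
```

Training a linear SVM amounts to choosing `w` and `b` so that this smallest distance is as large as possible while the points stay on the correct side of the hyperplane.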

Performance evaluation
The performance of each classification method was evaluated using three parameters, precision, recall, and accuracy [12], computed from the multi-class confusion matrix. Each parameter takes a value in the range of 0 to 100, and a method performs well if its values are close to 100. These parameters are defined as in [29]. A confusion matrix is a machine learning concept that stores information about a classification system's actual and predicted classifications. It has two dimensions: one indexed by the actual class of an item and the other by the class that the classifier predicts. The fundamental structure of a confusion matrix for multi-class classification problems with n classes A1, A2, …, An is shown in Figure 3. In the confusion matrix, Nij represents the number of samples that belong to class Ai but are identified as class Aj [29].
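One common way to compute the three parameters from the entries Nij is sketched below, on a 0 to 100 scale: precision for class Ai divides Nii by the sum of column i, recall divides Nii by the sum of row i, and accuracy divides the diagonal sum by the total. These are the standard per-class definitions and are assumed here; how the per-class values are averaged into the single figures reported for each classifier is not stated in the text. The matrix values are invented.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = N_ij, samples of actual class i predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = 100.0 * sum(matrix[i][i] for i in range(n)) / total
    precision, recall = [], []
    for i in range(n):
        predicted_i = sum(matrix[k][i] for k in range(n))  # column sum
        actual_i = sum(matrix[i])                          # row sum
        precision.append(100.0 * matrix[i][i] / predicted_i if predicted_i else 0.0)
        recall.append(100.0 * matrix[i][i] / actual_i if actual_i else 0.0)
    return precision, recall, accuracy


# Invented 3-class matrix (rows: actual DF, DHF, DSS); not a result from the paper.
matrix = [[8, 2, 0],
          [1, 9, 0],
          [0, 0, 5]]
precision, recall, accuracy = per_class_metrics(matrix)
print([round(p, 1) for p in precision], [round(r, 1) for r in recall], round(accuracy, 1))
```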
For cross-validation, the dengue patient dataset, including the diagnosis type, is divided into k subsets. In each round, a fraction (k-1)/k of the data is used for training and the remaining 1/k is used for testing. The process is repeated k times, and the mean of the k validation results is taken as the final performance estimate. In this study, the performance is measured using cross-validation with k-fold values of 3, 5, and 10.
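The k-fold procedure can be sketched as follows. This is an illustrative sketch under simplifying assumptions: folds are taken in index order without shuffling or stratification (neither is specified in the text), and `score_fold` is a hypothetical caller-supplied function that trains on one index list and returns a score for the other.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, score_fold):
    """Average a caller-supplied scoring function over the k folds."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / k


for k in (3, 5, 10):
    sizes = [len(test) for _, test in k_fold_indices(110, k)]
    print(k, sizes)
```

With 110 patients, k-fold 3 cannot split the data evenly, so the sketch gives the first folds one extra sample each.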

RESULTS AND DISCUSSION
Pre-processing was done by discretization, which converts all patient data into a numeric type. The data collected from dengue patients comprised 18 symptoms (S1-S18) and values for thrombocyte (T) and hemoglobin (H). Hence, a total of 20 attributes were used as input data for the classification process. A symptom is given a value of 1 if the patient experiences it and 0 otherwise. The dengue diagnosis results are divided into three classes: DF, DHF, and DSS.
In the classification process, five methods were applied: C4.5, DT, KNN, RF, and SVM. They were compared to find the best-performing method, measured using three parameters: precision, recall, and accuracy. These values were obtained from a multi-class confusion matrix using a cross-validation technique with three different k-fold values: 3, 5, and 10. A comparison of the performance of the five classification methods across the varied k-fold values is summarized in Table 3.
Table 3 shows that KNN with k-fold 3 yielded the lowest performance, with precision, recall, and accuracy values of 93.9%, 93.6%, and 93.6%, respectively. For k-fold 5 and 10, KNN achieved 94.8%, 94.5%, and 94.5% on the same three parameters. RF with k-fold 10 and SVM with k-fold 3 and 10 gave the highest performance, with precision, recall, and accuracy all at 99.1%. As seen in Table 3, the k-fold value affects the precision, recall, and accuracy values. C4.5 and KNN reached their best accuracy with k-fold 5 and 10, at 97.3% and 94.5%, respectively. Meanwhile, DT reached its best accuracy, 97.3%, with k-fold 5, while RF with k-fold 10 and SVM with k-fold 3 and 10 achieved the highest accuracy of 99.1%. Overall, k-fold 10 yields the best results for each classifier except DT, which performs best at k-fold 5. Table 3 represents the number of correctly and incorrectly classified data for each method. The detailed results of each classifier are presented in Figures 4-8.

… with a larger amount of data for further study. Therefore, it can increase the value of accuracy and is better for predicting or classifying other diseases.


Figure 4(a) depicts the confusion matrix of the C4.5 classifier for k-fold 3, while Figure 4(b) depicts the confusion matrix for k-fold 5 and 10.

Table 3. Dengue classification results using various classifiers and k-fold
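As a rough plausibility check on the accuracies quoted in the results, the percentages can be related to whole numbers of patients. The sketch assumes that accuracy is pooled over all 110 patients across the folds, which the text does not state; the correct-prediction counts are inferred from the percentages, not taken from the paper's tables.

```python
N_PATIENTS = 110  # dataset size stated in the paper


def accuracy_percent(correct, total=N_PATIENTS):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


# Correct-prediction counts inferred from the reported percentages (an assumption).
for correct in (103, 104, 107, 109):
    print(correct, accuracy_percent(correct))
```

Under that assumption, 99.1% corresponds to a single misclassified patient out of 110, and 93.6% to seven.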