Classification of multiclass imbalanced data using cost-sensitive decision tree C5.0

ABSTRACT


INTRODUCTION
In today's digital era, the amount of data stored in databases keeps increasing. However, not all of this data can be used and exploited, even though much of it contains important information as future knowledge. Based on class distribution, datasets are divided into two types, namely binary data and multi-class data [1]. Binary data has only two classes, while multi-class data has more than two classes [2]. Binary imbalanced data and multi-class imbalanced data are problems that must be addressed today [3]. The majority class is the class whose number of instances is larger than that of the other classes, whereas the minority class is often considered uninfluential in machine learning because it has fewer instances [4]. In the real world, the minority class has an important impact because it contains useful knowledge [5], such as in diagnosing types of disease [6], predicting bank credit worthiness [7], and predicting telecom customer churn [8].
In the classification process using machine learning, minority classes are often misclassified because the learner prioritizes the majority classes and ignores the minority classes [9]. Researchers have proposed many techniques to solve the imbalanced data problem [10]. In the classification process, multi-class imbalanced data is more difficult to handle than binary imbalanced data because it can contain more than one minority class [11]. Many approaches can be applied to the multi-class imbalance problem [12]. One technique that is effective and efficient for handling the imbalanced data problem is the cost-sensitive decision tree [13][14]. The technique combines two methods, namely cost-sensitive learning and the decision tree. It works by building a decision tree model that minimizes the cost of misclassification [15][16]. Thus, the misclassification of imbalanced data can be minimized by applying cost-sensitive learning techniques to the decision tree model. This motivates the development of new decision tree models. The C5.0 algorithm is a development of the C4.5 and ID3 algorithms [17]. It works using information gain, which makes the decision tree model more effective and faster in making decisions [18][19][20]. In this research, we focus on handling the multi-class imbalance problem using the cost-sensitive decision tree C5.0. Cost-sensitive learning uses the MetaCost method, which works by re-labeling and pruning the decision tree model to build a minimum-cost model. The decision tree model affects the performance and speed of the subsequent classification process. This paper compares the performance of the C5.0, C4.5, and ID3 algorithms in building the decision tree model.

RESEARCH METHOD

Dataset
This research uses datasets from the UCI Machine Learning Repository, namely Glass, Lymphography, Vehicle, Thyroid, and Wine. These five datasets were chosen because they are multi-class imbalanced datasets often used in classification research [21][22][23]. The description of these datasets is listed in Table 1. 10-fold cross-validation is used in the machine learning process because it has become a standard for validating classifier results; in each fold, the dataset is divided into two parts, namely 90% for training and 10% for testing [9].
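As a minimal sketch of the splitting scheme described above (a pure-Python illustration; a real experiment would use a stratified library implementation), each of the 10 folds trains on 90% of the instances and tests on the remaining 10%, so every instance is tested exactly once:

```python
def ten_fold_indices(n_instances, n_folds=10):
    """Yield (train_indices, test_indices) for each fold."""
    indices = list(range(n_instances))
    fold_size = n_instances // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # the last fold absorbs any remainder instances
        end = n_instances if k == n_folds - 1 else start + fold_size
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

folds = list(ten_fold_indices(100))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 90 10
```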

Preprocessing
Preprocessing analyzes and cleans the dataset so that it produces a new dataset that is suitable for the next process. The problem is handled in several stages, including changing/cleaning data, reducing data, and transforming data [24]. The next step is min-max normalization, which rescales the attribute values of the dataset to the range 0-1 so that the information contained in the dataset is weighted on a common scale before the attribute selection process. Min-max normalization is stated in (1):

y = (x_i - min(x)) / (max(x) - min(x))   (1)

where y is the rescaled value (0-1), x_i is the data value to be changed, min(x) is the minimum value of attribute x, and max(x) is the maximum value of attribute x.
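The normalization in (1) can be implemented directly; the example values below are chosen here purely for illustration:

```python
# Min-max normalization: y = (x_i - min(x)) / (max(x) - min(x)),
# rescaling each attribute to the range [0, 1].

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: map everything to 0
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_normalize([2.0, 4.0, 6.0, 10.0]))  # [0.0, 0.25, 0.5, 1.0]
```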

Particle swarm optimization
Particle swarm optimization (PSO) is an optimization method that is good at improving classifier performance by selecting attributes based on the information weight values in the dataset. Informative and relevant attributes receive weights closer to 1, while unneeded attributes receive weights closer to 0. Two parameters determine the information weight value, namely the velocity value and the position value [25]. The steps of selecting attributes using particle swarm optimization are as follows: Step 1. Input the dataset.
Step 2. The velocity and initial position of the attribute are determined randomly.
Step 3. Initial velocity at position 0 towards optimal point 1.
Step 4. Calculating the velocity of attribute i in dimension d using (2):

v_id(t+1) = w * v_id(t) + c1 * r1 * (pbest_id - x_id(t)) + c2 * r2 * (gbest_d - x_id(t))   (2)

Step 5. Calculating the position of attribute i in dimension d using (3):

x_id(t+1) = x_id(t) + v_id(t+1)   (3)

Step 6. Updating Pbest and Gbest to get the most optimal velocity and position for each attribute.
Step 7. After the update process is complete, attributes whose position and velocity do not change, are worth 0 (zero), or have a small value are removed.

where v_id is the velocity, x_id is the position, w is the inertia weight, c1 and c2 are acceleration coefficients, r1 and r2 are random numbers in [0, 1], pbest_id is the best position found so far by particle i, and gbest_d is the best position found by the swarm.
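The steps above can be sketched as a toy PSO attribute-weighting routine. Each particle dimension holds one weight in [0, 1] per attribute; the fitness function below is a hypothetical stand-in for the cross-validated classifier accuracy a real run would use, and the parameter values are common defaults, not the paper's settings:

```python
import random

W, C1, C2 = 0.7, 1.5, 1.5          # inertia weight and acceleration coefficients

def pso_select(fitness, n_attrs, n_particles=10, iters=60, threshold=0.5):
    random.seed(0)                  # fixed seed for a reproducible illustration
    pos = [[random.random() for _ in range(n_attrs)] for _ in range(n_particles)]
    vel = [[0.0] * n_attrs for _ in range(n_particles)]   # step 3: start at rest
    pbest = [p[:] for p in pos]
    gbest = max(pbest, key=fitness)[:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_attrs):
                r1, r2 = random.random(), random.random()
                # (2) velocity update, then (3) position update, clipped to [0, 1]
                vel[i][d] = (W * vel[i][d]
                             + C1 * r1 * (pbest[i][d] - pos[i][d])
                             + C2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            if fitness(pos[i]) > fitness(pbest[i]):      # step 6: update Pbest
                pbest[i] = pos[i][:]
        gbest = max(pbest, key=fitness)[:]               # step 6: update Gbest
    # step 7: keep only attributes whose final weight stays above the threshold
    return [d for d, w in enumerate(gbest) if w >= threshold]

# Hypothetical fitness rewarding attributes 0 and 2 and penalizing attribute 1
keep = pso_select(lambda w: w[0] + w[2] - w[1], n_attrs=3)
```

With this fitness the swarm is driven toward weights near 1 for attributes 0 and 2 and near 0 for attribute 1, so those two attributes survive the threshold.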

Decision tree
The decision tree is a data mining classification method that can be used to solve both binary-class and multi-class classification problems [26]. The concept of the decision tree model is a top-down structure that turns data into a tree and then produces rules. The decision tree requires an algorithm for determining the root node, internal nodes, and leaf nodes of the model [27]. Algorithms often used to solve classification problems are ID3, C4.5, and C5.0 [28]. A comparison of these algorithms is shown in Table 2.

C5.0 Algorithm
The C5.0 algorithm is a development of the previous algorithms, ID3 and C4.5 [17]. Its advantages are low memory usage, faster decision making, and more effective decision tree models [18, 20]. The C5.0 model works by calculating the entropy value and the information gain of each attribute [19]. The attribute with the largest information gain becomes the root node. The root node then forms new criterion branches called internal nodes. The process stops when no new criterion is generated. The final nodes serve as outputs called leaf nodes, which predict the class of the classifier [26]. The procedure for building the decision tree model with the C5.0 algorithm is as follows:
Step 1. Input the training dataset.
Step 2. Calculating the total entropy value of the dataset using (4):

Entropy(S) = - sum_{i=1}^{c} p_i * log2(p_i)   (4)

Step 3. Calculating the entropy value of each criterion.
Step 4. Calculating the information gain value of each attribute using (5):

Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)   (5)

Step 5. Selecting the attribute with the largest information gain as the root node.
Step 6. Creating branches from the node's criteria.
Step 7. Repeating the calculation of the entropy and information gain values. The process stops when all attribute criteria have produced a predicted class.

where p_i is the proportion of instances of class i in dataset S, c is the number of classes, and S_v is the subset of S for which attribute A has value v.
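The entropy and information gain computations in (4) and (5) can be sketched directly; the tiny "outlook" example below is hypothetical, chosen so the gain is easy to verify by hand:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy (4): -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain (5): Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
# The attribute splits the classes perfectly, so the gain equals the full entropy
print(information_gain(rows, labels, "outlook"))  # 1.0
```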

Cost-sensitive learning
Cost-sensitive learning is a learning method that can solve imbalanced data problems [14]. Its concept is to reduce the misclassification cost in order to improve classifier performance. The method assumes that the minority class has a higher misclassification cost than the majority class; therefore, the learner focuses on classifying the minority class accurately. Cost-sensitive learning can be categorized into two criteria. The first is the direct method, in which the classifier itself is cost-sensitive, such as ICET and the cost-sensitive decision tree. The other category is cost-sensitive meta-learning, which is divided into two techniques, namely thresholding and sampling. Thresholding methods include MetaCost, CostSensitiveClassifier, cost-sensitive naive Bayes, and empirical thresholding [18]. Figure 1 shows the structure of cost-sensitive learning.

MetaCost
MetaCost is a cost-sensitive method that uses the thresholding meta-learning approach to minimize costs [15]. Its working principle is to calculate the class probability estimate at each leaf node [16]. Based on these probabilities, re-learning of each node is performed by pruning or re-labeling until the minimum expected cost is obtained. The cost is denoted as C(i, j), where i is the actual class and j is the predicted class, so that i ≠ j causes a misclassification [22]. The class with the minimum expected cost is used as the predicted class, i.e. the new leaf node, to obtain optimal performance. The procedure for applying the MetaCost algorithm is described below.
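The core of MetaCost's re-labeling step is choosing, for each instance, the class j that minimizes the expected cost sum_i P(i|x) * C(i, j), where P(i|x) comes from a (bagged) probability estimator. A minimal sketch of that rule follows; the 3-class probabilities and cost matrix are hypothetical examples, not values from this paper:

```python
def metacost_relabel(class_probs, cost):
    """class_probs[i]: P(i|x); cost[i][j]: cost of predicting j when the truth is i."""
    n_classes = len(cost)
    expected = [sum(class_probs[i] * cost[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    return min(range(n_classes), key=expected.__getitem__)  # argmin expected cost

# Hypothetical cost matrix: misclassifying the minority class (index 2) costs 10
cost = [[0, 1, 1],
        [1, 0, 1],
        [10, 10, 0]]
# Although class 0 is most probable, the high minority cost shifts the label to 2
print(metacost_relabel([0.5, 0.3, 0.2], cost))  # 2
```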

Validation multiclass
The cost matrix is a representation of prediction errors (misclassification) in machine learning [8]. It is defined as C(i, j), where i is the actual class and j is the predicted class [22]. For three classes, the misclassification costs are C(2,1), C(3,1), C(1,2), C(3,2), C(1,3), and C(2,3), while the correct-classification costs C(1,1), C(2,2), and C(3,3) have the value 0. The cost matrix is shown in Table 3. The predictions of the classification model are presented in the confusion matrix, in which the actual class is presented in the rows and the predicted class in the columns [15]. The confusion matrix is used to measure the performance of a classifier model based on TP, TN, FP, and FN: true positive (TP) and true negative (TN) when the prediction is correct [8]; false positive (FP) when the prediction is (+) but the actual class is (-); false negative (FN) when the prediction is (-) but the actual class is (+). The performance of the classifier can be seen from the classification accuracy: the higher the accuracy, the better the performance of the classification model [15]. Accuracy is a testing method based on the number of correctly predicted classifications. The equation for calculating accuracy is shown in (6):

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (6)

The misclassification cost is the value of class prediction errors that occur because the classifier cannot predict the class correctly [8]. It is inversely related to accuracy: if the accuracy value is large, the classifier performs well, but if the cost value is large, the classifier performs poorly. The total cost is obtained by weighting each confusion matrix count N(i, j) by its cost C(i, j):

Total cost = sum_i sum_{j != i} C(i, j) * N(i, j)   (7)
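The accuracy in (6) and the total misclassification cost can be computed directly from a multi-class confusion matrix cm[i][j] (actual class i, predicted class j) and a cost matrix with zero cost on the diagonal; the matrices below are hypothetical examples for illustration:

```python
def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def total_cost(cm, cost):
    """Sum of confusion counts weighted by their misclassification costs."""
    return sum(cm[i][j] * cost[i][j]
               for i in range(len(cm)) for j in range(len(cm)))

cm = [[50, 2, 1],
      [3, 40, 2],
      [1, 1, 10]]                 # class 2 is the minority class
cost = [[0, 1, 1],
        [1, 0, 1],
        [5, 5, 0]]                # minority-class errors are 5x more costly
print(round(accuracy(cm), 4))     # 0.9091
print(total_cost(cm, cost))       # 18
```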

Training process
In this research, there are several stages, namely: (i) data collection, (ii) preprocessing, (iii) attribute selection, (iv) the training and testing process, and (v) validation. These are the steps carried out to solve the problem of multi-class imbalanced data.
Step 1. Input the dataset for processing.
Step 2. Replacing missing values, removing outliers from the dataset, and performing min-max normalization.
Step 3. Selecting the attributes to be used with PSO.
Step 4. Building and testing the decision tree model in the training process.
Step 5. Remodeling the decision tree by minimizing the cost using the MetaCost method.
Step 6. Performing validation by calculating the accuracy and total cost values.

RESULT AND DISCUSSION
The classifier model was tested on an Intel Core i5 personal computer with 16 GB RAM running Windows. Table 5 shows the accuracy results obtained with the "rminer" machine learning package. The output of the learner is a confusion matrix that shows the performance of the classifier. This research uses three testing scenarios. The first scenario uses the decision tree algorithms (ID3, C4.5, and C5.0) alone. The second scenario uses scenario 1 plus PSO. The third scenario uses scenario 2 plus the MetaCost method. In the first scenario, (1) ID3 has the lowest performance because it cannot handle continuous data; (2) C4.5 performs well on the Glass, Lymphography, and Wine data, with values of 68.22%, 75.68%, and 93.82%. Figure 2 is a representation of Table 5, where the best performance is obtained with scenario 3. C4.5 has the highest accuracy on the Vehicle and Wine data, with values of 76.86% and 97.62%. C5.0 has the highest accuracy on the Glass and Thyroid data, with values of 76.17% and 95.81%, while on the Lymphography data C4.5 and C5.0 have the same accuracy of 83.33%. Figure 3 shows each algorithm's share of the total accuracy over all datasets for scenario 3: out of 100%, ID3 accounts for 19.23%, C4.5 for 40.24%, and C5.0 for 40.91%. Table 6 shows the cost calculated for each classifier. Cost is the value of misclassification: a classifier with a large cost value is a bad classifier, while a classifier with a small cost value is a good classifier. In the first scenario, ID3 has the largest cost of all the algorithms; C4.5 has a small cost on the Glass, Lymphography, and Wine data; C5.0 has a small cost on the Vehicle and Thyroid data. In the second scenario, ID3 is the worst classifier.
C4.5 has a small cost on the Vehicle and Lymphography data; C5.0 has a small cost on the Glass and Thyroid data, while C4.5 and C5.0 have the same cost of 200 on the Wine data. In the third scenario, ID3 is the worst classifier; C4.5 has a small cost on the Vehicle and Wine data; C5.0 has a small cost on the Glass and Thyroid data, while C4.5 and C5.0 have the same cost of 1058 on the Lymphography data. Table 7 shows the time needed during the testing process; the data is taken from scenario 3. Based on the testing of the five datasets, C5.0 has a faster processing time in making decisions than C4.5. Figure 4 shows a graph of the training process time: the blue line is the C4.5 process and the red line is the C5.0 process; based on the graph, C5.0 has a faster processing time than C4.5.

Table 6. Total misclassification cost per dataset for each algorithm in scenarios 1-3

Dataset      | ID3 S1 | ID3 S2 | ID3 S3 | C4.5 S1 | C4.5 S2 | C4.5 S3 | C5.0 S1 | C5.0 S2 | C5.0 S3
Vehicle      | 156800 | 156800 | 156800 | 23762   | 21632   | 15138   | 22898   | 23762   | 17298
Glass        | 41472  | 41472  | 41472  | 9248    | 9248    | 8192    | 9800    | 8192    | 5202
Lymphography | 15138  | 15138  | 15138  | 2592    | 1568    | 1058    | 2738    | 1682    | 1058
Wine         | 28322  | 28322  | 28322  | 242     | 200     | 32      | 288     | 200     | 98
Thyroid      | 8450   | 8450   | 8450   | 338     | 338     | 392     | 288     | 288     | 162

CONCLUSION
This paper is compiled based on the literature and theoretical concepts of data mining and machine learning to find a solution for handling multi-class imbalanced data problems. Three results were demonstrated in this research. First, in scenario 2, PSO improves the performance of C5.0, as shown in Table 5. Second, in scenario 3, C5.0 performs better than ID3 and C4.5: C5.0 has a performance share of 40.91%, while C4.5 and ID3 have 40.24% and 19.23%, as shown in Figure 3. Third, based on Table 7, C5.0 makes decisions faster than C4.5. Based on these three results, it can be concluded that the cost-sensitive decision tree C5.0 method performs better than ID3 and C4.5 for solving multi-class imbalanced data problems. In future research, C5.0 can be combined with cost-sensitive meta-learning sampling techniques such as costing and weighting to obtain an even more optimal classifier performance.