Sampling methods for handling imbalanced data in an Indonesian health insurance dataset

Health insurance fraud is one of the most frequently occurring fraudulent acts and has become a concern for every insurer. According to data from the Indonesian General Insurance Association or Asosiasi Asuransi Umum Indonesia (AAUI), the private insurance industry suffered losses of up to billions of rupiah throughout 2018 due to fraudulent acts committed by perpetrators. The problem with fraud in Indonesia is that the current system is highly vulnerable and detection is still done manually. Another problem in detection is the imbalanced data that often occurs in fraud cases. In this research, we applied several sampling methods, using several machine learning models as baselines. The results show that the instance hardness thresholding algorithm combined with extreme gradient boosting gives the best performance in all cases, indicating that the method can reduce bias and achieve better generalization. This is an open access article under the CC BY-SA license.


INTRODUCTION
Health insurance fraud is one of the most frequently occurring fraudulent acts and has become a concern for every insurer. Fraud in health insurance can take the form of a claim for a case not covered by insurance, or even a claim for a health procedure that was never carried out. Another problem in detecting fraud is that the procedure is still done manually.
A public health expert said the Healthcare and Social Security Agency or Badan Penyelenggara Jaminan Sosial (BPJS Kesehatan) suffered a loss of around Rp 1.86 trillion (USD 140.5 million). Furthermore, Rp 6.9 trillion required further investigation [1]. Based on a report provided by the Association of Certified Fraud Examiners (ACFE), losses caused by fraud in Indonesian health services amount to around 5% of total service costs. Based on observational data, around 175,000 claims were proven fraudulent, with a total loss of 400 million rupiah. Up to 2019, as many as 1 million claims were suspected of being fraudulent [2].
According to Fatimah et al. [2], the prevalence of fraudulent activities in Indonesia has been exacerbated by a system that is extremely vulnerable to such deceptions. Currently, the majority of fraud detection mechanisms are manual, which substantially lengthens the time required to identify illegal processes. As a result, the rate of effective identification and resolution of fraud lags behind, creating an urgent need for enhanced automated systems to improve security.
It can be concluded from the problems mentioned above that a system that can detect fraud in health insurance data is needed. Several studies have carried out health insurance fraud detection [3]-[5] using the extreme gradient boosting (XGBoost) method as a classifier. Other studies, such as [5], [6], used the Catboost method, and [7] used the logistic regression method for modeling. Another obstacle is the problem of imbalanced datasets, which often occurs in fraud cases. Several studies have attempted to overcome this problem; one of them is [8], which uses a random undersampler. In this study, we propose several solutions to the problem of insurance fraud detection. One proposed solution is to use XGBoost [3]-[5], which will be compared with several algorithms such as K-nearest neighbor (KNN), support vector machine (SVM), and logistic regression. We also compared results obtained with and without sampling. The main objectives of this research are:
− Examining the best algorithm for the insurance fraud detection process.
− Comparing results between using a sampling method and not using one.
− Determining a sampling method that delivers optimal results on data from health insurance companies in Indonesia.
The rest of the paper is structured as follows: section 2 provides previous research on health insurance fraud. Section 3 explains the dataset and the distribution of the data. Section 4 explains the methodology, and each step of the methodology is explained thoroughly. Section 5 presents the experiment and its results. Section 6 concludes the findings.

RELATED WORKS
Research to overcome health insurance fraud has been carried out in various ways, one of which is the artificial intelligence approach. In the research conducted by Akbar et al. [3], the methods used for the modeling process are XGBoost and random forest. A random undersampling process is also carried out to obtain balanced data and improve the results. The conclusion of that study is that XGBoost with a random undersampler and tuned parameters delivers good results. However, it should be understood that this method is prone to noise and outliers.
Shamitha and Ilango [9] proposed the use of an artificial neural network (ANN) for modeling. They carried out several preprocessing stages: handling missing values, data filtering, label encoding, data transformation, and scaling. The following stage is dimensionality reduction using principal component analysis (PCA). Besides that, they used the synthetic minority over-sampling technique (SMOTE) to overcome the problem of imbalanced data before modeling. The study concludes that an ANN with the right preprocessing provides the best accuracy compared to several other machine learning benchmark methods.
Hancock and Khoshgoftaar [6] compared the Catboost method with the XGBoost method. The study found that Catboost provides a baseline that reduces the preprocessing needed for imbalanced data, while XGBoost achieves a higher mean area under the curve (AUC) for complex features. The same research was conducted by Hancock and Khoshgoftaar [5] with the same data but different experimental scenarios, and its results confirm what was stated in [6]. Different from the approaches of Hancock and Khoshgoftaar [5], [6] and Akbar et al. [3], the approach taken by Seo and Mendelevitch [10] and Georgakopoulos et al. [11] was to detect fraud with outlier detection analysis.
Based on the research done by previous researchers, it is known that fraud detection always faces problems in handling imbalanced data. Several studies have tried to deal with this by applying sampling methods such as the random undersampler [3] and SMOTE [9]. Therefore, this research will focus on determining the best sampling method. Our modeling process will use the XGBoost, KNN, logistic regression, and SVM methods.

DATA
The data used in this study was acquired from the data warehouse of one of the health insurance companies in Indonesia. The data has 11,882 instances with 51 features. Table 1 and Figure 1 provide an overview of the distribution between the fraud and non-fraud classes. We divided the data into independent and dependent variables. Of the 51 available features, only 12 were used as independent variables, i.e. Total Claim, Total Claim Amount, Total LOS, TotalICU LOS, InvestigationFlg, BasicPremiumCollected, PolicyAge, InsuredAge(claim), InsuredSmoke, InsuredGender, InsuredWeight, and InsuredHeight; one feature, HoldFlg, is used as the class.

METHOD
The process carried out at the methodology stage is as follows. The first process is preprocessing the data. Preprocessing starts with selecting the features using Pearson's correlation coefficient. Subsequently, we checked whether the data had missing values and checked the distribution of the data. The next step is to change the data from text to categorical values using a label encoding approach. Afterward, a normalization process is carried out for each selected feature; the normalization used is scaling with min-max values. In the following process, we divided the experiment into two scenarios, which will be explained in more detail in section 5. We selected four classifiers as baselines for this study: SVM, logistic regression, XGBoost, and KNN. The selection of XGBoost referred to the research in [4], [5], [12], and the selection of logistic regression was based on [7], [12], [13]. In the following stage, we assessed the models with measurements such as accuracy, precision, recall, and F1-score. The research methodology is illustrated in Figure 2. Each subsection provides a complete description of the steps of the methodology, starting from preprocessing, sampling methods, and machine learning models, and lastly the evaluation metric.
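As a sketch of this experimental loop, the snippet below runs a single baseline classifier under 10-fold cross-validation. The feature matrix, class balance, and classifier choice are synthetic stand-ins (the actual insurance dataset is private), so only the structure of the procedure is illustrated.

```python
# Sketch of the experimental loop: data -> model -> 10-fold cross-validation.
# The data here is a synthetic stand-in for the private insurance dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))   # 12 features, like the selected variables
y = np.zeros(200, dtype=int)
y[:20] = 1                       # ~10% positive ("fraud") class, imbalanced

# 10-fold cross-validation, as used for every experiment in this paper
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(len(scores), round(scores.mean(), 3))
```

The same loop is repeated for each classifier and each sampling method in the experiments.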

Preprocessing
The selected main features will be examined for correlation using Pearson's correlation coefficient in the first stage. The formulation of Pearson's correlation coefficient is [14]:

$r_{xy} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$ (1)

where x and y are features. After computing the correlation for each feature pair, we looked for correlation values above the threshold we set and dropped one feature of each such pair.
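A minimal sketch of this correlation-based filter, using pandas and hypothetical column names (not the actual dataset schema): compute pairwise Pearson correlations and drop one feature of any pair whose absolute correlation exceeds the threshold.

```python
import pandas as pd

# Toy frame with illustrative feature columns; TotalClaimAmount is
# deliberately a perfect linear function of TotalClaim.
df = pd.DataFrame({
    "TotalClaim":       [1, 2, 3, 4, 5],
    "TotalClaimAmount": [10, 20, 30, 40, 50],
    "InsuredAge":       [30, 45, 27, 60, 38],
})

threshold = 0.8
corr = df.corr(method="pearson").abs()

to_drop = set()
cols = list(corr.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > threshold:
            to_drop.add(cols[j])      # keep the first feature of the pair

reduced = df.drop(columns=sorted(to_drop))
print(sorted(to_drop))                # → ['TotalClaimAmount']
```

In the paper's experiments no feature pair actually exceeded the 0.8 threshold, so all 12 features were retained.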
The following preprocessing stage was the search for missing values. We handled missing values by dropping the rows that contained them. Then comes the label encoding process, in which each string category is assigned a discrete value, for example gender = ["Male", "Female"] becomes gender = [0, 1].
The last step of preprocessing was normalization. In the normalization process, we used feature scaling with min-max values, where the formulation is [15]:

$f' = \dfrac{f - \min(f)}{\max(f) - \min(f)}$ (2)

where f is the data value of the feature to be processed by the feature scaling.

Sampling method
Random undersampling removes samples from the majority class until the amount of data equals that of the minority class. It is one of the easiest methods to deal with the imbalanced data problem [16]. Instance hardness thresholding is an undersampling method that removes "hard" samples: if the probability of a sample is smaller than the specified threshold, the sample is deleted [17]. Random oversampling replicates the minority class by taking values from that class and duplicating them, so the dataset grows; it does not change the diversity of the sample and does not create new data [16]. SMOTE is an oversampling method in which the minority class is increased by synthesizing new samples until it equals the majority class; SMOTE sampling is done to minimize the expected risk [18]. Adaptive synthetic sampling (ADASYN) is an oversampling method that weights the distribution of minority class samples based on their difficulty in the learning process: the harder a minority sample is to learn, the more synthetic samples are generated for it [19]. Synthetic minority over-sampling technique-TomekLink (SMOTE-Tomek) is a hybrid learning method that combines the SMOTE and TomekLink methods. The standard flow is: first, the imbalanced data (D) is balanced by SMOTE, and then the TomekLink algorithm carries out the undersampling process [20]. A similar process is carried out by the synthetic minority over-sampling technique-edited nearest neighbor (SMOTE-ENN), where SMOTE carries out the oversampling process and the undersampling is done by the ENN method [21].
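The two simplest samplers above can be sketched in pure NumPy; in the experiments themselves, the imblearn library [28] provides these and the more elaborate methods (instance hardness thresholding, SMOTE, ADASYN, SMOTE-Tomek, SMOTE-ENN).

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Drop majority-class rows at random until both classes are equal."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def random_oversample(X, y, seed=0):
    """Replicate minority-class rows at random until both classes are equal."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=n_max - len(idx), replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)          # imbalanced: 8 vs 2

Xu, yu = random_undersample(X, y)
Xo, yo = random_oversample(X, y)
print(np.bincount(yu), np.bincount(yo))  # → [2 2] [8 8]
```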

Machine learning method
KNN is a supervised, non-parametric learning method. KNN is a relatively simple machine learning method that performs classification by checking the k surrounding data points, where the majority class becomes the class of the new data. This method is often called a lazy model because it does not do intensive learning during training [13]. In calculating distance, the KNN method uses distance measures such as the Euclidean distance, $d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$. XGBoost is a method proposed by Chen and Guestrin to improve the gradient tree method [22]. XGBoost is an efficient algorithm and one of the most frequently used methods in Kaggle competitions. The main idea of this method is to form sub-trees of the original tree, where each subsequent tree reduces the error of the previous tree [22]. Logistic regression is a supervised learning method that is usually used in multivariable analysis. Unlike linear regression, which produces continuous results, logistic regression produces binary targets [23]. The SVM determines the optimum hyperplane for separating the data by assuming that the best decision boundary has the greatest distance and margin from both classes of data; SVM is also known as a maximum margin classifier [24].
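The four baselines can be sketched on a toy problem with scikit-learn. Note one assumption: XGBoost lives in the separate xgboost package (`xgboost.XGBClassifier`), so scikit-learn's gradient boosting classifier stands in for it here to keep the sketch self-contained.

```python
# Fit the four baseline classifiers on a tiny synthetic dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a simple linear toy target

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "gboost": GradientBoostingClassifier(),  # stand-in for XGBoost
}
for name, model in models.items():
    acc = model.fit(X, y).score(X, y)      # training accuracy, for illustration
    print(name, round(acc, 2))
```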

Evaluation metric
This study evaluates model performance using accuracy, precision, recall, and F1-score. Accuracy measures the model's overall correctness on positive and negative predictions. Precision measures the accuracy of positive identifications, while recall measures the proportion of real positives that are identified. The F1-score balances precision and recall, making it beneficial when false positives and false negatives have distinct costs. The formulas for the four metrics are [13]:

$Accuracy = \dfrac{TP + TN}{TP + TN + FP + FN}$ (3)

$Precision = \dfrac{TP}{TP + FP}$ (4)

$Recall = \dfrac{TP}{TP + FN}$ (5)

$F1 = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$ (6)

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
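The four metrics written out directly from the confusion-matrix counts, with illustrative (not actual) counts:

```python
# Compute the four evaluation metrics from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 40 true positives, 50 true negatives, 10 false
# positives, 0 false negatives.
acc, prec, rec, f1 = metrics(tp=40, tn=50, fp=10, fn=0)
print(acc, prec, rec, f1)   # accuracy 0.9, precision 0.8, recall 1.0, F1 ≈ 0.889
```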

EXPERIMENT AND DISCUSSION
The experiments in this paper were carried out using a dataset from one of Indonesia's largest health insurance companies. An explanation of the dataset and its distribution is given in section 2. This experiment was carried out using several Python libraries such as scikit-learn [25], pandas [26], [27], and imblearn [28]. Figure 3 describes the distribution of the data in a scatter plot, using PCA to reduce the dimensionality of the data. Figure 4 provides an overview of the heatmap of the correlation coefficients for the features used. The scatter plot of our health insurance fraud dataset shows that the distribution between fraud and non-fraud is not easy to separate. In this research, we used a threshold of 0.8 for the Pearson correlation. Based on the correlation heatmap in Figure 4, no pair of features had a correlation value above 0.8; therefore, we used all the features in our research. Moreover, Table 2 shows that our data did not have any missing values.

Subsequently, we divided this research into four parts. In the first part, we studied health insurance fraud using only the raw data, referred to in the following as the "Vanilla" approach. In the second part, we experimented with several over-sampling methods. The following part used several undersampling approaches, and lastly, we used a hybrid sampling approach. Each experiment was conducted with 10-fold cross-validation to determine the best model. Table 3 shows the result of our experiment in the vanilla approach, while Tables 4 to 10 show the results of our experiments using the sampling methods.

Based on Table 3, the accuracy results are out of balance with the precision, recall, and F1-score values. This can be considered a problem because the accuracy describes a reasonably high result, with the maximum on XGBoost being 99.09%, while in reality the precision, recall, and F1-score values are unbalanced with the accuracy value. This imbalance is explained by the confusion matrix given in Figure 5: most predictions are centered on the class that has the most training data, which constitutes a bias in the machine learning model created. Based on Tables 4 to 10, the sampling process greatly affects the results and reduces the bias of the machine learning models. The most suitable approach for our health insurance data is undersampling, especially using instance hardness sampling. This is because instance hardness thresholding is an algorithm that removes samples whose probability indicates "hardness" in the dataset. In addition, among the classifier models used in this study, the XGBoost method beats almost every method it is compared with; its margin of difference between precision, recall, and F1-score is the smallest compared to the other methods.

CONCLUSION
This paper discusses sampling algorithms for reducing bias in the evaluation results of imbalanced data. The data used comes from one of Indonesia's largest health insurance companies. This research uses four machine learning methods as baselines, i.e. SVM, KNN, logistic regression, and XGBoost. The sampling methods used fall into three categories: undersampling with random undersampling and instance hardness thresholding; oversampling with random oversampling, SMOTE, and ADASYN; and lastly, hybrid sampling using the SMOTE-ENN and SMOTE-Tomek methods. The results of this study show that the sampling process, whether undersampling, oversampling, or hybrid, can reduce bias in machine learning methods. This can be seen where the values of precision, recall, and F1-score are very low in the "Vanilla" approach, where the best precision, recall, and F1-score are less than 0.3, whereas after sampling, the accuracy, precision, recall, and F1-score have similar values, with the biggest difference being 0.1. This study also found that the best classifier is XGBoost; unlike the other classifier methods, XGBoost provides unbiased results and has the smallest margin for each sampling method.

Figure 2.
Figure 2. Workflow of Indonesia health insurance fraud detection

Figure 3.
Figure 3. Scatterplot PCA for health insurance fraud dataset

Table 1.
Data distribution frequency in health insurance dataset

Figure 1.
Figure 1. Data distribution in Indonesia health insurance dataset

Table 3.
Result without sampling method (vanilla approach)

Table 4.
Result with over-sampling method random oversampler

Table 5.
Result with over-sampling method SMOTE

Table 6.
Result with over-sampling method ADASYN

Table 8.
Result with undersampling method instance hardness