Forecasting the number of dengue fever based on weather conditions using ensemble forecasting method

ABSTRACT


INTRODUCTION
Dengue fever is a dangerous infectious disease whose case numbers have steadily increased over the years. It is caused by a virus carried in the saliva of the Aedes mosquito and transmitted to humans through its bite, and its severity ranges from mild to severe [1], [2]. According to the Epidemiological Data and Surveillance Center of the Ministry of Health, Indonesia, dengue fever remains a crucial problem in Indonesia because both the number of infections and the area of distribution are increasing along with population mobility and density.
Based on data from the Ministry of Health of the Republic of Indonesia, the national case fatality rate (CFR) of dengue fever was 0.67% in 2019. The CFR is the proportion of deaths among all reported cases. A province is said to have a high CFR if it exceeds 1%. One of the provinces with a high CFR is East Java, at 1.01%. According to the Malang Regency Health Office, Malang Regency had the highest number of dengue fever cases and deaths in East Java in 2019; therefore, efforts are needed to control the death rate from dengue fever in Malang.
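The CFR definition and the 1% threshold above can be written out as a short calculation. The following is a minimal sketch; the function names and the example counts are hypothetical and are not taken from the Ministry of Health data.

```python
def case_fatality_rate(deaths, reported_cases):
    """Return the CFR as a percentage: deaths divided by all reported cases."""
    if reported_cases <= 0:
        raise ValueError("reported_cases must be positive")
    return 100.0 * deaths / reported_cases


def has_high_cfr(cfr_percent, threshold=1.0):
    """A province is treated as high-CFR when its CFR exceeds the threshold (1%)."""
    return cfr_percent > threshold


# Hypothetical counts for illustration only (not from the source data).
example_cfr = case_fatality_rate(deaths=5, reported_cases=400)
```

With the figures quoted in the text, `has_high_cfr(1.01)` is true for East Java, while the national value of 0.67% falls below the threshold.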
One way to control the mortality rate in Malang Regency is to predict the future number of dengue fever cases. A forecasting model allows the parties in charge to take steps and arrange policies that minimize the increase in cases and mortality rates. Several dengue fever forecasts have been carried out using weekly or monthly case counts [3]-[5]. Previous research reports a fairly high correlation between the number of dengue fever cases and rainfall, temperature [6], and humidity [7].

ISSN: 2252-8938 
Forecasting the number of dengue fever based on weather conditions using … (Mursyidatun Nabilah)

Penalized regression is a regression model that uses a penalty term to reduce overfitting in multiple linear regression [8]. In this study, Ridge, Lasso, Elastic Net, smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP) regressions were explored. To overcome the limitations of a single forecasting model, ensemble methods can improve on the performance of the base models, giving higher accuracy and a better ability to identify complex objects and uncertainties [9]-[12].

METHODOLOGY
As shown in Figure 1, there are four main steps in building the ensemble model with penalized regression. First, raw data are gathered from various sources. They consist of climate data (temperature, humidity, wind speed, and rainfall) and the number of dengue cases. Data cleaning is then carried out to produce processed data that are ready for model development. The data are split into training data and testing data. The training data are used to train the models and the testing data are used to measure model performance; these data are also used to determine the penalized regression parameter.
To build the ensemble forecasting model, five penalized regressions (Ridge, Lasso, Elastic Net, SCAD, and MCP) are each trained and validated. Ridge regression is widely used for high-dimensional data in which the independent variables are highly correlated; it aims to reduce multicollinearity [13]. Lasso uses regularization and variable selection to increase interpretability and accuracy [14]. Elastic Net combines the Ridge and Lasso penalties, so it retains the advantages of both methods [15]. SCAD regression aims to improve on the Lasso penalty by reducing bias in the model, because the Lasso penalty grows linearly with the size of the regression coefficient [16]. MCP is another alternative that gives less biased estimates in sparse models [17]. The penalty parameter of each penalized regression is then determined.
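To make the differences between the five penalties concrete, the sketch below evaluates each penalty for a single coefficient, using the standard textbook forms. The scaling conventions (for example the factor 1/2 in the Ridge term), the mixing parameter `alpha`, and the shape parameters `a = 3.7` and `gamma = 3.0` are commonly used defaults assumed here; the study does not state which values it used.

```python
def ridge_penalty(beta, lam):
    """Quadratic penalty: grows with the square of the coefficient."""
    return 0.5 * lam * beta * beta


def lasso_penalty(beta, lam):
    """Absolute-value penalty: grows linearly with the coefficient size."""
    return lam * abs(beta)


def elastic_net_penalty(beta, lam, alpha=0.5):
    """Weighted mix of the Lasso (alpha) and Ridge (1 - alpha) penalties."""
    return alpha * lasso_penalty(beta, lam) + (1 - alpha) * ridge_penalty(beta, lam)


def scad_penalty(beta, lam, a=3.7):
    """SCAD: Lasso-like near zero, then tapering to a constant (requires a > 2)."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP: penalty rate decreases linearly to zero at |beta| = gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2 * gamma)
    return gamma * lam * lam / 2
```

For large coefficients the SCAD and MCP penalties become constant, whereas the Lasso penalty keeps growing; this is the source of the reduced bias mentioned above.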
After each model is evaluated, the aggregated prediction is formed by averaging the predictions of the individual models. Throughout, the data are processed sequentially along the time dimension. Once the ensemble forecasting equation has been formed, it is used to forecast the dependent variable (the weekly number of dengue fever cases) on the test data. After the forecasting models are built and the predictions are made, the analysis covers the predicted magnitude of dengue fever incidence and a strategy analysis. Finally, the model performance is tested.
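The averaging step can be sketched as follows. This is a minimal illustration that assumes each model has already produced one prediction per time step; the function name and the example values are hypothetical.

```python
def average_ensemble(predictions_by_model):
    """Average the per-time-step predictions of several models.

    predictions_by_model maps a model name to its list of predictions;
    all lists must cover the same time steps in the same order.
    """
    series = list(predictions_by_model.values())
    if not series:
        raise ValueError("at least one model is required")
    n_steps = len(series[0])
    if any(len(s) != n_steps for s in series):
        raise ValueError("all models must predict the same number of steps")
    return [sum(step_values) / len(series) for step_values in zip(*series)]


# Hypothetical predictions for two time steps (illustration only).
combined = average_ensemble({"scad": [10.0, 12.0], "elastic_net": [14.0, 8.0]})
```

Each ensemble scenario then corresponds to passing a different subset of models to the same function.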
This study tests two forms of data to obtain the best model results: normal (untransformed) data and data transformed with the natural logarithm (ln). The natural logarithm transformation is applied to stabilize the variance when performing standard regression procedures. Besides correcting non-constant variance, the transformation can also be used to correct for non-linearity and for residuals that are not normally distributed (non-normality) [18].
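A minimal sketch of the transformation and its inverse is given below. The `offset` parameter is an assumption added for illustration, since the natural logarithm is undefined for zero counts and the text does not say how such weeks were handled.

```python
import math


def ln_transform(values, offset=0.0):
    """Apply the natural logarithm to each value (after adding offset)."""
    transformed = []
    for v in values:
        if v + offset <= 0:
            raise ValueError("ln is undefined for non-positive values")
        transformed.append(math.log(v + offset))
    return transformed


def inverse_ln_transform(values, offset=0.0):
    """Map predictions made on the ln scale back to the original scale."""
    return [math.exp(v) - offset for v in values]
```

Errors such as RMSE are only comparable between the two forms of data if the ln-scale predictions are first mapped back to case counts with the inverse function.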

RESULTS AND DISCUSSION
The experiments were performed on an Intel® Core™ i5-7200U central processing unit (CPU) @ 2.50 GHz 2.70 GHz with 8 GB (gigabytes) of random-access memory (RAM), running Windows 10 Home Single Language (64-bit). The software used is RStudio, with R as the programming language. This section presents the results of each research step: splitting the data, determining the parameters, building the penalized regression models, building the ensemble model, and comparing the model's performance with other related methods.

Splitting training data and testing data
To train the model, the data are divided into training data and testing data. Training data are the part of the dataset used to fit a machine learning algorithm so that it can make predictions or perform other functions according to its goal. In essence, the user provides cues through the algorithm so that the trained machine can find the correlations on its own. Testing data are the part of the dataset used to assess the accuracy, in other words the performance, of the model. The full dataset, covering 2014 to 2018, is split sequentially in time, with 70% used as training data and 30% as testing data.
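A sequential 70:30 split keeps the earliest observations for training and the latest for testing, without shuffling. The sketch below assumes the rows are already sorted by date; the function name is hypothetical.

```python
def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows into an early training part and a late testing part."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Ten placeholder observations standing in for time-ordered records.
train_rows, test_rows = chronological_split(list(range(10)))
```

Because the split is not random, every testing observation lies later in time than every training observation, which matches how the model is used for forecasting.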

Determining the penalized regression parameter
The parameter used in penalized regression is the lambda value, which controls the amount of regularization applied to the regression model. The larger the lambda value, the more the coefficients are shrunk toward zero. When lambda equals 0, no regularization is applied and the model reduces to ordinary linear regression. For each model and data proportion, the lambda value with the smallest cross-validation error is selected [19]. Table 1 shows the selected lambda values, along with the lowest mean squared error (MSE) score for each lambda, in the Ridge, Lasso, and Elastic Net models. The selected lambda values for Lasso and Elastic Net are fairly large relative to that of the Ridge model. With a larger lambda value, it becomes possible for a variable to have a coefficient equal to zero. In other words, several independent variables are not chosen as predictors in the Lasso and Elastic Net regression models, since Lasso regression has a built-in variable selection feature [20] and Elastic Net is a combination of the Ridge and Lasso models [21].
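The selection of lambda by cross-validation can be illustrated with a deliberately small example: a Ridge fit with a single predictor and no intercept, for which the coefficient has a closed form. The contiguous fold layout, the candidate grid, and the toy data are all assumptions made for this sketch; they are not the study's actual settings.

```python
def fit_ridge_1d(x, y, lam):
    """Closed-form Ridge slope for one predictor and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def cv_mse(x, y, lam, k=5):
    """Mean squared error averaged over k contiguous validation folds."""
    n = len(x)
    fold_errors = []
    for fold in range(k):
        start, stop = fold * n // k, (fold + 1) * n // k
        # Train on everything outside the current fold.
        x_train, y_train = x[:start] + x[stop:], y[:start] + y[stop:]
        beta = fit_ridge_1d(x_train, y_train, lam)
        squared = [(y[i] - beta * x[i]) ** 2 for i in range(start, stop)]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


def select_lambda(x, y, grid, k=5):
    """Return the candidate lambda with the smallest cross-validation error."""
    return min(grid, key=lambda lam: cv_mse(x, y, lam, k))


# Toy data (hypothetical): y is exactly 2 * x, so no shrinkage is needed.
toy_x = [float(i) for i in range(1, 11)]
toy_y = [2.0 * v for v in toy_x]
best_lambda = select_lambda(toy_x, toy_y, grid=[0.0, 0.1, 1.0, 10.0])
```

On noise-free data the unpenalized fit is exact, so the smallest candidate is chosen; on noisy, correlated data a positive lambda typically gives the lowest cross-validation error.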
The calculation of the best lambda values for the SCAD and MCP models in Table 2 is slightly different from Ridge, Lasso, and Elastic Net models.The best lambda value is selected based on the lowest cross-validation error (CVE) value.The best lambda that can be used on SCAD is 0.908 with a CVE of 73.17, and MCP has a lambda of 0.520 with a CVE of 69.34.

Building penalized regression model
Each penalized regression model is evaluated on the testing data, using the 70:30 split between training and testing data. The performance of each model is measured by the root mean squared error (RMSE) and the symmetric mean absolute percentage error (SMAPE). The models were tested on both forms of data, namely normal and logarithmically transformed data. Because the smallest error for the number of cases on normal data (RMSE: 6.38) is lower than on logarithmically transformed data (RMSE: 8.95), normal data are chosen for building the penalized regression models. The performance of each penalized regression on normal data is shown in Table 3. On normal data, the SCAD model performs best among the penalized regression models, followed by the Elastic Net, Ridge, MCP, and finally Lasso models. The models are then combined into ensemble scenarios in order of increasing RMSE. In terms of prediction patterns, the Ridge model in Figure 2 captures the pattern quite well: its predictions tend to follow the increases and decreases in the actual data.
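The two error measures can be computed as in the sketch below. Several SMAPE variants exist; the form used here, with the mean of the absolute actual and predicted values in the denominator, is an assumption, as is the choice to count a term as zero when both values are zero.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denominator = (abs(a) + abs(p)) / 2
        if denominator > 0:  # a term with actual = predicted = 0 contributes 0
            total += abs(p - a) / denominator
    return 100.0 * total / len(actual)
```

RMSE is expressed in the same unit as the case counts, while SMAPE is a scale-free percentage, so the two measures complement each other when ranking models.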
The SCAD model in Figure 3 also captures the data patterns well. Compared with the Ridge and MCP models, SCAD follows the pattern better in the early period, from October 28, 2017 (10/28/2017) to December 28, 2017 (12/28/2017). This is visible in the predicted data, which tend to decrease, so the errors are smaller in this period. Even so, during the increase from October 31, 2018 (10/31/2018) to November 30, 2018 (11/30/2018), SCAD does not follow the pattern as well as the Ridge and MCP models. The MCP model in Figure 4 also follows the data pattern quite well, as shown by how closely its predictions track the actual values.
The Lasso model has a variable selection feature: it selects the independent variables that are relevant to the dependent variable. Even so, the Lasso results capture the actual data pattern less well and tend to be less sensitive. Figure 5 shows that the Lasso model does not capture the increase in cases properly, which may explain why its RMSE is the largest among the models. The Elastic Net model, whose predictions are shown in Figure 6, also has a variable selection feature like Lasso's. However, Elastic Net still captures the patterns and spikes in the testing data well, because it also incorporates the Ridge penalty, which handles multicollinearity [22], while selecting variables according to the patterns in the data.

Building ensemble model
The models are combined according to predefined scenarios, in order of their RMSE values from the previous tests. The results of the normal data ensemble scenarios are shown in Table 4.
The best model from the experiments is used to predict the next 8 weeks, from January 2019 to February 2019. To predict the number of dengue fever cases in this period, each independent variable (temperature, humidity, rainfall, and wind speed) must be forecast first. The methods used to forecast the independent variables differ depending on the data pattern. Based on observations, air temperature, air humidity, and wind speed have cyclical data patterns, in which the patterns repeat over a long period of time [23]. These three variables can therefore be predicted using the multiplicative decomposition method [24]. The temperature forecast for the next 8 weeks is shown in Figure 7.
The best model forecasts the number of cases one week ahead at a time, and each forecast is used as the lag feature for the following week. This is repeated 8 times until the full forecast is formed. The combination of SCAD + Elastic Net on normal data, which is the best-performing model, is used to predict the number of dengue fever cases in Malang Regency for the first 8 weeks of 2019. The forecast indicates a decrease in the number of dengue fever cases over these 8 weeks, as visualized in Figure 8. In contrast, the actual 2019 data in Table 5 show that the number of cases tended to increase. This difference may be due to other factors or variables that drive an increase in dengue fever cases. Dengue transmission in East Java tends to be influenced by population density, population mobility, urbanization, residential areas, and public places [25]. This research uses only climate variables as predictors, so factors other than climate may have a greater influence on the increase or decrease in the number of dengue cases.
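The week-by-week procedure, in which each forecast becomes the lag feature of the next step, can be sketched as a simple loop. The one-step model is passed in as a function; the stand-in model and its numbers below are hypothetical and serve only to show the mechanics.

```python
def recursive_forecast(predict_one_step, last_observed, future_features):
    """Forecast several steps ahead by feeding each forecast back as the lag.

    predict_one_step(features, lag) returns the forecast for one step;
    future_features holds the forecast climate variables for each step.
    """
    forecasts = []
    lag = last_observed
    for features in future_features:
        y_hat = predict_one_step(features, lag)
        forecasts.append(y_hat)
        lag = y_hat  # the forecast replaces the unknown actual value
    return forecasts


def toy_model(features, lag):
    """Stand-in one-step model (hypothetical coefficients)."""
    return 0.5 * lag + 1.0


# Three steps ahead, starting from a last observed count of 10.
toy_path = recursive_forecast(toy_model, 10.0, [{}, {}, {}])
```

Since forecasts are reused as inputs, any error in an early step is carried into the later steps, which is one reason multi-step forecasts can drift away from the actual values.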

Comparison with other methods
To assess whether forecasting with the SCAD + Elastic Net ensemble is good enough, its results need to be compared with those of other methods. Here the SCAD + Elastic Net model is compared with multiple linear regression [26]. The models are compared on RMSE, and their performances are listed in Table 6. In addition, regarding the regression coefficients of the independent variables, the Elastic Net and SCAD models, as penalized regressions, can shrink a regression coefficient to 0. In other words, they can eliminate independent variables that are less significant to the model [27]. In multiple linear regression, all independent variables (wind velocity, rainfall, humidity, air temperature, and lag-1), together with the intercept (the mean value of the response variable when all predictors equal zero), are included in the model. By contrast, the Elastic Net model selected only one variable, lag-1, which consists of the number of dengue cases shifted back one day from the original data. In the SCAD model, humidity was eliminated. These results can be seen in Table 7.
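Why a penalized model can set a coefficient to exactly 0 is easiest to see with the soft-thresholding operator, which gives the Lasso solution for a single standardized, uncorrelated predictor. The sketch below illustrates that mechanism only; the variable names and coefficient values are hypothetical and do not reproduce Table 7.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def selected_variables(unpenalized, lam):
    """Names of the variables whose coefficient stays non-zero after thresholding."""
    return [name for name, z in unpenalized.items() if soft_threshold(z, lam) != 0.0]


# Hypothetical unpenalized coefficients (illustration only).
kept = selected_variables({"lag_1": 0.9, "rainfall": 0.2, "humidity": -0.1}, lam=0.5)
```

Multiple linear regression corresponds to `lam = 0`, where every variable is kept; increasing `lam` removes the weakest predictors first.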

CONCLUSION
Based on the results of this research, the following conclusions can be drawn. The best method for forecasting dengue fever cases is the ensemble model that combines the SCAD and Elastic Net penalized regressions, with an RMSE of 6.38. The logarithmic transformation of the case counts does not give better performance than normal data: the smallest RMSE is 8.95 for the ln-transformed data and 6.38 for the normal data. In the variable selection of one of the ensemble's component models (Elastic Net), only the lag-1 variable has a non-zero regression coefficient, which means that only lag-1 is used in constructing the Elastic Net model. In the SCAD model, only one variable has a regression coefficient equal to 0. To improve forecast performance, the selection of variables needs to be reconsidered. In addition to climate factors such as temperature, humidity, rainfall, and wind speed, future research can explore other variables such as population density, population mobility, economic growth, environmental sanitation, urbanization, and community behavior.

Figure 2. Comparison between Ridge and actual


Int J Artif Intell, Vol. 12, No. 1, March 2023: 496-504

Among the normal data ensemble scenarios in Table 4, the BEST II scenario, which combines the SCAD and Elastic Net models, has the lowest RMSE, meaning it has the lowest error of all the scenarios. In order of increasing RMSE, it is followed by the BEST III, ALL, and BEST IV scenarios.

Figure 6. Comparison between Elastic Net and actual

Figure 7. Forecast of climate data

Figure 8. Forecast of the number of dengue cases for the next 8 weeks

Table 2. Lambda values for SCAD and MCP

Table 3. Performance of the penalized regression models

Table 4. Performance of the ensemble model scenarios

Table 5. Predicted vs actual dengue cases of 2019

Table 6. Ensemble model vs multiple linear regression

Table 7. Regression coefficients of multiple linear regression, SCAD, and Elastic Net