Artificial neural network forecasting performance with missing value imputations

Received Nov 12, 2019; Revised Jan 25, 2020; Accepted Feb 2, 2020
This paper presents a time series forecasting method that aims for high forecast accuracy. A modern time series approach is developed for data with missing values. Artificial neural networks (ANNs) are used to forecast future values, with the missing values imputed by the average method, the normal ratio method and the modified normal ratio method. The results are validated using mean absolute error (MAE) and root mean square error (RMSE). The results show that choosing the right method for handling missing values can improve ANN forecast accuracy. This is seen in both measurements, as the MAE improved from 8.75 to 4.56 and the RMSE from 10.57 to 5.85. Thus, this study suggests that understanding the problems in time series data can produce more accurate forecasts and support correct decision making.


INTRODUCTION
Data can be recorded over time at hourly, daily or yearly intervals. Data recorded in this form are known as time series. By analysing time series data, the structure of the data can be captured to build a model [1]. Many forecasting techniques have been reported in the literature [2]. In general, these techniques can be classified roughly into three groups: modified classical statistical methods, techniques based on artificial intelligence, and advanced techniques.
The most important classical method is the autoregressive integrated moving average (ARIMA) model because of its flexibility in modelling different types of data sets [1,3]. This model is built from the autoregressive (AR) model, the moving average (MA) model and the combination of the two, which is known as the ARMA model. In addition, if the series contains a seasonal component, the model is known as the seasonal ARIMA (SARIMA) model [3]. This flexibility makes the model competitive with recently developed methods. However, its major limitation is that it can only capture the linear form of time series data, and the preliminary analysis stages become a constraint in building the model [4].
As an alternative to the classical methods, forecasters have developed new methods that can overcome these limitations, such as the artificial neural network (ANN) and fuzzy time series (FTS). ANN has been widely used as a forecasting model in many applications [5]. These include the application reported in [6], electricity price [7], airline data [8], and many others. This is because ANN is flexible in forecasting applications, since it can model both linear and non-linear processes [9]. In developing an ANN model, data pre-processing is important before analysis, and the data are normally transformed to the interval [-1, 1] or [0, 1] [10][11][12]. Unlike previous studies, this analysis considers the importance of imputing missing data for the ANN forecast. The forecast performance is validated using real air quality data.
Air pollution is the major pollution problem in the world [13]. Power production from power plants, fuel burning by vehicles, industrial processes and natural factors such as volcanic eruptions worsen air quality. Air quality has become a major concern worldwide because its effects are diverse and numerous [14]. Pollution affects not only human health but also forests, waters and the whole ecosystem.
In Malaysia, a series of haze episodes has been reported since the 1980s [15]. Massive land and forest fires in Sumatra and Kalimantan, Indonesia have been the main cause of these haze episodes. The winds have made it easier for the heavy haze to be transported. According to the DOE report [16], for the first time in Malaysia's history, 34 stations in the country recorded unhealthy air quality status, which happened on 15 September 2015. Besides Malaysia, haze also reaches other Southeast Asian countries such as Singapore, Thailand and Brunei [17].
The Department of Environment (DOE) is the government agency responsible for monitoring and managing Malaysia's air quality. To identify the severity of air pollution and inform the public, ambient air quality in Malaysia is described in terms of the Air Pollutant Index (API). The API value is calculated from the average concentrations of the main pollutants, namely sulphur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3), particulate matter of diameter 2.5 (PM2.5) and particulate matter of diameter 10 (PM10). The pollutant with the highest concentration determines the API value. Usually, PM2.5 records the highest concentration compared with the other pollutants.
Air quality data have been recorded in Malaysia since 1996, and this huge amount of data is usually presented in the form of text information. Thus, air quality information is difficult to review, especially for the public. Moreover, the public, especially those in high-risk groups such as asthmatic individuals, children and the elderly, need to be alerted beforehand about cases of poor air quality. Therefore, this study uses a time series approach with classical and modern methods, namely Box-Jenkins and ANN, to address forecast accuracy issues in the presence of missing data. Such forecasts are important for implementing air quality management and public warning strategies for pollution levels that are acceptable to the public.

RESEARCH METHOD

Box-Jenkins method
The Box-Jenkins method, or autoregressive integrated moving average (ARIMA) method, was first introduced by Box and Jenkins [18]. It originates from the autoregressive (AR) model, the moving average (MA) model and differencing of order d, known as the integrated (I) part. The seasonal ARIMA (SARIMA) model is used when seasonal components are included in the ARIMA model. The generalized form of the SARIMA(p, d, q)(P, D, Q)$_S$ model can be written as:
$$\phi_p(B)\,\Phi_P(B^S)\,(1 - B)^d\,(1 - B^S)^D\,y_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t$$
where $B$ is the backward shift operator, $d$ and $D$ are the non-seasonal and seasonal orders of differencing respectively, $\phi_p(B)$ and $\Phi_P(B^S)$ are the non-seasonal and seasonal AR polynomials, $\theta_q(B)$ and $\Theta_Q(B^S)$ are the non-seasonal and seasonal MA polynomials, $S$ is the seasonal period and $\varepsilon_t$ is the error term. The Box-Jenkins procedure contains three main stages to build an ARIMA model: model identification, model estimation and model checking.
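The differencing operators in the SARIMA form can be illustrated with a short sketch. The Python below applies $(1 - B)^d$ and $(1 - B^S)^D$ to a plain list. The function names and the toy series are assumptions made for illustration and are not taken from the study.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag) once: y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d=0, D=0, S=1):
    """Apply (1 - B)^d (1 - B^S)^D to a series (illustrative helper)."""
    out = list(series)
    for _ in range(D):   # seasonal differences
        out = difference(out, lag=S)
    for _ in range(d):   # non-seasonal differences
        out = difference(out, lag=1)
    return out


# Toy series (made-up data): linear trend plus a period-4 seasonal pattern.
season = [0.0, 2.0, -1.0, 3.0]
toy = [0.5 * t + season[t % 4] for t in range(16)]

# One seasonal and one non-seasonal difference remove both components.
stationary = sarima_difference(toy, d=1, D=1, S=4)
```

For the daily API series, the same idea would use S = 365, at the cost of losing the first S + d observations to differencing.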

Artificial neural network
The artificial neural network (ANN) is one of the artificial intelligence approaches and is among the most accurate and widely used forecasting methods. The multi-layer perceptron (MLP), also known as the feedforward neural network (FFNN), is a broadly used ANN approach [10,12]. The perceptron is described in [19] and shown in Figure 1. Each input node in the input layer is forwarded to the neurons with a certain weight [20]. The inputs are processed by a propagation function that sums the weighted values. This sum is compared with a threshold value through the activation function of each neuron. Commonly, the activation function used in the hidden layer is the logistic function, $f(x) = 1/(1 + \exp(-x))$, while the linear function, $f(x) = x$, is used at the output stage. If the input passes a certain threshold, the neuron is activated and transmits its output via the output weights to all neurons connected to it. A constant, or bias in NN jargon, with a value of one is connected to each neuron and to the output [8]. The MLP is trained by backpropagation learning, which can solve more complex problems than single-layer networks and can cope with outliers [21]. This procedure repeatedly modifies the weights on the connection links in the network so that the difference between the actual output and the desired output is minimized. In statistical modelling for time series forecasting, the MLP can be considered a non-linear autoregressive (AR) model. In time series forecasting, the input nodes are the lags of the available historical data, determined from the autoregressive order in the Box-Jenkins model [5,8].
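A single forward pass through such a network can be sketched as follows. This is a minimal illustration, assuming one hidden layer with logistic activation and a linear output. The weights are hypothetical values chosen for the example, not weights trained in the study, and the training step (backpropagation) is omitted.

```python
import math


def logistic(x):
    """Logistic activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forecast(lags, hidden_weights, hidden_bias, output_weights, output_bias):
    """One forward pass: lagged inputs -> logistic hidden layer -> linear output."""
    hidden = []
    for weights, bias in zip(hidden_weights, hidden_bias):
        # Weighted sum of the inputs plus the bias term of this neuron.
        total = sum(w * x for w, x in zip(weights, lags)) + bias
        hidden.append(logistic(total))
    # Linear output activation f(x) = x.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical network: 2 lagged inputs, 2 hidden neurons, 1 output.
y_hat = mlp_forecast(
    lags=[0.2, -0.1],
    hidden_weights=[[0.5, -0.3], [0.1, 0.8]],
    hidden_bias=[0.0, 0.1],
    output_weights=[1.0, -1.0],
    output_bias=0.05,
)
```

Because the output is a non-linear function of past values of the series, the sketch also shows why the MLP can be read as a non-linear AR model.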

Missing values imputation

Decomposition method
The basic idea of the decomposition method is to decompose a problem into sub-problems, an approach used as the solution for various problems and algorithms. For a time series, the sub-problems are the trend, seasonal, cyclical and irregular (error) factors. The estimates of these factors are used to describe the series and can be used to compute point forecasts. The method can be presented in two forms: additive decomposition and multiplicative decomposition. The equations of both forms are given below.
Additive decomposition:
$$y_t = T_t + S_t + C_t + I_t$$
Multiplicative decomposition:
$$y_t = T_t \times S_t \times C_t \times I_t$$
where $T_t$, $S_t$, $C_t$ and $I_t$ are the trend, seasonal, cyclical and irregular factors at time $t$ respectively.
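The additive form suggests one simple way to fill a gap: estimate the trend and seasonal factors from the observed points and use their sum at the missing time. The sketch below is an assumption-laden illustration, not the estimation procedure used in the study. It assumes a linear trend fitted by least squares, seasonal indices taken as average detrended values, and it ignores the cyclical and irregular factors. The function name `impute_additive` is hypothetical.

```python
def impute_additive(series, period):
    """Fill None entries with trend + seasonal estimates (illustrative only).

    Needs at least two observed points at distinct times.
    """
    obs = [(t, y) for t, y in enumerate(series) if y is not None]
    n = len(obs)
    t_mean = sum(t for t, _ in obs) / n
    y_mean = sum(y for _, y in obs) / n
    # Least-squares linear trend fitted on the observed points only.
    slope = sum((t - t_mean) * (y - y_mean) for t, y in obs) / sum(
        (t - t_mean) ** 2 for t, _ in obs
    )
    intercept = y_mean - slope * t_mean
    trend = [intercept + slope * t for t in range(len(series))]
    # Seasonal index: mean detrended value at each position in the cycle.
    seasonal = []
    for k in range(period):
        resid = [y - trend[t] for t, y in obs if t % period == k]
        seasonal.append(sum(resid) / len(resid) if resid else 0.0)
    return [
        y if y is not None else trend[t] + seasonal[t % period]
        for t, y in enumerate(series)
    ]


# Made-up series with one missing value.
filled = impute_additive([1.0, 2.0, None, 4.0, 5.0, 6.0], period=2)
```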

Spatial weighting method
The analysis of missing values using spatial weighting methods involves a target station and selected neighboring stations. Generally, the weighting method formula is given as follows:
$$\hat{Y}_t = \sum_{i=1}^{N} W_i Y_i$$
where $\hat{Y}_t$ is the estimated value of the missing data at the target station $t$ ($i \neq t$), $N$ is the number of neighboring stations, $Y_i$ is the observation at the $i$th neighboring station and $W_i$ is the weight of the $i$th neighboring station, with the constraint $\sum_{i=1}^{N} W_i = 1$.
The arithmetic average (AA) method is the classical way to identify the weights. It assigns an equal weight to each selected neighboring station and can be defined as:
$$\hat{Y}_t = \frac{1}{N} \sum_{i=1}^{N} Y_i$$
The second method is the normal ratio (NR) method, which was first proposed by [22]. The method is based on the ratio of the means of the available data at the target station and at the $i$th neighboring station. It is given as follows:
$$\hat{Y}_t = \frac{1}{N} \sum_{i=1}^{N} \frac{\mu_t}{\mu_i} Y_i$$
where $\mu_t$ and $\mu_i$ are the sample means of the available data at the target station $t$ and the $i$th neighboring station respectively.
In 1992, Young proposed using the correlation between the target station and the neighboring station as the weighting factor [23]. The weight, known as the modified normal ratio based on correlation (MNR), is given as follows:
$$W_i = \frac{r_i^2 \,(n_i - 2)}{1 - r_i^2}$$
where $r_i$ is the correlation coefficient of the daily time series data between the target station and the $i$th neighboring station, and $n_i$ is the length of the data series used to compute the correlation coefficient. The weights are normalized by their sum so that the constraint $\sum_{i=1}^{N} W_i = 1$ holds.
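The three weighting schemes can be compared side by side in a short sketch. The code below is an illustration under stated assumptions: the station histories are made-up numbers, the function and argument names are hypothetical, and the MNR weights are divided by their sum so that they add up to one. A neighbor that is perfectly correlated with the target (r = ±1) would make the MNR weight undefined, so the sketch does not handle that case.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def impute(target_hist, neighbor_hists, neighbor_now, method="AA"):
    """Estimate a missing target value from N neighboring stations."""
    n_st = len(neighbor_now)
    if method == "AA":       # equal weights 1/N
        weights = [1.0 / n_st] * n_st
    elif method == "NR":     # ratio of target mean to neighbor mean, over N
        mu_t = mean(target_hist)
        weights = [mu_t / (n_st * mean(h)) for h in neighbor_hists]
    elif method == "MNR":    # correlation-based weights, normalized to sum to one
        raw = []
        for h in neighbor_hists:
            r = pearson_r(target_hist, h)
            raw.append(r * r * (len(h) - 2) / (1.0 - r * r))
        total = sum(raw)
        weights = [w / total for w in raw]
    else:
        raise ValueError("unknown method: " + method)
    return sum(w * y for w, y in zip(weights, neighbor_now))


# Made-up histories for one target station and two neighbors.
target = [50.0, 60.0, 55.0, 70.0]
neighbors = [[48.0, 62.0, 53.0, 69.0], [40.0, 45.0, 50.0, 42.0]]
today = [60.0, 44.0]  # neighbor readings on the day the target value is missing
estimates = {m: impute(target, neighbors, today, m) for m in ("AA", "NR", "MNR")}
```

In this toy case the first neighbor tracks the target far more closely than the second, so the MNR estimate leans towards the first neighbor's reading while the AA estimate sits midway.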

Error measurement
Let $y_t$ be the actual value, $\hat{y}_t$ the forecast value and $t$ the time. The error is then defined as $e_t = y_t - \hat{y}_t$. The measurements used in this study are the mean absolute error (MAE) and the root mean square error (RMSE). The equations for both measurements are as follows:
$$\text{MAE} = \frac{1}{n} \sum_{t=1}^{n} |e_t|$$
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^2}$$
where $n$ is the number of forecasts. The MAE and RMSE are scale-dependent measures, so neither is suitable for comparing forecasts on different scales. Both measurements are easy to interpret, since the error can be computed directly from the actual and forecast values without involving any unknown parameter that needs to be estimated [24].
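Both measures follow directly from the error definition, as the short sketch below shows. The function names and the sample values are assumptions for illustration only.

```python
import math


def mae(actual, forecast):
    """Mean absolute error: average of |y_t - yhat_t|."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error: square root of the average squared error."""
    return math.sqrt(
        sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)
    )


# Made-up values: the errors are -1, 0, 2 and 0.
actual = [1.0, 2.0, 3.0, 4.0]
forecast = [2.0, 2.0, 1.0, 4.0]
print(mae(actual, forecast))   # 0.75
print(rmse(actual, forecast))  # square root of 1.25
```

The RMSE is never smaller than the MAE, and it penalizes the single large error in this example more heavily.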

RESULTS AND DISCUSSION
The analyses of the API data are presented in this section. The station located in Johor Bahru city was chosen for this study since the city is the capital of Johor state and the second largest metropolitan area in Malaysia. Thus, it is home to a large number of the region's industrial, residential and commercial hotspots. The study divided the data into a training data set and a test data set; the test data set covers the year 2011, with a total of 365 observations, and is used to check the model performance. The missing data were initially estimated using the decomposition method. Figure 2 shows the time series plot for both the training and testing data sets.

Figure 2. Time series plot for daily API in Johor Bahru
From Figure 2, the API series shows yearly seasonality, with the values starting to increase in the middle of the year. The seasonality indicates that the data are nonstationary. Thus, data transformation and both seasonal and nonseasonal differencing were carried out to obtain a stationary series.
The data pre-processing for the ANN method applied to the Johor Bahru daily API data included normalization and scaling to the intervals [0, 1] and [-1, 1]. In the input layer, three sets of input nodes were considered: all lags from the best SARIMA model, SARIMA(3, 1, 3)(0, 1, 1)$_{365}$; the seasonal lags 365 and 730; and lag 1 together with the lags immediately before and after the seasonal lags (1, 364, 365, 366, 729, 730 and 731). Table 1 shows that the smallest RMSE, with a value of 10.57, was obtained using all input lags from the best SARIMA model with the data scaled to [-1, 1].
Missing data, or incomplete data matrices, are a problem repeatedly encountered in many areas, including environmental research. This common and unavoidable problem is caused by unsystematic data storage, instrument malfunctions and station relocation [25]. Missing data can lead to insufficient data sampling and errors in measurements, and can significantly affect the conclusions drawn from the data [26]. Time series analysis requires the data to be continuously available. Thus, an appropriate statistical method is important to solve the missing data problem. The spatial weighting method was used since a number of monitoring stations were available near the selected station. The imputation methods were tested at different distances (100 km, 150 km and 200 km) to assess the methods' sensitivity so that the optimal result could be obtained.
To check the stability of the methods, six different percentages of missing data, ranging from 5% to 30%, were chosen. The performance of the imputation methods was compared using the MAE, where the lowest MAE indicates the best imputation. As shown in Table 2, at the 100 km distance, MNR was the best method for all percentages of missing data. At the 150 km distance, the result varied between the original normal ratio (NR) and the modified normal ratio (MNR). NR gave the best imputation at 5% and 15% missing data, while MNR gave the best imputation at 10%, 20%, 25% and 30% missing data. A consistent result was also obtained at 200 km, where MNR was the best imputation method. Thus, in conclusion, MNR was the best method while decomposition was the worst. The same forecasting process was repeated after the missing values were imputed. The new performance evaluations were given in
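The pre-processing step can be sketched in two parts: scaling the series to a chosen interval and building the lagged input nodes. The code below is a minimal illustration that assumes min-max scaling. The function names and the toy series are hypothetical, and the exact transformation used in the study may differ.

```python
def scale(series, low, high):
    """Min-max scale a series to the interval [low, high]."""
    s_min, s_max = min(series), max(series)
    return [low + (high - low) * (y - s_min) / (s_max - s_min) for y in series]


def lagged_inputs(series, lags):
    """Build (inputs, target) pairs where the inputs are y_{t-lag} for each lag."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        rows.append(([series[t - lag] for lag in lags], series[t]))
    return rows


# Toy series (made-up values) scaled to both intervals used in the study.
toy = [10.0, 20.0, 30.0]
unit = scale(toy, 0.0, 1.0)     # [0.0, 0.5, 1.0]
signed = scale(toy, -1.0, 1.0)  # [-1.0, 0.0, 1.0]

# Input nodes from lags 1 and 3 of a short toy series.
pairs = lagged_inputs([1, 2, 3, 4, 5], lags=[1, 3])
```

With the seasonal lag set (1, 364, 365, 366, 729, 730 and 731), the first 731 observations serve only as inputs, so the number of usable training pairs shrinks accordingly.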

CONCLUSION
This paper addresses the problem of forecasting a large API data set that involves different ranges of API scales and terms used in assessing and describing the effect of air quality status on human health. The objective of the study was to adapt time series methods, both classical and modern, that would yield satisfactory and workable API forecasts in the presence of missing data. The results show that the modern method, the artificial neural network (ANN), gave better forecasting performance than the SARIMA model. In addition, ANN forecasting improved after the missing data were imputed with a highly accurate method. The good performance of the ANN model is explained by its ability to capture complex data patterns that contain both linear and nonlinear components [9]. For future work, this study can be extended with inputs from other pollutants at the target and neighboring stations.