Comparison of daily rainfall forecasting using multilayer perceptron neural network model

INTRODUCTION
In 2017, about half of Pulau Pinang was submerged by floodwaters, both on the island and on the mainland. The floods caused extensive damage, and fatalities in some areas. A suitable method for long-term rainfall prediction is therefore needed. Such a method would help the authorities and other parties to be well prepared and to plan measures against these water-related problems.
Rainfall forecasting is also very important in agriculture, where it supports decision making and strategic planning. The ability to forecast rainfall quantitatively can inform crop planting decisions, reservoir water resource allocation, traffic control, the operation of sewer systems, and the response to water-related problems such as floods and droughts [1].
Previous researchers have shown increasing interest in developing time series models from rainfall data. Several studies have attempted to forecast rainfall using a variety of techniques and methods that can produce a well-developed model. Forecasting method has become very popular

RESEARCH METHOD
Several methods can be used to forecast rainfall data. In this study, two were applied: the ARIMA model and the ANN model. First, the data must undergo pre-processing before any time series method is applied. The pre-processing techniques are data normalization, data lagging and data splitting.
Data normalization is a pre-processing method used to improve the precision of the model's forecasts. In this study, the data were scaled to the range [0, 1], because the sigmoid (logistic) activation function was used. Data lagging determines the input nodes of the network, which is organized into input nodes, hidden layers and output nodes: lagged observations of the series serve as the input variables. In this research, the input variables were constructed by trial and error.
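The scaling to [0, 1] can be illustrated with a minimal sketch. It assumes plain min-max scaling, which the paper does not name explicitly; the function names and the rainfall values are hypothetical and used only for illustration.

```python
def min_max_normalize(values):
    """Scale values linearly into [0, 1]; also return the (min, max) bounds."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled, (lo, hi)


def denormalize(scaled, bounds):
    """Map scaled values back to the original units (mm)."""
    lo, hi = bounds
    return [s * (hi - lo) + lo for s in scaled]


# Made-up daily rainfall totals in mm (not taken from the study data).
rain_mm = [0.0, 12.5, 3.0, 50.0, 0.0, 25.0]
scaled, bounds = min_max_normalize(rain_mm)
restored = denormalize(scaled, bounds)
```

Keeping the bounds makes it possible to convert the network output back to millimetres before the error measures are computed.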
The data need to be split into a training set and a testing set. The training set receives the larger share, typically 90% versus 10%, 80% versus 20% or 70% versus 30% [27]. In this study, the data were split 80% to 20%, giving 876 training observations and 220 testing observations. In addition, model assumptions were checked in a preliminary test of the stationarity and normality of the data for forecasting purposes.
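The 80% to 20% split can be checked with a short sketch. It assumes the split is chronological, with the earliest observations used for training, which is usual for time series but is not stated in the paper. The function name is hypothetical, and the series is a placeholder with one value per day from January 2016 to December 2018.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a series into a leading training part and a trailing testing part."""
    n_train = int(len(series) * train_fraction)
    return series[:n_train], series[n_train:]


# 2016 is a leap year, so three years of daily data give 1096 observations.
n_days = 366 + 365 + 365
placeholder_series = list(range(n_days))  # stand-in for the rainfall values

train, test = chronological_split(placeholder_series)
```

With 1096 daily observations this reproduces the 876 training and 220 testing observations reported above.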

Study area
For this analysis, daily rainfall data from station 5204048, Simpang Ampat, Pulau Pinang were used as a case study. The data cover 3 years, from January 2016 to December 2018. They consist of daily rainfall totals (in mm), the minimum, maximum and total rainfall per month, and the annual rainfall. The daily rainfall totals were used as the variable in this research. The data then underwent normalization, lagging and splitting. Figure 1 shows the map of the study area.

ARIMA model
The ARIMA model, introduced by Box and Jenkins, is one of the most popular forecasting methods in research and practice. It is generally written as ARIMA(p, d, q), where p is the order of the AR component, d is the number of times the series has been differenced and q is the order of the MA component; all three are non-negative integers [28]. The Box-Jenkins procedure involves model identification, parameter estimation and model diagnostic checking [29].
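For reference, the ARIMA(p, d, q) model is commonly written in backshift notation as below. The symbols B (backshift operator), φ_i, θ_j and ε_t (white-noise error) are not defined in this paper and are introduced here as assumptions. The constant term is omitted, and the sign convention for the MA polynomial differs between textbooks.

```latex
\phi(B)\,(1 - B)^{d}\, y_t = \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^{i},
\qquad
\theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^{j}
```

Under this notation, ARIMA(3,1,1) has three AR coefficients, one MA coefficient and a single non-seasonal difference.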
The first step, model identification, is to determine whether the time series is stationary. If the data are not stationary, non-seasonal differencing can be applied to make them stationary. The candidate models are then identified from the autocorrelation function (ACF) plot and the partial autocorrelation function (PACF) plot. The guidelines for model identification are shown in Table 1. The second step is to estimate the parameters of the model that has been selected. Parameters that are judged significantly different from zero are retained in the fitted model, while parameters that are not significant are dropped from the model. Last but not least, the adequacy of the model must be checked in the model diagnostic checking step. The Box-Ljung test is used to test for lack of fit of a time series model, that is, whether the residuals are correlated or uncorrelated after an ARIMA model has been fitted to the data. The model with the smallest p-values for the estimated parameters and the highest p-value for the Box-Ljung test was chosen. The model was also selected using the Akaike Information Criterion (AIC): the model with the smallest AIC value is chosen as the adequate model.
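The differencing, autocorrelation and Box-Ljung steps can be sketched in plain Python. This is an illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the standard sample ACF and Ljung-Box Q formulas are assumed, and the p-value (which requires the chi-square distribution) is left out.

```python
def difference(series, d=1):
    """Apply non-seasonal differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(x * x for x in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic: n(n+2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    acf = sample_acf(residuals, max_lag)
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(acf, start=1))


# Toy values only: a trending series and a strongly alternating "residual" series.
first_diff = difference([1, 3, 6, 10])
second_diff = difference([1, 3, 6, 10], d=2)
alternating = [1, -1, 1, -1, 1, -1]
lag1_acf = sample_acf(alternating, 1)[0]
q_stat = ljung_box_q(alternating, 1)
```

A large Q (small p-value) suggests that the residuals are still correlated, which is why the model with the highest Box-Ljung p-value is preferred.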

ANN model
ANN is another method that can be applied in time series analysis and is widely used by researchers. Several elements must be specified in order to forecast successfully with a neural network model: the network architecture, the learning algorithm and the activation functions. Data normalization is usually performed before the ANN model is trained. The input data need to be normalized according to the activation function used, which helps to minimize the error of the model.
In this study, the MLP structure consists of three layers: an input layer, a hidden layer and an output layer. A total of 35 MLP network models were developed using the daily rainfall data. The input nodes were determined by the data lagging technique, which was applied to evaluate the forecasting performance and capability of the models in detail. The lagged observations were generated by trial and error over the input variables. The number of hidden layer nodes ranged from 2 to 10, based on previous research and on trial and error. The output layer has a single node.
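The data lagging technique can be illustrated as follows. The sketch assumes that each input vector holds the n most recent observations and that the target is the next observation; the paper does not give the exact construction, and the function name and toy series are hypothetical.

```python
def make_lagged_samples(series, n_lags):
    """Build (inputs, targets) pairs where each input is the previous n_lags values."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(list(series[t - n_lags:t]))  # values at t-n_lags .. t-1
        targets.append(series[t])                  # value to forecast at t
    return inputs, targets


# Toy series; with n_lags=3 the network would have three input nodes.
toy_series = [1, 2, 3, 4, 5, 6, 7, 8]
lag_inputs, lag_targets = make_lagged_samples(toy_series, n_lags=3)
```

Changing n_lags changes the number of input nodes, which is how different input configurations can be tried by trial and error.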
The network applied to the daily rainfall data was an MLP with input, hidden and output layers. The models were trained with the gradient descent back-propagation algorithm, which has two parameters: the learning rate (lr) and the momentum coefficient (mc). The parameters were determined from previous research and by trial and error. The neural network model was trained with lr = 0.3, mc = 0.2 and 1000 training epochs.
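The role of lr and mc can be seen in a one-weight sketch of gradient descent with momentum. The update form used here (new step = -lr × gradient + mc × previous step) is one common textbook variant and is an assumption, since the paper does not state the exact rule and some implementations scale the gradient term differently. The quadratic loss is a toy stand-in for the network error.

```python
def momentum_step(weight, gradient, prev_delta, lr=0.3, mc=0.2):
    """One gradient descent update with momentum; returns (new_weight, delta)."""
    delta = -lr * gradient + mc * prev_delta
    return weight + delta, delta


# Toy loss (w - 2)^2 with gradient 2 * (w - 2); its minimum is at w = 2.
w, delta = 0.0, 0.0
for epoch in range(1000):  # same number of epochs as reported in the study
    grad = 2.0 * (w - 2.0)
    w, delta = momentum_step(w, grad, delta)

# A single step from w = 0, for inspection.
w_first, delta_first = momentum_step(0.0, -4.0, 0.0)
```

The momentum term reuses part of the previous step, which tends to smooth the weight updates from one epoch to the next.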
Two activation functions were needed to link the neurons. In this study, the sigmoid (logistic) activation function was used for the hidden layer; it keeps the hidden layer outputs within the range 0 to 1. A linear activation function was used at the output layer; because only one result is generated at the output layer, the use of a linear function is acceptable. The two functions are as follows:

$$f(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

$$f(x) = x \tag{2}$$

where x is the input value.
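A forward pass through such a network can be sketched as below, with a sigmoid hidden layer and a linear output node. The 6-4-1 shape follows the ANN (6,4,1) model reported in the results. The weights, biases and inputs are placeholders rather than fitted values, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic activation; output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with sigmoid activation, one linear output node."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    # Linear output: the weighted sum is passed through unchanged.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Placeholder 6-4-1 network: zero hidden weights make every hidden output 0.5.
lagged_inputs = [0.1] * 6
hidden_w = [[0.0] * 6 for _ in range(4)]
hidden_b = [0.0] * 4
output_w = [1.0] * 4
forecast = mlp_forward(lagged_inputs, hidden_w, hidden_b, output_w, output_bias=0.5)
```

Because the output node is linear, the forecast is not confined to the 0 to 1 range of the hidden layer.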

Forecasting performance measurement
The performance of the models is evaluated from the difference between the observed rainfall data and the rainfall data generated by the model. According to [30][31][32], several performance evaluation methods can be used for hydrological forecasting models. In this study, forecasting performance is evaluated using the Mean Absolute Error (MAE), Mean Forecast Error (MFE), Root Mean Square Error (RMSE) and coefficient of determination (R²). The forecasting model with the smallest values of MAE, MFE (in magnitude) and RMSE was appointed as the best model for forecasting. In addition, the value of R², which lies between 0 and 1, shows how well the model fits the data. The formulas are shown below:

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|$$

$$\mathrm{MFE} = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}$$

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

$$R^2 = r^2$$

where $y_t$ is the observed value at period t; $\hat{y}_t$ is the predicted value at period t; n is the number of periods used in the calculation; r is the correlation coefficient; and R² is the coefficient of determination. In the formula for r, n is the number of pairs of data, x is the observed value of rainfall data and y is the predicted value of rainfall data.
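The four measures named above can be computed as in the following sketch. Two assumptions are made: MFE is taken as observed minus predicted, so that a positive value means under-forecasting, and R² is taken as the square of the Pearson correlation between observed and predicted values. The function names and the sample values are hypothetical.

```python
import math


def mae(observed, predicted):
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mfe(observed, predicted):
    """Mean Forecast Error; positive means the model under-forecasts."""
    return sum(o - p for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root Mean Square Error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(observed)
    sx, sy = sum(observed), sum(predicted)
    sxy = sum(o * p for o, p in zip(observed, predicted))
    sxx = sum(o * o for o in observed)
    syy = sum(p * p for p in predicted)
    r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return r * r


# Made-up values, not results from the study.
obs = [2.0, 4.0, 6.0, 8.0]
pred = [1.0, 5.0, 5.0, 9.0]
scores = {
    "MAE": mae(obs, pred),
    "MFE": mfe(obs, pred),
    "RMSE": rmse(obs, pred),
    "R2": r_squared(obs, pred),
}
```

In this toy case the positive and negative errors cancel, so MFE is zero even though MAE and RMSE are not, which is why MFE is read as a measure of bias rather than of overall accuracy.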

RESULTS AND DISCUSSION
The daily rainfall data sets were successfully forecasted using both the ARIMA and ANN models. For ARIMA (3,1,1), the p-value of each parameter is significant, that is, less than the significance level of α = 0.05. Moreover, ARIMA (3,1,1) has the smallest AIC value, 11060.89, and the largest Box-Ljung test p-value, 0.2313, and was therefore selected as the best ARIMA model for forecasting daily rainfall in Simpang Ampat, Pulau Pinang. The ANN model is also capable of forecasting efficiently, because it performs nonlinear modelling of the rainfall data. The ANN structure consists of seven input nodes, two to ten hidden layer nodes and one output node. A feed-forward back-propagation neural network was developed and trained with the gradient descent back-propagation algorithm, with the sigmoid (logistic) activation function for the hidden layer and a linear activation function for the output layer. The ANN (6,4,1) model was trained and tested.
Next, forecasting accuracy was measured from the observed and predicted values of the models. The ARIMA and ANN models were evaluated and compared to see which method is the more appropriate tool for forecasting daily rainfall data. The accuracy results differ between the ARIMA and ANN models. The level of accuracy is measured by the MAE, MFE, RMSE and R² criteria.
The analysis shows that both models can serve as tools for forecasting daily rainfall data. The model with the smallest MAE, MFE and RMSE and the highest R² was appointed as the best forecasting model. For MFE, a value larger than 0 indicates that the model under-forecasts, and a value less than 0 indicates that it over-forecasts. The MFE values of the ARIMA model for training and testing were 2.9465 and -3.3436, which indicates that the model under-forecasts on the training set and tends to over-forecast on the testing set. In other words, the ARIMA forecasts are low relative to the observed rainfall in the training set and high relative to the observed rainfall in the testing set. The ANN model, however, under-forecasts on both the training and testing sets, with a mean forecast error of 13.1511 in training and 2.2188 in testing, so its forecasts are low relative to the observed rainfall. The two models were also compared on RMSE, which measures the variation between the observed values and the predicted values. The smaller the RMSE, the more accurate the forecast. The RMSE of the ANN model is lower than that of the ARIMA model, which means that the ANN model provides more accurate forecasts of daily rainfall data.
The coefficient of determination, R², was also measured and compared. It gives the percentage of the variation in one variable that is predicted from the other variable. The value of R² lies between 0 and 1 and denotes the strength of the linear association between x and y. In addition, it represents the percentage of the data that is closest to the line of best fit. The closer its value is to 1, the better the fit or relationship between the variables. For the ANN model, R² is high for both the training and testing sets, at 0.8227 and 0.9432. The ANN model shows a better fit and a positive relationship between the variables compared with the ARIMA model. Hence, the ANN model outperforms the ARIMA model and is shown to be suitable for modelling and forecasting daily rainfall data. Figure 2 displays a comparison of the ARIMA model and the ANN model.

CONCLUSION
Based on the results obtained, the two models were compared and evaluated in order to find the best model for forecasting rainfall data. The purpose of the comparison was to find the most suitable method for forecasting daily rainfall data, namely the one with the smallest error measures. Three error measures were used to evaluate the accuracy of the models: MAE, MFE and RMSE. The smaller the error, the more accurate the forecasts. Moreover, the coefficient of determination, R², was measured to inspect the fit or relationship between the variables. The model with the lowest error and the R² closest to 1 is selected as the best model.
The ANN (6,4,1) model has an MFE of 2.2188, which is smaller in magnitude than that of ARIMA (3,1,1). It shows that the ANN model under-forecasts: its forecasts are lower than the observed daily rainfall by 2.2188 mm on average. Meanwhile, the MFE for ARIMA (3,1,1) has the value of