Estimation of water quality index using artificial intelligence approaches and multi-linear regression

Muhammad Sani Gaya, Sani Isah Abba, Aliyu Muhammad Abdu, Abubakar Ibrahim Tukur, Mubarak Auwal Saleh, Parvaneh Esmaili, Norhaliza Abdul Wahab
Department of Electrical Engineering, Kano University of Science & Technology, Wudil, Nigeria
Department of PPD & M, Yusuf Maitama Sule University, Kano, Nigeria
Department of Geography, Kano University of Science and Technology, Wudil, Nigeria
Department of Electrical Engineering, Near East University, Lefkosa, North Cyprus, via Mersin 10, Turkey
Department of Control and Mechatronics, Universiti Teknologi Malaysia, Malaysia


INTRODUCTION
Water quality (WQ) is a central concern for public health, water resources management and environmental protection [1]. The demand of billions of people for clean, safe and adequate freshwater has led practitioners and the research community to engage closely in modeling and monitoring water quality in order to address this universal concern [2]. WQ can be described as the set of physical, chemical and biological characteristics of water, which can be used to determine the extent of water purity.
The water quality index (WQI) is applied worldwide to resolve data management issues and to assess the successes and failures of management strategies for improving WQ [3]. To determine the overall status of WQ, a number of sensitive parameters need to be carefully identified. Since no single variable can sufficiently describe WQ, it is generally assessed by computing a broad range of parameters. As a result, a large data set is generated, which must be presented in a meaningful way to decision makers, local planners and the general public. In view of this, the WQI was developed to convert the large data set into a single index [4].
The deterioration of water quality caused by inadequate sanitation and industrial pollutants, together with the unreliability of most available mechanistic models in yielding promising forecasts, has created a strong need to adopt other techniques and approaches [5]. Different methods have been used to measure and predict water quality in order to reduce the time spent collecting data from large data sets and to classify the quality using machine learning [3]; however, the main issues with machine learning methods are a high susceptibility to error and the acquisition of relevant data sets. Recently, keen interest has developed in studying the broad concept of artificial intelligence and how it relates to traditional models [6]. Several researchers, such as [4][5][6][7][8], have used different neural network approaches in handling WQI. Nevertheless, most of the available models focus on monitoring and analysis of the water quality index. Therefore, this paper centres on estimating the water quality index by comparing artificial intelligence approaches with a conventional method, applied to the Palla station along the Yamuna River, India.
The Artificial Neural Network (ANN) is an AI-based approach that has proved not only effective in handling large data sets and complex nonlinear input-output relationships but also a flexible and powerful computational tool [5][9][10]. The Adaptive Neuro-Fuzzy Inference System (ANFIS), another AI-based model, has been found to be a successful tool that incorporates the Sugeno fuzzy model and derives the benefits of both ANN and fuzzy logic in a single system [11].
The performances of the models were evaluated using commonly used measures. The paper is organized as follows: section 2 describes the research method and section 3 presents results and discussion while section 4 gives the conclusion.

RESEARCH METHOD

Study area
The Yamuna River is the largest tributary of the River Ganga and is as sacred and prominent as the Ganga itself. The Yamuna is 1,376 km long, and almost 57 million residents of northern India rely on it. Its total catchment area is 366,223 km², which comprises 42 percent of the River Ganga basin located in the territory of India. Delhi, the capital territory, receives almost 70 percent of its drinking water from the Yamuna River while discharging almost 10,000 m³/s yearly. However, due to urbanization and inadequate water treatment plants, the river leaves Delhi polluted [12][13]. Figure 1 shows the location of the Palla station along the Yamuna River basin in India. The daily WQ data were obtained from the Central Pollution Control Board (CPCB) for the years 1999 to 2012.

Modelling
In this study, ANN, ANFIS and MLR models were proposed for estimating the WQI of the river. The data set was partitioned into two parts: 70% of the data were used for the calibration phase and the remaining 30% for verification. Selection of the dominant input parameters is one of the most important steps in any AI-based modeling. The functional expressions for the WQI are presented in (1)-(5). MATLAB 9.3 (R2017b) was used for the ANN and ANFIS analysis, while the MLR model was developed using the regression tool of EViews 9.5.
where $WQI_n$ denotes the water quality index and $f$ is a function of dissolved oxygen (DO), pH, biological oxygen demand (BOD), ammonium nitrate (NH4), and water temperature (WT).
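The 70%/30% partition into calibration and verification sets can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name is hypothetical, and keeping the records in their original (chronological) order rather than shuffling them is an assumption, since the text does not say how the split was drawn.

```python
from typing import List, Sequence, Tuple


def split_calibration_verification(
    rows: Sequence, calibration_fraction: float = 0.7
) -> Tuple[List, List]:
    """Split rows, in their given order, into calibration and verification parts.

    Assumption: no shuffling is applied, so for daily records the
    verification part is simply the most recent 30% of the series.
    """
    if not 0.0 < calibration_fraction < 1.0:
        raise ValueError("calibration_fraction must lie strictly between 0 and 1")
    cut = int(round(len(rows) * calibration_fraction))
    return list(rows[:cut]), list(rows[cut:])
```

For example, ten records would be divided into the first seven for calibration and the last three for verification.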

Multilinear regression analysis (MLR)
Multi-linear regression (MLR) models a linear relationship between a dependent variable and two or more independent variables. MLR is based on the least-squares concept, in which the estimated value is expressed as a linear function of the predictors [14], as stated in (6):

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_i x_i \qquad (6)$$

where $y$ is the estimated dependent variable, $x_i$ is the value of the $i$th predictor, $b_0$ is the regression constant, and $b_i$ is the coefficient of the $i$th predictor.
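A least-squares fit of a multi-linear model can be sketched in a few lines. The sketch below is only illustrative: the study used the regression tool of EViews, whereas this code solves the normal equations directly with Gauss-Jordan elimination, and the function names `fit_mlr` and `predict_mlr` are hypothetical.

```python
from typing import List, Sequence


def fit_mlr(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    # Prepend a constant 1 to every row so that b0 is estimated with the slopes.
    rows = [[1.0, *map(float, r)] for r in X]
    p = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; coefficients are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


def predict_mlr(coeffs: Sequence[float], x: Sequence[float]) -> float:
    """Evaluate b0 + b1*x1 + ... + bk*xk for one observation."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))
```

Solving the normal equations is adequate for a handful of predictors such as the five used here; for poorly conditioned inputs a QR-based solver would be the safer choice.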

Artificial neural network (ANN)
ANNs are mathematical models that aim to handle nonlinear relationships in input-output data sets. Historically, they are information processing tools derived by analogy with the biological nervous system of the brain. ANN has proved to be an effective tool for predicting nonlinear systems, is quite capable of handling complex, noisy data sets [15][16], and offers high prediction accuracy [17]. The back propagation (BP) algorithm is the most commonly used technique among ANN approaches. In BP, each input training pattern flows through the network to the output layer; the training error is then computed and propagated backward, and this is repeated until the desired target of the network is achieved [18]. The primary aim of a BP neural network (BPNN) is to reduce the error so that the network learns the training data. The sigmoid function and the Levenberg-Marquardt (LM) algorithm were used as the activation function and the training algorithm, respectively. LM was used to train the multilayer perceptron (MLP) model because of its outstanding performance [19]. Figure 2 shows the structure of the ANN. Before model training, the input and output data were normalized to the range 0 to 1 as:

$$X_i = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where $X_i$ is the normalized quantity, $X$ is the measured quantity, and $X_{\min}$ and $X_{\max}$ are its minimum and maximum values.
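The scaling of each variable to the range 0 to 1 before training can be sketched as below. The function name is hypothetical, and taking the minimum and maximum from the series being scaled is an assumption; the text does not state whether they were taken from the full record or from the calibration data only.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be scaled to the range 0 to 1")
    return [(v - lo) / (hi - lo) for v in values]
```

Each input (for instance DO or pH) and the WQI output would be scaled separately, so that variables measured in different units contribute on a comparable scale.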

Adaptive neuro-fuzzy inference system (ANFIS)
Combining an artificial neural network with a fuzzy system creates a robust hybrid system that can model complex relationships. As one of the AI models, ANFIS can overcome the individual limitations of fuzzy inference and ANN: it combines the abilities of ANN and fuzzy logic to handle complex nonlinear interactions between a set of inputs and outputs [20]. The general structure of ANFIS is shown in Figure 3. For a typical ANFIS, assuming a FIS with two inputs 'x' and 'y' and one output 'f', a first-order Sugeno fuzzy model has the following rules:

Rule 1: if $x$ is $A_1$ and $y$ is $B_1$, then $f_1 = p_1 x + q_1 y + r_1$
Rule 2: if $x$ is $A_2$ and $y$ is $B_2$, then $f_2 = p_2 x + q_2 y + r_2$

The membership function parameters for the x and y inputs are $A_1$, $B_1$, $A_2$, $B_2$, and the outlet function parameters are $p_1$, $q_1$, $r_1$, $p_2$, $q_2$, $r_2$. The formulation and structure of ANFIS follow a five-layer neural network arrangement. For more explanation of ANFIS, refer to the study in [6].
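The forward pass of a two-input, two-rule first-order Sugeno system, laid out along the usual five layers, might look like the sketch below. Several details are assumptions made for illustration: Gaussian membership functions (the study tried both triangular and Gaussian), the product as the rule conjunction, and all parameter values. Training of the parameters is not shown.

```python
import math
from typing import Sequence, Tuple

MF = Tuple[float, float]             # (centre, sigma) of a Gaussian membership function
Rule = Tuple[float, float, float]    # consequent parameters (p, q, r)


def gaussian_mf(v: float, centre: float, sigma: float) -> float:
    """Degree of membership of v in a Gaussian fuzzy set."""
    return math.exp(-0.5 * ((v - centre) / sigma) ** 2)


def sugeno_forward(
    x: float,
    y: float,
    premise: Sequence[Tuple[MF, MF]],
    consequent: Sequence[Rule],
) -> float:
    """Evaluate a first-order Sugeno system with one (A, B) pair per rule."""
    # Layers 1 and 2: fuzzify both inputs and multiply to get firing strengths.
    w = [gaussian_mf(x, *a) * gaussian_mf(y, *b) for a, b in premise]
    total = sum(w)
    if total == 0.0:
        raise ValueError("no rule fires for this input")
    # Layer 3: normalise the firing strengths.
    w_bar = [wi / total for wi in w]
    # Layer 4: linear consequent f = p*x + q*y + r for each rule.
    f = [p * x + q * y + r for p, q, r in consequent]
    # Layer 5: weighted sum gives the crisp output.
    return sum(wb * fi for wb, fi in zip(w_bar, f))
```

When both rules fire equally, the output is simply the average of the two rule consequents, which is a convenient sanity check on an implementation.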

Performance evaluation criteria
The performance efficiency of a model can be assessed through different statistical measures, including the determination coefficient (DC), root mean square error (RMSE) and mean square error (MSE). In this study, DC, RMSE and MSE were employed to evaluate the performance of the ANN, ANFIS and MLR models [21]. The equations for DC and RMSE are given as:

$$DC = 1 - \frac{\sum_{i=1}^{N} (O_i - C_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (O_i - C_i)^2}$$

where $N$, $O_i$, $\bar{O}$ and $C_i$ are the data number, observed data, average value of the observed data and computed values, respectively. DC ranges between $-\infty$ and 1, with a perfect score of 1.
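The three measures can be computed directly from paired observed and computed series. The sketch below assumes the standard definitions: MSE is the mean of the squared errors, RMSE is its square root, and DC is one minus the ratio of the squared-error sum to the sum of squared deviations of the observations from their mean. Function names are illustrative.

```python
import math
from typing import Sequence


def mse(observed: Sequence[float], computed: Sequence[float]) -> float:
    """Mean of the squared differences between observed and computed values."""
    return sum((o - c) ** 2 for o, c in zip(observed, computed)) / len(observed)


def rmse(observed: Sequence[float], computed: Sequence[float]) -> float:
    """Square root of the mean square error."""
    return math.sqrt(mse(observed, computed))


def determination_coefficient(
    observed: Sequence[float], computed: Sequence[float]
) -> float:
    """1 minus (sum of squared errors / sum of squared deviations from the mean)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - c) ** 2 for o, c in zip(observed, computed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A model that always returns the mean of the observations scores DC = 0, a perfect model scores 1, and a model worse than the mean scores below 0, which is why the lower bound is unbounded.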

RESULTS AND ANALYSIS
In this paper, MLR, ANN and ANFIS were used to estimate the WQI at the Palla station on the Yamuna River, and their individual performance accuracies were compared. MATLAB 9.3 (R2017b) was used for the ANN and ANFIS analysis, while the MLR model was developed using the regression tool of EViews 9.5. Different input parameters were employed for the estimation, since appropriate input selection is essential [22]. Pearson and Spearman correlation analyses were performed to choose the input parameters. Five input combinations were trained for each method, based on the number and types of inputs; the models were denoted MLR-I to MLR-V, ANN-I to ANN-V and ANFIS-I to ANFIS-V, indicating models one to five for MLR, ANN and ANFIS, respectively.
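Correlation-based input screening of this kind can be sketched as follows. The code computes Pearson and Spearman coefficients between each candidate input and the target; ranking the candidates by absolute Pearson correlation in `rank_inputs` is an assumption made for illustration, since the text does not describe the exact selection rule, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Linear (product-moment) correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def rank_inputs(candidates: Dict[str, Sequence[float]], target: Sequence[float]) -> List[str]:
    """Order candidate inputs from strongest to weakest absolute Pearson correlation."""
    return sorted(candidates, key=lambda name: abs(pearson(candidates[name], target)), reverse=True)
```

Pearson captures linear association while Spearman captures any monotonic one, so comparing the two for a given input gives a quick hint of whether its relationship with the WQI is nonlinear.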

Result of MLR model
The MLR model was applied as the classical conventional method for modeling the linear interactions of the system. It is often used as the reference model for comparison with nonlinear models. Equation (11) was obtained for the best model to estimate the WQI. Table 1 indicates that the best performing model was MLR-V, which has a total of 5 input variables; that is, the MLR model performed best with the highest number of input variables.
The negative values in the estimation serve no purpose in the modeling of WQI. As shown in Table 1, the MLR performance was satisfactory for the prediction of WQI at Palla, as indicated by the values MSE=0.00131, DC=0.8919 and RMSE=0.03625 in the verification phase. Figure 4 presents the scatter and time series plots of the measured and estimated WQI values for the MLR model in the verification phase. The measured and estimated values were well superposed, and the discrepancies between them were small, which indicates high prediction accuracy.

Result of ANN model
The feed-forward ANN was trained with the Levenberg-Marquardt algorithm and a sigmoid activation function, which is a nonlinear exponential function. It is of paramount importance to select an appropriate number of hidden neurons and network architecture in order to prevent overlearning in the calibration stage. The results of the ANN model are presented in Table 2. The prediction accuracy of ANN was superior to that of the MLR model; the best model for estimating WQI was ANN-II, with MSE, DC and RMSE values of 9.0E-8, 0.9974 and 0.0003, respectively, as shown in Table 2. Figure 5 shows the scatter and time series plots of the measured and estimated WQI values for the ANN model in the verification phase. A comparison of Figures 4 and 5 makes clear that the ANN estimates fit the observations more closely and are more accurate than those of the MLR model. This is also supported by the MSE values of the ANN and MLR models. The robustness of ANN can be attributed to its ability to handle complex and nonlinear systems, unlike the MLR model, which is based on the assumption of a linear input-output relationship.
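To make the role of the sigmoid activation concrete, the forward pass of a feed-forward network with one hidden layer can be sketched as below. The single hidden layer, the linear output neuron and every weight value are assumptions for illustration; the text does not give the architecture, and the Levenberg-Marquardt training that would set the weights is not shown.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(
    x: Sequence[float],
    hidden_weights: Sequence[Sequence[float]],
    hidden_biases: Sequence[float],
    output_weights: Sequence[float],
    output_bias: float,
) -> float:
    """One hidden sigmoid layer followed by a linear output neuron (assumed)."""
    hidden = [
        sigmoid(sum(w * v for w, v in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
```

The number of rows in `hidden_weights` is the number of hidden neurons, which is the quantity the text says must be chosen carefully to avoid overlearning.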

Result of ANFIS model
ANFIS, a hybrid algorithm, was employed with the Takagi-Sugeno-Kang inference system, which works on the basis of several rules and membership functions. In this study, the ANFIS model consisted of five input variables and one output variable in order to estimate the WQI at the Palla station. ANFIS was trained on five different models at the station, and triangular and Gaussian membership functions were tried to find the best model. For the purpose of this research, the notation 5, (2, trimf, constant) indicates a model with 5 input variables, 2 triangular input membership functions and a constant output. Table 3 indicates that the values of MSE, RMSE and DC are 8.41E-6, 0.0029 and 0.9909, respectively. ANFIS-II was found to be the best model, with two input combinations, as shown in Table 3. Despite the superiority of the ANN model over the ANFIS model, the ANFIS model proved reliable in estimating the WQI at the Palla station. Figure 6 shows the scatter and time series plots of the measured and estimated WQI values for the ANFIS model in the verification phase. From Figure 6 it is clear that the ANFIS estimates were closer to the observed WQI values than the MLR estimates.
Comparing Figures 4-6 for the MLR, ANN and ANFIS models, it is clear that the best ANN and ANFIS models improved performance accuracy over MLR by up to 10% in the verification phase. The difference between the ANN and ANFIS accuracies is negligible, and both models outperformed the MLR model in terms of estimation accuracy. The results are further supported by the box plot of the three best models in Figure 7, which demonstrates how close the model estimates are to the observed values. According to the figure, the ANN and ANFIS predictions resembled the observed values. Compared with MLR and ANFIS, ANN had the best fit, because the closer the data points are to the line of best fit, the better the predictions (see the scatter plots). Hence, ANN can be regarded as reliable and superior to ANFIS and MLR for the estimation of WQI.
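The verification-phase figures quoted in the text can be cross-checked with a few lines of arithmetic. The sketch below uses only the MSE, RMSE and DC values reported above for the best MLR, ANN and ANFIS models. Reading the improvement of about 10% as the gain in DC over MLR is an interpretation, not something the text states explicitly.

```python
import math

# Verification-phase values of the best models, as quoted in the text.
reported = {
    "MLR": {"MSE": 0.00131, "RMSE": 0.03625, "DC": 0.8919},
    "ANN": {"MSE": 9.0e-8, "RMSE": 0.0003, "DC": 0.9974},
    "ANFIS": {"MSE": 8.41e-6, "RMSE": 0.0029, "DC": 0.9909},
}


def rmse_squared_matches_mse(model: str, rel_tol: float = 0.01) -> bool:
    """Check that RMSE squared reproduces the MSE to within rounding."""
    entry = reported[model]
    return math.isclose(entry["RMSE"] ** 2, entry["MSE"], rel_tol=rel_tol)


def dc_gain_over(baseline: str, model: str) -> float:
    """Difference in determination coefficient between a model and a baseline."""
    return reported[model]["DC"] - reported[baseline]["DC"]
```

Under this reading, the DC gain over MLR is 0.1055 for ANN and 0.0990 for ANFIS, and the squared RMSE of each model agrees with its reported MSE to within rounding.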

CONCLUSION
This paper has presented MLR, ANN and ANFIS models for the estimation of the water quality index (WQI), with the water quality variables as inputs. The results indicated that the artificial intelligence-based models (ANN and ANFIS) outperformed the conventional model (MLR) by up to 10% in the verification phase. The AI models were able to follow the trajectories of the observed water quality index accurately. Although the performance of ANN is slightly better than that of ANFIS, both models outperformed the MLR model in estimating the WQI, and both are more reliable for the estimation of WQI at the Palla station of the Yamuna River, India. To increase the accuracy of the models, address their uncertainty problems and explore the contribution of each input combination, further research should employ more AI-based models in the estimation of WQI. The intelligent models could serve as reliable and useful tools for estimating the water quality index of the river.