Prediction of the effects of environmental factors towards COVID-19 outbreak using AI-based models

Elucidating the effects of environmental factors on the spread of the novel coronavirus disease (COVID-19) is vital. This methodological study compares three models, namely artificial neural networks (ANN), the adaptive neuro-fuzzy inference system (ANFIS) and a classical multiple linear regression (MLR) model, for determining the relationship between COVID-19 spread and environmental factors (temperature, humidity and wind). The data were obtained from a study of confirmed COVID-19 patients in Wuhan, China (Pirouz, Haghshenas, Haghshenas, & Piro, 2020), with temperature, humidity and wind as the independent variables. The measured and predicted results were compared using three performance indices: root mean square error (RMSE), determination coefficient (R2) and correlation coefficient (R). The results showed that ANFIS and ANN are more promising than the classical MLR model, with average R values of 0.90 in both the calibration and verification stages. ANFIS outperformed both MLR and ANN, improving on their performance by up to 25% and 9%, respectively, based on the determination coefficient for the prediction of confirmed COVID-19 cases in Wuhan city, China. Overall, the results demonstrate the reliability and ability of the AI-based models (ANFIS and ANN) for simulating COVID-19 from the effects of various environmental variables.


INTRODUCTION
The novel coronavirus (SARS-CoV-2), the cause of COVID-19, is an enveloped RNA virus that spreads extensively among people, birds and other mammals and causes respiratory, enteric and hepatic disease. COVID-19 is an emerging pandemic infectious disease of global public health concern [1]. COVID-19 first appeared in December 2019 in Wuhan city, China, where the first 4 cases in the world were reported; these may have been connected to the Huanan (Southern China) seafood wholesale market. Moreover, cases of person-to-person transmission, including asymptomatic ones, were found to be increasing rapidly [2]. What began as a regional epidemic quickly grew into a global pandemic. COVID-19 has affected 212 countries around the world, with high morbidity and mortality and more than 3,700,000 people infected [3].
The mode of transmission of COVID-19 may be the same as for other respiratory diseases, which are transmitted through droplets of various sizes. A study of 75,465 subjects showed that COVID-19 infection is transmitted among individuals mainly through droplet and contact routes and not through airborne transmission [4], although other modes of transmission might be possible. Researchers are devoting their time and skills to understanding the mechanism and modes of transmission of this virus [5]. In early 2020, the disease spread quickly around the world. In light of the potential danger of this pandemic, researchers and medical experts have been doing their best to understand this new infection and its pathophysiology, to identify likely treatment regimens and to find efficient therapeutic agents as well as a vaccine [6].
AI-based models have already been applied to disease prediction. For instance, Çolak et al. performed a retrospective case-control study of coronary artery disease (CAD). A total of 124 patients identified as having CAD by coronary angiography (at least 1 coronary stenosis > 50% in the major epicardial arteries) were enrolled in the study, and 113 subjects (group 2) with angiographically normal coronary arteries were taken as controls. A multilayer perceptron artificial neural network (MLP-ANN) architecture was applied. The ANN models, trained with various learning algorithms, were run on 237 records divided into training (n=171) and testing (n=66) data sets. Prediction performance was assessed by sensitivity, specificity and accuracy according to their standard definitions. The outcomes showed that ANN models trained with eight different learning algorithms are promising because of their high (greater than 71%) sensitivity, specificity and accuracy in the prediction of CAD. For training, accuracy, sensitivity and specificity ranged between 83.63%-100%, 86.46%-100% and 74.67%-100%, respectively. For testing, the values were over 71% for sensitivity, 76% for specificity and 81% for accuracy [7]. Also, Fazilic et al. reported the application of ANFIS, which has been tested in several disease-prediction studies, to the prediction of dermatological diseases [8].
The environment in which the COVID-19 virus is suspended can significantly influence its survival and transmission. Although it has been shown that this respiratory virus is transmitted from human to human, either by inhaling the aerosols sneezed or coughed out by an infected person or by touching contaminated surfaces so that the droplets pass through the eyes, mouth or nose (the T zone), the ability of the virus to survive on various surfaces differs greatly [9]. As a general concept, viruses, including the coronavirus, have been shown to live for a long period on objects outside the body of the host organism. At room temperature the virus can survive for days, while at higher temperatures it survives for a much shorter period. COVID-19 can survive for hours on sterile surfaces, aluminum or surgical gloves, which increases the probability of infection via contact. Exhaled droplets can remain as aerosols for some time, thereby enabling human-to-human transmission at a distance through the movement of air (wind effect). Transmission by the fecal route might also be possible, since some fecal samples of infected individuals have tested positive for the virus. One study shows that the virus can survive for 4 days in stool and can remain infectious in sewage and water for weeks, which suggests the need for further investigation of the role of contaminated sewage in the transmission of the disease [10]. For about 5 to 6 decades, numerous publications in the technical literature have described the effect of environmental factors such as temperature, relative humidity and wind on the survival and spread of viral agents. The survival and transmission of an airborne infection depend on the dissemination of the virus from the index person and its transfer to a secondary host.
During this journey, environmental factors play a major role in the transmission chain [11]. For instance, Pirouz et al. reported that artificial intelligence and regression analysis are strong and reliable tools for predicting the novel COVID-19 outbreak from various environmental factors [1]. To the best of our knowledge, this is the first study in the literature to show the combined application of ANN and ANFIS to the prediction of the COVID-19 outbreak using various environmental factors.
One of the major reasons for applying these models is that, given the dynamic nature of the measured data, a single model may not be enough to produce a consistent predictive approach. This makes it necessary for modellers to develop efficient and stronger models with the help of the existing data in hand. According to the studies established in the literature, traditional regression methods have been the most widely employed approaches, and they have lower precision and sensitivity. This brings about the need to develop robust non-linear AI-based techniques [12].
This work aims to compare two non-linear models (ANN and ANFIS) with a classical linear model (MLR) for predicting the outbreak of COVID-19 in Wuhan city, China, using environmental factors such as wind, temperature and humidity as the input variables.

MATERIAL AND METHOD

Instrumentation

Humidity
Humidity was measured with a digital hygrometer and a built-in position sensor. The sensor comprises a polyimide film spin-coated onto a Si substrate. The sensor model used for signal processing describes the sensor capacitance as a function of relative humidity and temperature [13].

Temperature
Temperature was measured with a thermometer, an instrument that can measure the temperature of liquids, solids and gases; in this study it was used to measure air temperature [14].

Wind
Wind speed was measured with an anemometer, a type of weather instrument used to measure wind speed and direction [15].

Proposed methodology
In this study, the data were taken from the historical experimental results of a previous study [1] in order to predict the outbreak of the novel COVID-19 disease from different environmental factors using linear and non-linear models (ANN, ANFIS and MLR). The data of [1] were collected to identify the relationship between temperature, humidity and wind speed and the confirmed cases of coronavirus.
Five variables were used as inputs: maximum temperature, minimum temperature, average temperature, average humidity and wind speed (kilometers per hour). The number of confirmed cases is the output parameter. The data were collected over a period of 30 days and comprise 64 instances for each variable. Figure 1 shows the flowchart of the AI-based models and experimental methods applied. The flowchart summarizes the overall study, starting with the experimental analysis that determines the values of temperature, humidity and wind as the input variables. As shown in the flowchart, an instrumentalist checks the results to determine whether there is an instrumental error. The study then proceeds with the data-driven approach through pre-processing, which involves preliminary data analysis such as correlation analysis and statistical analysis. The models are then applied and their predictive performance is evaluated, as shown in Figure 1.

Artificial neural networks (ANN)
Artificial neural networks (ANNs) are computational data-driven models that emulate how the human brain interprets and translates information. They are composed of interconnected processing units (neurons) with adaptable weights and biases [16]. According to the technical literature, ANNs are information-processing systems designed like the human brain, whose basic unit is known as a node (neuron) [17]. The current study employs the backpropagation algorithm.
Backpropagation is used to determine the error, which is calculated as the difference between the simulated values and the measured values. The general equation is given in (1).
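For reference, a standard feedforward formulation is sketched below. It is an assumed textbook form, not necessarily the exact expression the authors label (1). The symbols o_j, w_{ji}, b_j, f, \eta, N, y_k and \hat{y}_k are notation introduced here for illustration: o_j is the output of neuron j, x_i its inputs, y_k the measured values and \hat{y}_k the simulated values. The update rule shown is plain gradient descent, the simplest backpropagation variant.

```latex
% Output of neuron j with activation function f
o_j = f\!\left(\sum_{i=1}^{n} w_{ji}\, x_i + b_j\right)

% Error between measured values y_k and simulated values \hat{y}_k
E = \frac{1}{2}\sum_{k=1}^{N}\left(y_k - \hat{y}_k\right)^2

% Gradient-descent weight update with learning rate \eta
w_{ji} \leftarrow w_{ji} - \eta\,\frac{\partial E}{\partial w_{ji}}
```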

Adaptive neuro-fuzzy inference system (ANFIS)
ANFIS has proved to be a successful technique that incorporates the Sugeno fuzzy model and combines the benefits of fuzzy logic and ANN in one system. ANFIS has recently been used for predicting and modelling complex data sets [18]. ANFIS is also regarded as a universal estimator because of its capacity to approximate real functions. Fuzzy logic converts the input data into fuzzy values by applying membership functions; the resulting values range between 0 and 1 [19]. Furthermore, in the ANFIS model the nodes work as membership functions (MFs), which allows the relations between the inputs and the output to be modelled.
Assuming the FIS contains two inputs 'x' and 'y' and one output 'f', a first-order Sugeno fuzzy model has the following rules.
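The usual textbook form of the two-rule, first-order Sugeno system is sketched below as an assumption; the authors' own equations may use different symbols. Here A_i and B_i denote fuzzy sets with membership functions \mu_{A_i} and \mu_{B_i}, p_i, q_i and r_i are the consequent parameters, and w_i is the firing strength of rule i. All of these names are introduced for illustration.

```latex
% Rule base of a first-order Sugeno model with two inputs
\text{Rule 1: if } x \text{ is } A_1 \text{ and } y \text{ is } B_1
  \text{ then } f_1 = p_1 x + q_1 y + r_1
\text{Rule 2: if } x \text{ is } A_2 \text{ and } y \text{ is } B_2
  \text{ then } f_2 = p_2 x + q_2 y + r_2

% Firing strength of each rule (product of membership degrees)
w_i = \mu_{A_i}(x)\,\mu_{B_i}(y), \qquad i = 1, 2

% Overall output: firing-strength-weighted average of the rule outputs
f = \frac{w_1 f_1 + w_2 f_2}{w_1 + w_2}
```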

Multiple linear regression (MLR)
This is one of the simplest classical methods used for prediction in engineering, science, health science and the social sciences. Linear regression is generally classified into two main groups: simple and multiple linear regression. Either can be used depending on the aim. If the study involves a single input and a single output variable, it is known as simple linear regression (SLR). If the relation between two or more inputs and a single output is of interest, it is a multiple linear regression (MLR) [20]. MLR is the most widely used type of linear regression, and it relates every value of each input parameter to the output [21]. Generally, this technique estimates the level of correlation between a single response (dependent) variable and two or more predictors (independent variables) [22].
The general equation of MLR is given in (4):
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_i x_i \tag{4}$$
where y is the response variable, x_i is the value of the ith predictor, b_0 is the regression constant, and b_i is the coefficient of the ith predictor.
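The regression constant and coefficients of an MLR model are typically estimated by ordinary least squares. The sketch below illustrates this with NumPy. The function names `fit_mlr` and `predict_mlr` and the synthetic data are hypothetical and chosen only for illustration; they are not the study's data, nor the EViews procedure the authors used.

```python
import numpy as np

def fit_mlr(X, y):
    """Estimate b_0 and the predictor coefficients by ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the constant b_0.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # coef[0] is b_0, coef[1:] are the predictor coefficients

def predict_mlr(coef, X):
    """Evaluate y = b_0 + b_1*x_1 + ... for each row of X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]

# Synthetic, noise-free illustration: 64 instances of 5 predictors
# (hypothetical values, NOT the Wuhan data set).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 30.0, size=(64, 5))
true_coef = np.array([10.0, -1.5, 0.5, -0.8, 0.3, 0.1])
y_demo = true_coef[0] + X_demo @ true_coef[1:]

coef_demo = fit_mlr(X_demo, y_demo)
```

Because the synthetic response is an exact linear function of the predictors, the fitted coefficients recover `true_coef` up to floating-point error; with real, noisy data the fit would only approximate the underlying relationship.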

Model validation
Int J Artif Intell ISSN: 2252-8938 Prediction of the effects of environmental factors towards COVID-19 outbreak… (Khalid Mahmoud)
In any computational data-driven approach, the basic aim is to fit the models to a given data set, based on the chosen indicators, in order to produce a satisfactory prediction of an unknown data set [1]. Because of issues such as overfitting, a reliable calibration performance does not always agree with the verification performance. There are different classes of validation, the most popular being cross-validation. K-fold cross-validation is employed in this study. In this method, the data are divided randomly into two different sets, the calibration and the verification sets [2]. Among the advantages of this approach is that in each round the training and validation data sets are independent of each other, which leads to higher performance accuracy [23]. The data were divided into 75% for the calibration (training) stage and 25% for the verification (testing) stage, corresponding to 4-fold cross-validation. It is important to note that other validation methods can also be applied to the data set [24].
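The 75%/25% split under 4-fold cross-validation can be sketched as follows. This is a minimal illustration, assuming a random shuffle with a fixed seed; the function name `k_fold_indices` and the seed are hypothetical, and the authors' exact partitioning procedure is not stated in the text.

```python
import numpy as np

def k_fold_indices(n_samples, k=4, seed=0):
    """Return k (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle the instance indices once
    folds = np.array_split(order, k)     # k nearly equal, disjoint folds
    splits = []
    for i in range(k):
        test_idx = folds[i]              # one fold held out for verification
        train_idx = np.concatenate(      # remaining folds used for calibration
            [folds[j] for j in range(k) if j != i]
        )
        splits.append((train_idx, test_idx))
    return splits

# With 64 instances and k=4, each round calibrates on 48 instances (75%)
# and verifies on 16 instances (25%).
splits = k_fold_indices(64, k=4)
```

Each instance appears in exactly one verification fold, so every round evaluates the model on data it did not see during calibration.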

Models development
The simulations for these models were done in MATLAB 9.3 (R2019a). For the ANN model, the Levenberg-Marquardt algorithm was used with 1,000 iterations, a momentum coefficient of 0.9, a learning rate of 0.01 and an MSE of 0.0001. The best model architecture was selected by trial and error. In the ANFIS modelling, different numbers of epochs and different membership functions (MFs) were tried in order to identify a suitable model architecture. The deterministic linear MLR model was developed using the simulation tool in EViews 9.5.

Results of the data-driven approaches
The AI-based models (ANFIS and ANN) and the linear MLR model were employed to predict the effects of environmental factors on the COVID-19 outbreak in Wuhan city, China, based on historical data. Prior to the modelling, statistical and correlation analyses were conducted, as shown in Table 1, in order to understand the behavior of the historical data and the relationships among the variables, especially between the dependent (output) and independent (input) variables. These analyses allow the data to be well understood before proceeding to the simulation. The statistical analysis covers the mean, median, standard deviation, minimum and maximum of each variable involved in this study, as shown in Table 1.
Table 1 shows a comparatively strong inverse correlation between Max T °C and the confirmed cases, with R=-0.53401. This is consistent with the hypothesis proposed by various scientists and medical experts that the virus is less likely to survive at higher temperatures. Table 1 also shows a moderate, direct relationship between average humidity and the confirmed cases, with R=0.382997. The weakest correlation is between wind and the confirmed cases, with R=0.072453.
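Correlation values of this kind are commonly computed as the Pearson coefficient between each input variable and the output. The sketch below shows the calculation under that assumption; the function name `pearson_r` and the five-point demo series are hypothetical and are not the values behind Table 1.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # center both series on their means
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical demo series (NOT the study's data): cases fall as the
# maximum temperature rises, so the coefficient is negative.
max_temp_demo = [5.0, 7.0, 9.0, 11.0, 13.0]
cases_demo = [50.0, 44.0, 41.0, 30.0, 28.0]
r_demo = pearson_r(max_temp_demo, cases_demo)
```

A negative coefficient indicates an inverse relationship, a positive one a direct relationship, and a value near zero a weak linear association.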
Based on the comparative prediction results, the AI-based models (ANN and ANFIS) clearly show superiority over the traditional linear regression model (MLR). Table 2 indicates that all three models perform well in the training phase, with the lowest R2 value being 0.9076, although the MLR performance in the testing phase is not reliable owing to its lower determination coefficient and higher root mean square error. Further inspection of the results shows that ANFIS outperformed the other two models and increased the performance accuracy of MLR and ANN by up to 25% and 9%, respectively, based on their root mean square error, as shown in Figure 2.
Figure 2. Root mean square error of the models in both training and testing phases
Moreover, the predictive results were also illustrated with scatter plots that display the relationship between the measured and the simulated values, as shown in Figure 3. It is clear from the illustration that ANFIS and ANN show closer agreement between the measured and the simulated values. The more robust prediction accuracy of the confirmed COVID-19 cases is related to the higher correlation that exists between the variables, as shown in Table 1.
Apart from the determination coefficient (R2) shown in Table 2, ANFIS and ANN have a further advantage over the traditional MLR model: MLR, which relies on the least-squares method, tends to fail when it encounters highly complex, non-linear data. In addition, MLR can generate negative predictions, which hinder the performance of the model. Figure 4 depicts the performance of the models for the confirmed COVID-19 cases in Wuhan city, China, using a radar chart that shows the scale of R in both the training and testing stages.
The predictive performance of the models can be ranked as ANFIS>ANN>MLR for the prediction of confirmed COVID-19 cases in Wuhan city, China, using various environmental variables. Figure 5 shows the response of the models as time series plots. The spread between the measured and the predicted values in these plots corroborates the results in Table 2. This result is in line with [25][26][27].
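The three performance indices used to compare the models (RMSE, R2 and R) can be computed as sketched below. The formulas are the common textbook definitions and are assumed here, since the exact expressions used by the authors are not reproduced in the text; the function names and the four-point demo series are hypothetical.

```python
import numpy as np

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((m - p) ** 2)))

def r_squared(measured, predicted):
    """Determination coefficient: 1 - residual SS / total SS."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    ss_res = np.sum((m - p) ** 2)
    ss_tot = np.sum((m - m.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def correlation(measured, predicted):
    """Correlation coefficient R between measured and predicted values."""
    return float(np.corrcoef(measured, predicted)[0, 1])

# Hypothetical demo values (NOT results from the study).
measured_demo = [10.0, 20.0, 30.0, 40.0]
predicted_demo = [12.0, 18.0, 33.0, 39.0]

rmse_demo = rmse(measured_demo, predicted_demo)
r2_demo = r_squared(measured_demo, predicted_demo)
r_demo = correlation(measured_demo, predicted_demo)
```

A lower RMSE and values of R2 and R closer to 1 indicate closer agreement between the measured and the simulated values.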

CONCLUSION
The threat of COVID-19 is of global concern and needs to be addressed rapidly because of its negative impact worldwide. Predicting and simulating the effects of environmental factors on the COVID-19 outbreak using artificial intelligence (AI) is therefore of paramount importance. In this work, three different models (ANFIS, ANN and MLR) were employed to simulate the effects of these environmental factors on the confirmed cases of COVID-19. The data were collected from a previously published study, referenced in the material and method section. The comparative results demonstrated the superiority of the AI-based models (ANFIS and ANN) over the traditional regression model (MLR) in predicting the confirmed cases. The results also showed the strength of ANFIS as a hybrid model: it outperformed the other two models and increased their predictive accuracy by up to 25% in the testing phase. Overall, the non-linear models displayed higher prediction accuracy than the classical linear model and can hence be regarded as reliable for predicting the effects of environmental factors on the COVID-19 outbreak. Other non-linear models such as support vector machine (SVM), Hammerstein-Wiener (HW) and fuzzy logic (FL), as well as optimization algorithms such as genetic algorithms (GA), are recommended in order to improve the performance accuracy of the modelling. This work is not restricted to China; simulating the effect of environmental factors on the number of COVID-19 cases is highly recommended in other parts of the world.