An adjustment degree of fitting on fuzzy linear regression model toward manufacturing income

ABSTRACT


INTRODUCTION
A regression model is one of the most popular statistical models used to determine the association between multivariate data [1]. It is commonly used in the applied sciences, computer science, social sciences, engineering, and economics [2]. Regression analysis is used when a response variable depends on one or more explanatory variables. It shows how the value of the response variable changes when one of the explanatory variables varies while the rest remain unchanged [3].
However, statistical modelling cannot be used on all data. Traditional methods cannot determine a result accurately when vague data exist. Thus, in studies of the association between dependent and independent variables, the regression method is unable to predict precisely when unpredicted events occur. Data are not necessarily normal in some situations, particularly when predicting income, and may depart from normality because of outliers or missing values. In such cases, the existing method cannot estimate the results accurately, and alternative approaches are necessary. Many fields, especially engineering and technology, use fuzzy methods to analyze uncertain data [4]. In complex systems involving human estimation, fuzzy methods can be applied to analyze uncertain or imprecise relationships between response and explanatory variables [5]. First introduced by Tanaka, fuzzy linear regression (FLR) has demonstrated its usefulness in solving complex problems in which many cases are difficult to quantify [6]. FLR can provide an approximate relationship between variables when the available information is insufficient or uncertain [7].
Zolfaghari, on the other hand, introduced an extension of the model that involves triangular fuzzy numbers (TFNs). This model allows either a symmetrical or an asymmetrical membership function (MF). The model also considers two factors in parameter estimation, one of which is the degree of fitting [8]. Parameter estimation can be carried out in two ways: linear programming and the fuzzy least squares method. Previous studies have applied multiple linear regression together with fuzzy regression techniques in various fields of study [9]-[13].
Previous researchers proposed the least squares method as a common FLR model [14]. However, the model was sensitive to outliers, which led to inaccuracies. Another fuzzy model, based on least absolute deviation, was introduced as an alternative to handle the outlier issue [15]. In addition, it works well on both symmetrical and non-symmetrical data. With the least absolute deviation approach, the model is built from matrices of fuzzy numbers. A study by [16] showed that the fuzzy model based on least absolute deviation performed better and was more structured than the least squares method.
The aim of this study is to propose Zolfaghari's FLR model with an adjusted degree of fitting (H) for estimating future manufacturing income. The proposed model is expected to be the optimal model among those compared and to be applicable to the industrial sector. Moreover, no assumptions need to be checked before the model can be analyzed. The optimal model is obtained by adjusting the degree of fitting (H) to find the smallest error value.

PROPOSED METHOD
The research framework of this study is shown in Figure 1. The data were obtained from the Department of Statistics Malaysia (DoSM) and cover various industry sectors, including farming, fishing, mining, quarrying, manufacturing, construction, transport, and others [17]. Data filtering was performed, and only the manufacturing sector was chosen for detailed analysis. The dataset has nine explanatory variables: legal status (individual proprietorship, partnership, private, public, cooperative, others), ownership (Malaysian residents, non-Malaysian residents, joint), value of assets (total net book value), total employment, total salaries and wages paid, number of holders of a degree and above, number of diploma holders, number of holders of the Malaysian Certificate of Education (MCE) and below, and total expenditure. Income is the dependent variable [17]. After data filtering, data pre-processing steps such as outlier treatment, a normality test and a multicollinearity test were carried out. These tests are compulsory because the assumptions of the model must be fulfilled first. Next, the dataset was fed into several algorithms: multiple linear regression (MLR), Tanaka's FLR and Zolfaghari's FLR with an adjusted degree of fitting (H). The error values were then calculated using several performance indicators: mean square error (MSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) [18]. The resulting MSE, MAE and MAPE were compared to find the best predictive model for the industrial sector.

RESEARCH METHOD

Multiple linear regression (MLR)
MLR is an extension of simple linear regression [19]. It is applied in many statistical analyses and evaluates the association between the dependent and independent variables [20]. The MLR model has two key assumptions: normality of the data and absence of multicollinearity among the explanatory variables. The Q-Q plot is used to check normality for the response and explanatory variables [21]. The next test is the multicollinearity check, which is carried out among the explanatory variables using the variance inflation factor (VIF) to detect dependency between variables. The MLR model can be detailed as in (1).
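$$Y_r = \beta_0 + \beta_1 X_{r1} + \beta_2 X_{r2} + \dots + \beta_k X_{rk} + \varepsilon_r \tag{1}$$

where $r = 1, 2, \dots, N$, the response variable is $Y$, the explanatory variables are $X_1$ to $X_k$, the regression coefficients are $\beta_1$ to $\beta_k$ with intercept $\beta_0$, and $\varepsilon_r$ is the error term. The least squares method (LSM) is shown in (2) and (3), where $\mathbf{Y}$ is the vector of responses and $\mathbf{X}$ is the $N \times (k+1)$ matrix of explanatory variables whose first column is ones.

$$S(\beta) = \sum_{r=1}^{N} \varepsilon_r^2 = \sum_{r=1}^{N} \left( Y_r - \beta_0 - \sum_{j=1}^{k} \beta_j X_{rj} \right)^2 \tag{2}$$

$$S(\beta) = (\mathbf{Y} - \mathbf{X}\beta)^{\top} (\mathbf{Y} - \mathbf{X}\beta) \tag{3}$$

In LSM, the best fit is computed by minimizing $S(\beta)$. $S(\beta)$ is differentiated with respect to $\beta$ and set equal to zero at $\hat{\beta}$, as in (4); the LS estimator is then given in (5).

$$\left. \frac{\partial S}{\partial \beta} \right|_{\hat{\beta}} = -2\mathbf{X}^{\top}\mathbf{Y} + 2\mathbf{X}^{\top}\mathbf{X}\hat{\beta} = 0 \tag{4}$$

$$\hat{\beta} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{Y} \tag{5}$$

The fitted value $\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$ is denoted by $\hat{Y}$, and the residual is $e = Y - \hat{Y}$. The least squares estimator is explained in detail in [19], [22].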
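The least squares estimator can be illustrated with a short sketch. This is a minimal example assuming NumPy is available; the function name `ols_fit` and the toy data are hypothetical and are not taken from the study's dataset.

```python
import numpy as np

def ols_fit(X, y):
    """Least squares estimate obtained by solving (X'X) b = X'y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    beta_hat = np.linalg.solve(design.T @ design, design.T @ y)
    fitted = design @ beta_hat
    residuals = y - fitted  # e = Y - Y_hat
    return beta_hat, fitted, residuals

# Hypothetical toy data generated exactly from y = 1 + 2*x1 + 3*x2.
X_demo = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y_demo = [1 + 2 * a + 3 * b for a, b in X_demo]
beta_demo, fitted_demo, resid_demo = ols_fit(X_demo, y_demo)
```

Solving the linear system directly, rather than forming the inverse of the cross-product matrix, is a common numerical choice; both routes give the same estimator when the matrix is well conditioned.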

Fuzzy linear regression (Tanaka)
In 1982, FLR was first proposed by Tanaka. FLR analysis aims to find models that fit observed fuzzy data. Models usually differ in the formula used for fitting. The final FLR model is shown in (6). To obtain the fuzzy model, the fuzzy parameters are estimated by solving a linear programming problem as in (7) [6], [23].
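For orientation, Tanaka's model is commonly written as below for crisp observations. This is the textbook form and may differ in detail from (6) and (7) as used in this study; the symbols $\alpha_j$ (center), $c_j$ (spread) and $k$ (number of explanatory variables) are assumed notation, with $X_{r0} = 1$ for the intercept.

```latex
\begin{align*}
\tilde{Y}_r &= \tilde{A}_0 + \tilde{A}_1 X_{r1} + \dots + \tilde{A}_k X_{rk},
  \qquad \tilde{A}_j = (\alpha_j, c_j) \\[4pt]
\min_{\alpha,\, c} \; J &= \sum_{j=0}^{k} c_j \sum_{r=1}^{N} |X_{rj}| \\[4pt]
\text{subject to} \quad
  & \sum_{j=0}^{k} \alpha_j X_{rj} + (1 - H) \sum_{j=0}^{k} c_j |X_{rj}| \ge Y_r, \\
  & \sum_{j=0}^{k} \alpha_j X_{rj} - (1 - H) \sum_{j=0}^{k} c_j |X_{rj}| \le Y_r, \\
  & c_j \ge 0, \qquad r = 1, \dots, N, \qquad 0 \le H < 1 .
\end{align*}
```

The two inequality constraints require every observed $Y_r$ to lie inside the $H$-level interval of the fuzzy output, while the objective keeps the total spread as small as possible.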

Fuzzy linear regression (Zolfaghari)
In 2014, Zolfaghari proposed an extension of Tanaka's FLR model. Two kinds of parameters can be considered in this model: symmetric or asymmetric. As in Tanaka's model, the fuzzy parameters are determined by linear programming. This study focuses on the symmetric case, in which the fuzzy coefficients are assumed to be triangular fuzzy numbers. The final FLR model is also of the form shown in (6). The linear programming problem is shown in (8) [8].
The linear programming problem is subject to $c \geq 0$, where $H$ is the degree of fitting and $c$ is the spread around the center of each symmetric triangular fuzzy number.
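A minimal sketch of a linear program of this kind is given below, assuming Tanaka-style inclusion constraints for symmetric triangular coefficients and crisp observations. It is a stand-in for illustration and not Zolfaghari's exact problem (8): the function name `fit_symmetric_flr`, the objective (total spread) and the toy data are assumptions. It relies on `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def fit_symmetric_flr(X, y, H=0.5):
    """Return (centers, spreads) of symmetric triangular coefficients."""
    if not 0.0 <= H < 1.0:
        raise ValueError("H must satisfy 0 <= H < 1")
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # intercept column first
    abs_design = np.abs(design)
    n_coef = design.shape[1]
    # Decision vector: [centers (free), spreads (>= 0)].
    # Assumed objective: total spread of the fitted fuzzy outputs.
    cost = np.concatenate([np.zeros(n_coef), abs_design.sum(axis=0)])
    # Upper edge of the H-level band must reach y:
    #   -(center'x) - (1 - H) * spread'|x| <= -y
    upper = np.hstack([-design, -(1.0 - H) * abs_design])
    # Lower edge of the H-level band must not exceed y:
    #   (center'x) - (1 - H) * spread'|x| <= y
    lower = np.hstack([design, -(1.0 - H) * abs_design])
    result = linprog(
        cost,
        A_ub=np.vstack([upper, lower]),
        b_ub=np.concatenate([-y, y]),
        bounds=[(None, None)] * n_coef + [(0, None)] * n_coef,
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n_coef], result.x[n_coef:]

# Hypothetical toy data: one explanatory variable, five observations.
X_demo = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
centers, spreads = fit_symmetric_flr(X_demo, y_demo, H=0.5)
```

The centers give the crisp prediction, and the spreads give the width of the fuzzy band around it; every observation lies inside the band at the chosen H.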

Statistical performance measurements
Statistical performance measurements are used to evaluate how well a prediction model generalizes and to guard against overfitting. Their main use is to estimate how accurately a predictive model will perform in practice [20]. Various performance measurements can be used in statistical analysis; three are used in this study.
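Mean square error (MSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) are represented as in (9) to (11),

$$\text{MSE} = \frac{1}{N} \sum_{r=1}^{N} (Y_r - \hat{Y}_r)^2 \tag{9}$$

$$\text{MAE} = \frac{1}{N} \sum_{r=1}^{N} |Y_r - \hat{Y}_r| \tag{10}$$

$$\text{MAPE} = \frac{100}{N} \sum_{r=1}^{N} \left| \frac{Y_r - \hat{Y}_r}{Y_r} \right| \tag{11}$$

where $Y_r$ is the real data, $\hat{Y}_r$ is the predicted data, and $N$ is the number of observations.

The analysis was carried out by adjusting the degree of fitting (H) from 0 to 1. The H value was then selected by obtaining the smallest error value. The analysis is discussed in detail in the following sections.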
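The three error measures can be sketched in a few lines of plain Python. The function names and the toy numbers are hypothetical; MAPE is expressed in percent and is undefined when a real value is zero.

```python
def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

# Hypothetical toy values, not taken from the study.
actual_demo = [100.0, 200.0, 400.0]
predicted_demo = [110.0, 180.0, 400.0]
errors_demo = (
    mse(actual_demo, predicted_demo),
    mae(actual_demo, predicted_demo),
    mape(actual_demo, predicted_demo),
)
```

Because MSE is measured in squared units of income, it can be many orders of magnitude larger than MAE on the same data.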

Multiple linear regression
The MLR model was fitted with the nine explanatory variables that contribute to predicting manufacturing income. Of the two preliminary assumption tests, the normality result is shown in Figure 2. The Q-Q plot in Figure 2 shows that the data are approximately normally distributed, since the points lie close to a straight line. Next, the variance inflation factor (VIF) values from the multicollinearity test are shown in Table 1. All VIF values of the explanatory variables are less than 10, which indicates that the dependencies are not severe enough to cause multicollinearity [24], [25].
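The VIF check can be sketched as follows, using the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing the j-th explanatory variable on the others. This is a minimal example assuming NumPy; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X, from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n_obs, n_vars = X.shape
    vifs = []
    for j in range(n_vars):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n_obs), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Hypothetical toy data: uncorrelated columns versus nearly collinear columns.
X_orthogonal = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
X_collinear = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]]
vif_orthogonal = variance_inflation_factors(X_orthogonal)
vif_collinear = variance_inflation_factors(X_collinear)
```

Uncorrelated columns give a VIF of 1, while the nearly collinear pair gives values far above the threshold of 10 used in this study.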
Furthermore, the MLR analysis indicated that only six explanatory variables are significant for manufacturing income, as shown in Table 1. An independent variable is considered significant if its p-value < 0.05. Meanwhile, the correlation coefficient (r) is 0.993 and the coefficient of determination ($R^2$) is 0.987, which indicates a strong positive linear correlation between the X variables and Y. The error values are shown in Table 2.

Fuzzy linear regression (Tanaka)
In Tanaka's model, the degree of fitting was adjusted between 0 and 1 to obtain the smallest error value, as shown in Table 3. The best MSE, MAE and MAPE values are 829657000000, 185519.8663 and 126.5645, respectively. The best model is shown in (13), with H = 0.95 and all explanatory variables included. Table 4 shows the fuzzy centre and the fuzziness of each parameter.

Fuzzy linear regression (Zolfaghari)
Zolfaghari's model was also used to predict manufacturing income. The MSE, MAE and MAPE values are 575629000000, 154153.8335 and 104.6929, respectively, as shown in Table 5. The best model is shown in (14), where H = 0.025. Table 6 shows the fuzzy centre and the fuzziness of each parameter. Figure 3 shows the plot of real and predicted data for manufacturing income.

CONCLUSION
For the purpose of predicting manufacturing income, three models were applied: MLR, the FLR proposed by Tanaka, and the FLR proposed by Zolfaghari. The FLR model introduced by Zolfaghari with H = 0.025 appears to be the optimal model based on the MSE, MAE, and MAPE derived from all nine explanatory variables. By contrast, only six explanatory variables were significant according to the MLR. Since the MLR has the highest error values of the three methods, it cannot be used as a guide. Zolfaghari's FLR model indicates that manufacturing income increases with legal status, total salaries and wages paid, number of degree and diploma holders, number of MCE and below holders, and total expenditure. Conversely, manufacturing income decreases with ownership, asset value and number of employees. Manufacturers can use this output as a guide to improve their earnings.