Machine learning building price prediction with green building determinant


In the Malaysian real estate industry, green building (GB) is still in its infancy, and its valuation is not yet integrated into standard property valuation [7]. The valuation standard only covers the valuation of property and buildings, and its definitions may not be sufficient to include GB development [8][9]. This makes it difficult for valuers to adapt the conventional valuation method to indicate and predict the price of GB accurately [10]. A further problem relates to real estate transaction data. Valuers often find it difficult to predict property prices over time [11], especially when transaction evidence for GB valuation is limited because GB development is relatively new in Malaysia and in the real estate industry [12][13]. By contrast, the valuation of non-GBs often depends on leasing or sales transaction data from several properties provided by JPPH, and such data are plentiful. It is important to realise that valuers face various challenges because of their heavy dependency on market data. A lack of data means a lack of support for the contribution of green attributes, which are supposed to be a factor influencing the GB price. Indeed, the real estate market is exposed to many price fluctuations because prices are correlated with many variables, some of which are beyond our control or perhaps unknown [14].
In light of this situation, machine learning (ML) has emerged as a very promising approach to resolving the issue, and it has proven effective in different kinds of prediction and classification problems [15][16][17]. ML offers a range of algorithms and techniques from which to select when developing a good predictor model. These are beneficial for resolving dataset problems such as imbalanced or insufficient data, for example the limited sales transaction evidence available for GB valuation. However, the accuracy of the results produced by an ML prediction model is highly dependent on many factors, including the tuning of the algorithms' hyper-parameters and the selection of different groups of features. Thus, this paper reports the design and implementation of a machine learning model based on auto hyper-parameter tuning and different groups of feature selection.
The contribution of this paper is two-fold. Firstly, it introduces the design and implementation of a machine learning model with auto hyper-parameter tuning. In the methodology part, this paper describes the technique of auto hyper-parameter tuning using the best estimator function provided by the Python Scikit-Learn library. Secondly, it presents how the GB determinant affects machine learning performance in predicting building prices, based on a real dataset from the Kuala Lumpur district in Malaysia. The structure of this paper is as follows. Section II covers the background of the study, namely ML in real estate prediction and the ML algorithms. Section III describes the research methodology, followed by the discussion of the results in Section IV. The concluding remarks are given in the last section.

BACKGROUND OF THE STUDY

2.1. Machine learning for real estate prediction
Accurate evaluation of property prices is crucial for real estate, the stock market, the tax sector, the economy and the purchasing power of buyers [18]. The conventional method is limited by the scope of the data that current systems take into account. Normally, the price of a property is predicted through basic comparative market analysis, using similar real estate in the same area to provide an approximate price for a particular property [19]. In the GB context, however, other factors that can add value or have a positive impact on the GB price should also be considered in order to produce an accurate price that reflects the current market value [20]. This will only happen if the valuer considers the historical factors in predicting the price of the GB. ML is seen to have the potential to take those factors and problems into account [14].
The common ML modelling techniques already applied to real estate problems are Linear Regression [21][22][23], Decision Tree [24][25][26][27], Random Forest [21][28][29], Ridge Regression [30] and Lasso Regression [24][31]. All of these algorithms have been used for prediction on real estate datasets, and this study tests all of them for predicting green building prices.

Machine learning algorithm
Five (5) ML algorithms are used in this study, namely Linear Regression, Decision Tree, Random Forest, Ridge and Lasso.
Linear Regression (LR) is one of the most well-known and well-understood algorithms in ML and statistics. It is a predictive model chiefly concerned with minimising the error so as to make the most accurate prediction possible from the dataset. The LR algorithm is represented by an equation describing the line that best fits the relationship between the output variable (y) and the input variable (x), obtained by finding the weighting for the input variable, which is called the coefficient (B) [32]. Equation (1) represents the Linear Regression algorithm.
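Assuming the standard simple linear regression form with a single input variable, equation (1) can be written as follows. The subscripts 0 and 1 on the coefficient B are assumed notation for the intercept and the slope.

```latex
Y = B_0 + B_1 x \tag{1}
```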
In this formula, Y is the dependent variable (DV) for the given input (x), which is the independent variable (IV). The main goal of the Linear Regression algorithm is to find the values of the coefficients B0 and B1 [21][22][25]. Owing to the simplicity of the algorithm, Linear Regression has been commonly used in real estate prediction problems [13][14][15].
Decision Tree (DT) is another common model used to solve regression and classification problems [33]. The algorithm produces a tree structure that includes a root node and branches. Each internal node, called a decision node, stands for a test on an attribute; each branch denotes the outcome of that test; and each leaf node, called a terminal node, holds a class label. The topmost node in the tree is called the root node [33][34], as presented in Figure 1. Previous research indicates that the DT algorithm can provide higher accuracy on a dataset than other algorithms such as Lasso [24]. DT has no problem approximating linear relationships between the independent and dependent variables [25][26], which makes it a good choice when it comes to prediction.
Random Forest (RF) is an advanced tree structure derived from the DT [35][36][37][38]. It is a type of ensemble ML model based on Bootstrap Aggregation, also called bagging. The bootstrap is a powerful statistical method for estimating a quantity, such as the mean, from a data sample. The RF model takes many data samples, calculates the mean of each, and then averages all of the mean values to give a better estimate of the true mean value [39]. Several studies have demonstrated that RF mostly outperforms many other algorithms in dealing with problems related to property prices [21][28][29].
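The bootstrap idea described above can be illustrated with a minimal sketch. The synthetic sample, the seed and the number of resamples are assumptions chosen only for illustration; they are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data sample standing in for an observed quantity.
sample = rng.normal(loc=10.0, scale=2.0, size=500)

# Draw many bootstrap resamples (with replacement) and record each mean.
boot_means = [
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
]

# Averaging the bootstrap means gives the aggregated estimate of the mean.
bagged_mean = float(np.mean(boot_means))
print(f"sample mean: {sample.mean():.3f}, bagged mean: {bagged_mean:.3f}")
```

A Random Forest applies the same resampling to the training rows, fits one tree per resample, and averages the tree predictions instead of the means.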
The Ridge algorithm is an ML model used for analysing multiple regression datasets that suffer from multicollinearity. Multicollinearity, also called collinearity, refers to a situation in which two or more explanatory variables in a multiple regression are highly correlated. Ridge Regression addresses this problem by adding a degree of bias to the regression estimates. Ridge Regression forces the coefficients to be smaller but does not force them to zero, so it does not remove irrelevant features but rather minimises their impact on the trained model [40]. To avoid overfitting, the Ridge Regression algorithm performs L2 regularisation, as stated in the formula. Meanwhile, the Lasso algorithm uses L1 regularisation [41]. Equation (2) denotes the Ridge algorithm.
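Assuming the usual matrix form of ridge regression, equation (2) and the corresponding L2-penalised estimate can be written as follows. The symbols e (residual errors) and \lambda (penalty strength) are assumed notation, since they are not given in the text.

```latex
Y = XB + e, \tag{2}
\qquad
\hat{B} = \arg\min_{B} \; \lVert Y - XB \rVert_2^2 + \lambda \lVert B \rVert_2^2
```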
In this formula, Y denotes the DV, X the IV, and B the regression coefficients to be estimated [40]. The remaining term represents the residual errors. Some research shows that Ridge Regression can perform worse than Linear Regression, even though Ridge Regression is designed to handle multicollinearity in modelling house prices [30]. In another study on house price prediction, Lasso Regression outperformed the Ridge algorithm in handling multicollinearity. Furthermore, in real estate value prediction using multiple algorithms, the Lasso Regression algorithm appeared to overfit the dataset when compared with the Ridge Regression algorithm [42].
Lasso stands for Least Absolute Shrinkage and Selection Operator, and the algorithm can perform both feature selection and regularisation. The only difference between the Lasso algorithm and the Ridge Regression algorithm is that the regularisation term uses absolute values. It constrains the sum of the absolute values of the model parameters so that the sum must be less than a fixed value [43][44]. In addition, the Lasso algorithm applies a shrinking (regularisation) process in which it penalises the coefficients of the regression variables, shrinking some of them to zero if they are not relevant. This process is applied to minimise the prediction error.
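The difference between the two penalties can be seen in a minimal sketch using the Scikit-Learn Ridge and Lasso estimators. The synthetic data, in which only the first two of five features drive the target, and the alpha values are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features, only the first two are relevant.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty: can set coefficients to zero

# Count coefficients that were shrunk exactly to zero by each model.
ridge_zero = int(np.sum(ridge.coef_ == 0.0))
lasso_zero = int(np.sum(lasso.coef_ == 0.0))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

In this setting Lasso drops the irrelevant features by zeroing their coefficients, whereas Ridge keeps every feature with a small non-zero weight.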
Research in [24] demonstrated the potential of the Lasso algorithm to produce higher accuracy than Linear Regression and Decision Tree within the scope of that study. The algorithm was also employed in predicting house prices in Ames, Iowa, in the United States, using real estate data from 2016 to 2020, and it was found that the Lasso algorithm outperformed the Ridge algorithm in this case [30]. The researchers also mentioned that Lasso is very useful for feature selection and for eliminating useless features.

METHODOLOGY

3.1. Dataset
The dataset is a collection of housing prices in 2018 with determinants that include GB. As this paper uses machine learning prediction, these variables are called features. Table 1 shows the set of features used to develop the machine learning prediction model. This study uses 18 features as independent variables (IV) for predicting the Transaction Price as the dependent variable (DV).

Feature selection
Figure 2 shows the Pearson correlation between each feature and the DV, computed with Python code.
All the IVs have a very weak correlation with the Transaction Price. The GB variable has the highest correlation among the features, but 0.25 is still considered weak. Even with a very weak correlation, however, the features were expected to contribute some useful information to the model. There are several approaches to selecting features for a machine learning model. Features can be divided according to their correlation level or according to their types or purposes. In this study, the features were divided into three groups, namely without GB, GB only, and GB with other features.
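A minimal sketch of how such a correlation ranking and the three feature groups could be produced with pandas is given here. The column names and the synthetic values are hypothetical placeholders and do not reproduce the 18 features of Table 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Hypothetical stand-in for the housing dataset (not the study's data).
df = pd.DataFrame({
    "GB": rng.integers(0, 2, size=n),           # 1 = green building
    "LotSize": rng.normal(1500, 300, size=n),
    "Floors": rng.integers(1, 40, size=n),
})
df["TransactionPrice"] = (
    50_000 * df["GB"] + 100 * df["LotSize"] + rng.normal(0, 80_000, size=n)
)

# Pearson correlation of every feature with the dependent variable.
corr_to_price = (
    df.corr(method="pearson")["TransactionPrice"].drop("TransactionPrice")
)
print(corr_to_price.sort_values(ascending=False))

# The three feature groups used in the experiments.
features_without_gb = [c for c in corr_to_price.index if c != "GB"]
features_gb_only = ["GB"]
features_gb_with_others = list(corr_to_price.index)
```

Each group is then passed to the same set of algorithms so that the effect of the GB determinant can be compared.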

Machine learning algorithms with auto hyper-parameter tuning
The five algorithms explained in Section 2.2, namely Random Forest Regressor, Decision Tree Regressor, Ridge, Lasso and Linear Regression, were used in this study. Before the final predictions were produced, auto hyper-parameter tuning was carried out on the training dataset using the best_estimator_ attribute of the grid search object in the Python Scikit-Learn library. The grid search optimises the hyper-parameters of the given machine learning algorithm over a set of candidate values. This is the easiest and quickest way for inexperienced data scientists to obtain suggested parameter configurations for the algorithms.
The steps for implementing the auto hyper-parameter tuning are as follows:
1. Call the regressor algorithm.
2. Create a dictionary that defines the initial parameters for the algorithm with their corresponding sets of values.
3. Call the grid search method, passing the created dictionary.
4. Do preliminary training of the algorithm with the grid search instance and get the estimated parameters.
5. Set the algorithm with the suggested parameters and conduct another fit with them.
6. Perform another training with the suggested parameters.
7. Validate the prediction values produced by the algorithm and get the score.
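The steps above can be sketched with the Scikit-Learn GridSearchCV class, whose best_params_ and best_estimator_ attributes hold the suggested configuration after fitting. The synthetic data, the parameter grid and the choice of Random Forest are assumptions for illustration only; the study's actual grids are not reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical data standing in for the housing features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1: call the regressor algorithm.
regressor = RandomForestRegressor(random_state=0)

# Step 2: dictionary of initial parameters and candidate values (assumed).
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

# Step 3: create the grid search instance with the dictionary.
search = GridSearchCV(regressor, param_grid, cv=3)

# Step 4: preliminary training to obtain the suggested parameters.
search.fit(X_train, y_train)
suggested = search.best_params_

# Steps 5-6: set the algorithm with the suggested parameters and retrain.
tuned = RandomForestRegressor(random_state=0, **suggested)
tuned.fit(X_train, y_train)

# Step 7: validate the predictions and get the score (R squared).
score = tuned.score(X_val, y_val)
print(suggested, round(score, 3))
```

With the default refit setting, search.best_estimator_ is already refitted on the whole training set, so it could be used directly in place of the explicit retraining in steps 5 and 6.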

Experiment configuration
In this study, the dataset was divided into training and validation sets at a ratio of 80:20. The experiments were run in a Python 3.6 Jupyter Notebook on an Intel i7 7th Generation processor with 16 GB of RAM. The same 80:20 split between training and validation was applied to every machine learning model and algorithm. Each model was run five times and the average of the metrics was calculated for comparison. The metrics used to present the performance of the machine learning algorithms are R squared (R²) and root mean squared error (RMSE). R² explains how well the selected features predict the dependent variable, while RMSE represents the sample standard deviation of the differences between the predicted and real values. R² typically ranges between 0 and 1, with higher values being better. Meanwhile, a lower RMSE indicates smaller errors or differences in the prediction results.
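As a small worked example, the two metrics can be computed with Scikit-Learn as sketched here. The four real and predicted values are invented for illustration and are not results from the study.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical real and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# R squared: 1 - (sum of squared errors / total sum of squares).
r2 = r2_score(y_true, y_pred)

# RMSE: square root of the mean squared error.
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Here the squared errors sum to 1.5 and the total sum of squares is 20, so R² = 1 - 1.5/20 = 0.925 and RMSE = sqrt(1.5/4), about 0.612. Note that r2_score can return a negative value when a model predicts worse than the mean of the real values.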

RESULT
The results are presented in separate tables according to the three groups of feature selection, namely without GB, GB only, and GB with other features. The average results from the five experiments for each machine learning model were calculated and recorded. The results of the model without the GB feature are presented in Table 2. Without the GB determinant, only the Random Forest Regressor produced an acceptable result. The algorithm had the lowest RMSE (0.027) and the highest coefficient of determination, R² (0.69). The mean R² of the other algorithms was very weak, although the error of each algorithm is considered promising. Table 3 presents the mean R² and RMSE for the tested algorithms with the GB determinant only.
Similarly, the Random Forest regressor outperformed the other algorithms, but its RMSE and R² values were not as good as those in Table 2. The performance of the Random Forest regressor dropped when it depended only on the GB determinant. However, not much difference could be seen for the other algorithms. Lastly, Table 4 lists the results of each algorithm when tested with all determinants, combining GB with the others. Combining GB with other features in the models does not show a significant improvement for any of the tested algorithms. A slightly better R² can be seen for the Decision Tree regressor.

CONCLUSION
Within the scope of this study, it can be concluded that the GB determinant has not contributed much to the performance of the machine learning models, even though its correlation with the building price is higher than that of the other determinants. Moreover, the worst results for all algorithms were produced by the model with the single GB determinant. Among the five selected algorithms, only the Random Forest regressor shows consistent performance across all groups of feature selection. Therefore, the Random Forest regressor can be further enhanced in future research for the same case of building price prediction.

Suraya received her bachelor's degree in Computer Science, majoring in software engineering, from UTM. She later pursued her master's degree in Computer Science at UPM. She first started her career in industry when she was employed as an Associate Network Engineer by Ramgate Systems Sdn. Bhd in June 1996. She started her career as a full-time lecturer with UTM after receiving her master's degree. In three years of service at UTM she managed to complete two research projects funded by the university. She was offered a position at Universiti Teknologi MARA (UiTM), Seri Iskandar, Perak, Malaysia, in 2004, which she gladly accepted, and she has been lecturing Computer Science subjects there ever since. In 2015, she received her PhD in Information Technology and Quantitative Sciences. At UiTM, she has so far completed more than fifteen research projects and is currently active with three research grants.