IAES International Journal of Artificial Intelligence (IJ-AI)

Received Jan 28, 2022; Revised Jul 20, 2022; Accepted Aug 18, 2022

Today, the world lives in the era of information and data. It has therefore become vital to collect data and keep them in a database so that a set of processes can be performed and essential details obtained. The null value problem appears during these processes; it significantly influences operations such as analysis and prediction and leads to inaccurate outcomes. To address this, the authors modify the random forest technique to estimate the null values in datasets obtained from the University of California Irvine (UCI) machine learning repository. The datasets used in this study are connectionist bench, phishing websites, breast cancer, ionosphere, and COVID-19. The modified random forest algorithm is based on three matters and three numbers of null values (one, two, and three in the same row). The samples are chosen with the proposed less redundancy bootstrap. Each tree has distinctive features that depend on hybrid features selection. The final result is determined by ranked voting for classification. The study found that the modified random forest algorithm achieved better accuracy than the traditional algorithm; it relied on four parameters, and its accuracy in imputing the null values improved by 9.5%, 6.5%, and 5.25% for one, two, and three null values in the same row of the datasets, respectively.


INTRODUCTION
Machine learning [1]-[4] is one of the most active areas in the research community today, characterised by its ability to design and develop algorithms that allow machines to learn [5], [6]. It is a subfield of artificial intelligence in which the learning process consists of automatically extracting rules and patterns from data [7], [8]. Machine learning is closely related to fields such as data mining, statistics, and pattern recognition, among others [9]-[11]. Supervised machine learning algorithms apply what has been learned from past examples to new data in order to predict future events [12]-[15]. These algorithms analyse a known training dataset and produce a function that predicts the output values, so that the system can provide targets for any new input after adequate training [16]-[18]. Furthermore, machine learning algorithms can compare their calculated outputs with the correct outputs to find errors, and the model can then be modified accordingly [19]-[22]. One of the most classical machine learning techniques utilised for prediction is the random forest [23]-[25]. This technique is flexible and straightforward to use for prediction [26]; the forest consists of trees, and it is commonly said that the more trees there are, the more robust the forest is. In other words, the random forest generates decision trees from randomly selected data samples [27], [28]. A prediction is then obtained from each tree, and the final result is chosen through voting [29]. In general, this technique is employed for both classification and regression [30], [31]. The most critical issue that databases face is the existence of null values [32], because organisations rely heavily on the collection, storage, and analysis of data for decision-making purposes.
In short, a null value can be described as an empty field, meaning that the value is missing or unknown. Databases consist of columns and rows that hold data [33], but some of these fields contain a null or missing value [34]. Moreover, dealing with such values is not effortless, as it may take a long time to detect them and locate where they occur [35]. As a result, databases suffer significantly from empty fields, which lead to inaccurate records and incorrect calculations. This in turn forces a return to the traditional manual method of data entry, so that managing the database requires great effort and time and the resulting data remain unreliable.
The foremost contribution of this study is a set of modifications to the random forest algorithm for imputing null values in five datasets gathered from the University of California Irvine (UCI) machine learning repository. The modification process depends on three main elements that are improved within the algorithm: a bootstrap with less redundancy, an added features selection step, and a modified ranking stage. The study also compares the modified algorithm with the unmodified algorithm to assess the performance of the two approaches in estimating null values.

LITERATURE SURVEY
This section reviews literature on the random forest technique and related methods for solving the problem of null or missing values in large datasets. Sadiq et al. [36] proposed using swarm intelligence and iterative dichotomiser 3 (ID3) techniques to solve the problem of null values in a large set of data. The swarm intelligence algorithm, represented by the bees algorithm, is used for feature selection, while ID3 is used to find the statistical results. The study compares these two approaches for estimating null values; the outcomes indicate that ID3 performs best, finding results without affecting the accuracy of the null value estimation no matter how much these values increased. Sadiq and Chawishly [37] improved the performance of the ID3 algorithm to solve the problem of null values in a large dataset. They concluded that when one or two null values occur within a row, the proposed system can estimate 99% of the null values, and when three null values appear within the row, the estimation rate is 97%. Ramosaj and Pauly [38] suggested using several techniques (stochastic gradient tree boosting, the C5.0 algorithm, and random forest) to predict missing values in credit information and Facebook data. The authors developed these techniques to work more efficiently and analysed their performance on continuous, categorical, and mixed data. They concluded that the random forest performed best, as it gave good results in finding the missing values in less time.
Salman et al. [39] developed a random forest algorithm whose performance is increased via the meerkat clan algorithm in order to impute missing values. After 100 iterations, the performance and accuracy of the random forest in calculating these values are good, but at 200 and 300 iterations the execution becomes more complex. Increasing the block size in the modified algorithm improves the accuracy of null-value computation. That paper is characterised by its use of both categorical and numeric null values, which makes the work more general. Jackins et al. [40] suggested applying artificial intelligence techniques (naive Bayes and random forest) to predict diabetes, heart disease, and breast cancer. The database for that investigation was taken from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), and all patients in it are females aged 21 years or older. After several experiments, missing values were replaced with null values. The results of that study demonstrate the ability of the techniques to handle missing values and to classify data efficiently. In another study, Gök and Olgun [41] collected blood samples from patients at Einstein Hospital in Brazil and used them to predict the severity level of COVID-19 with machine learning algorithms (decision tree, random forest, k-nearest neighbour, support vector machine classifier, gradient boosting, Gaussian naive Bayes, multi-layer perceptron, and Gaussian process). A set of missing data appeared during the work and affected it; among several approaches to filling the missing values, the authors replaced them with the most common value. That study obtained an accuracy of 0.98 with the random forest classifier.

METHOD
Random forest is a supervised algorithm [42]-[45]; as its name suggests, it builds a forest of randomised trees. It works by creating multiple decision trees and combining them to obtain a more accurate and stable prediction. In general, the more trees in the forest, the more robust the algorithm. This algorithm is adaptable and easy to use [46]-[48]; even without parameter tuning, it produces a good result most of the time. In addition, many recent studies have applied this technique; for instance, it has been utilised to analyse x-ray images of COVID-19 patients, among many other applications [49]-[52]. The algorithm is commonly employed for data analysis and prediction due to its simplicity [53]-[55]. Figure 1 illustrates the steps for creating a random forest classifier [56]. In these steps, randomisation is added to the model as the trees grow. Instead of searching for the most important property among all properties when splitting a node, the algorithm chooses the best property among a random subset of the properties. This yields a wider variety of trees and, generally, a better model.

Figure 1. Random forest steps
Moreover, this algorithm considers only a random subset of properties when splitting nodes. The trees can be made even more random by additionally using random thresholds for each property rather than searching for the best possible thresholds, as a standard decision tree does. As mentioned earlier, random forests are a collection of decision trees [57], [58], but there are several differences between the two. If a training dataset with characteristics and labels is fed into a decision tree, the tree formulates a set of rules that are then used to make predictions. For instance, to predict whether a person on a social networking site will click on a specific advertisement, one gathers information about the advertisement, about the people who clicked on it in the past, and about some characteristics that describe their decisions. If these characteristics are put into a decision tree, rules are derived that predict whether the ad will be clicked. The random forest, in contrast, selects observations and characteristics randomly to build many decision trees and then averages their results. When decision trees are too deep, they can suffer from overfitting. Random forests avoid overfitting most of the time by making random subsets of the characteristics, building smaller trees from these subsets, and then merging the sub-trees. This slows down computation, depending on how many trees the random forest builds.
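The node-splitting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`gini`, `best_split`, `random_split`), the use of Gini impurity as the split criterion, and the exhaustive threshold search are assumptions made here for illustration only.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Best binary split (feature, threshold, weighted impurity),
    searching only the candidate features in feature_ids."""
    best = (None, None, float("inf"))
    n = len(rows)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best


def random_split(rows, labels, n_candidates, rng):
    """Random-forest style split: draw a random subset of the features,
    then pick the best split within that subset only."""
    all_features = list(range(len(rows[0])))
    candidates = rng.sample(all_features, n_candidates)
    return candidates, best_split(rows, labels, candidates)
```

A standard decision tree would call `best_split` with every feature at every node; a tree inside a random forest calls `random_split` instead, which is what makes the trees differ from one another.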
Several important modifications of the random forest algorithm have been proposed, targeting different parts of it. In [59], the random forest was modified by adding double feature selection to filter the relevant features. Fornaser et al. [60] proposed a modified random forest algorithm called Sigma-z, which addresses two points: the lack of any metrological characterisation of the inputs passed to the model, such as the uncertainty of the data, and the lack of an assessment of the reliability of the results. Sigma-z considers the original classification structure, leaving it untouched, together with the distribution of the training datasets. An overlaying structure statistically combines the two and also includes in the process the propagation of feature uncertainties as a further element deriving from the input measurements. Mohsen and Sadiq [61] proposed a ranked voting strategy instead of classical voting, in which each tree is given a different weight based on its accuracy. They used one-hot encoding to represent the target of the random forest, and this technique gave good results compared with the classical one [62]. The random forest algorithm is chosen in this study for two major reasons: it is less prone to overfitting than a decision tree and other algorithms, and it can indicate the significance of features. Overfitting also becomes less significant as the dataset grows, since a sufficient amount of data helps machine learning models find new patterns efficiently.

THE PROPOSED WORK
This study concentrates on modifying three critical points in the random forest algorithm: a bootstrap with less redundancy, an added features selection method, and a modified ranking stage. The bootstrap is a crucial stage in the random forest algorithm. In the modification step, a specific bootstrap strategy is used that decreases the redundancy of samples. Reducing the redundancy increases the diversity of the samples. Algorithm 1 illustrates the essential steps of the bootstrap with less redundancy. This idea guarantees a fair diversity of bootstrap samples, which leads to different trees in the random forest. To further increase the performance of the random forest algorithm in the null-value estimation problem, the proposed modification addresses several steps. The features selection step plays a significant role in increasing the accuracy of the random forest algorithm. The proposed modification therefore makes this step hybrid: the selected features depend on more than one feature selection method. This hybrid method can be calculated with (1). According to this equation, the random forest selects the features based on two feature selection methods, so the selected features are more powerful and more relevant to the target.
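Because equation (1) is not reproduced in this text, the following is only a hedged sketch of how two feature selection scores might be combined. It assumes a weighted sum with a weight `w` (the experiments report a parameter W = 0.5) of information gain and Gini gain for categorical features; the functions `hybrid_score` and `select_features` and the exact combination rule are assumptions, not the authors' formula.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(values, labels, impurity):
    """Impurity reduction obtained by grouping the labels by the
    values of one categorical feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - remainder


def hybrid_score(values, labels, w=0.5):
    """ASSUMED combination: w * information gain + (1 - w) * Gini gain."""
    return w * gain(values, labels, entropy) + (1 - w) * gain(values, labels, gini)


def select_features(columns, labels, k, w=0.5):
    """Return the names of the k features with the highest hybrid score.
    columns maps a feature name to its list of values."""
    scores = {name: hybrid_score(col, labels, w) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With `w = 1` the sketch reduces to information gain alone and with `w = 0` to Gini gain alone, so intermediate weights trade one criterion against the other.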
Another modification concerns the ranking strategy of the trees. Before modification, the random forest algorithm builds a set of classification trees to predict the outcome from the predictors. Each tree is trained on a different bootstrap sample of subjects, and a random subset of the predictors is considered at every node of the tree. The traditional random forest aggregates tree-level results evenly across trees. In the ranked approach, the traditional random forest algorithm is still used to build the trees of the forest, but the ranking is applied when the trees are aggregated. The class vote of every tree in the forest is considered, and the better-performing trees are ranked higher. Because the ranking depends directly on performance, the data are divided into training and testing sets before the trees are built on the bootstrap samples, in order to avoid a biased estimate of the prediction error. The predictive ability of each tree is calculated from its out-of-bag error. In this study, the training data of the ranked random forest comprise three quarters of the full sample. Thus, approximately one half of the full sample is in-bag for every tree and is employed to construct the tree, while one quarter is out-of-bag and is used to estimate the performance of the tree, from which the tree accuracy is calculated. Subsequently, the n trees are used to obtain votes for the remaining quarter, which serves as an independent test set, and the votes (predicted classifications) are aggregated over the trees using the ranking. Algorithm 2 illustrates the stages of the modified random forest algorithm with the ranked class prediction, which is based on every tree in the forest.
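The difference between classical and ranked voting can be sketched as follows. This is an illustration under a stated assumption, namely that each tree's vote is weighted directly by its out-of-bag accuracy; the actual weighting used in Algorithm 2 may differ, and the function names are hypothetical.

```python
from collections import defaultdict


def majority_vote(predictions):
    """Classical voting: every tree counts equally."""
    counts = defaultdict(float)
    for p in predictions:
        counts[p] += 1.0
    return max(counts, key=counts.get)


def ranked_vote(predictions, accuracies):
    """Ranked voting (ASSUMED form): each tree's vote is weighted by
    its out-of-bag accuracy, so better trees have more influence."""
    totals = defaultdict(float)
    for p, a in zip(predictions, accuracies):
        totals[p] += a
    return max(totals, key=totals.get)
```

For example, with tree predictions `["mine", "rock", "rock"]` and out-of-bag accuracies `[0.95, 0.40, 0.45]`, classical voting selects `rock` (two votes against one), whereas the ranked vote selects `mine` because the single accurate tree outweighs the two weak ones (0.95 against 0.85).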
The principal stages of this study are:
− Stage I: A number of random records is drawn from the records of the dataset. The samples selected are based on the proposed less redundancy bootstrap.
− Stage II: A unique decision tree is created for each sample. Each tree has distinctive features depending on hybrid features selection.
− Stage III: Each decision tree generates a result.
− Stage IV: The final result is evaluated based on ranked voting for classification.
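Algorithm 1 is not reproduced in this text, so the sketch below shows only one possible way to reduce redundancy in a bootstrap sample. It assumes a cap on the number of duplicated records, expressed through a hypothetical parameter `min_unique_ratio`; this parameter and the rejection rule are illustrative assumptions, not the authors' procedure. For comparison, a classical bootstrap of n records contains on average a fraction 1 - (1 - 1/n)^n of distinct records, which is about 0.632 for large n.

```python
import math
import random


def classical_bootstrap(n, rng):
    """Draw n record indices with replacement (standard bootstrap)."""
    return [rng.randrange(n) for _ in range(n)]


def less_redundant_bootstrap(n, rng, min_unique_ratio=0.7):
    """Draw n indices with replacement, but reject a repeated index once
    the duplicate budget is used up, so that at least
    ceil(min_unique_ratio * n) distinct records appear in the sample.
    (Hypothetical rule, for illustration only.)"""
    max_duplicates = n - math.ceil(min_unique_ratio * n)
    seen, sample, duplicates = set(), [], 0
    while len(sample) < n:
        i = rng.randrange(n)
        if i in seen:
            if duplicates >= max_duplicates:
                continue  # budget exhausted: redraw until an unseen index
            duplicates += 1
        seen.add(i)
        sample.append(i)
    return sample


def out_of_bag(n, sample):
    """Indices never drawn; these can be used to score the tree."""
    drawn = set(sample)
    return [i for i in range(n) if i not in drawn]
```

Raising `min_unique_ratio` increases the diversity of each sample but shrinks the out-of-bag set that is available for estimating tree accuracy, so the two goals have to be balanced.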

Dataset description and parameters
In this study, the proposed algorithm is executed on the five datasets shown in Table 1. The first dataset is the connectionist bench (sonar, mines vs. rocks) dataset [63]. The task is to train a network to discriminate between sonar signals reflected off a metal cylinder and those reflected off a roughly cylindrical rock. This dataset includes two files. The first, "sonar.mines", consists of 111 patterns obtained by bouncing sonar signals off a metal cylinder at different angles and under different conditions. The second, "sonar.rocks", contains 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The dataset contains signals obtained from a variety of aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. Every pattern in this dataset consists of a set of 60 numbers in the range 0.0 to 1.0. Every number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp. The label associated with every record includes ( ) if the object is a rock and ( ) if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly. The second dataset [64] was collected from phishing websites, namely the PhishTank archive, the MillerSmiles archive, and Google searching operators, while the third dataset is breast cancer Wisconsin [65]. The fourth dataset is the Ionosphere dataset, a classification of radar returns from the ionosphere [66]. Finally, the fifth dataset concerns the COVID-19 pandemic. The proposed modified random forest algorithm for null-value imputation has several parameters. Table 2 lists the range of values of each parameter.
In this scenario, four feature selection methods have been utilised in the experiments: Information Gain, Gini Index, Chi-Squared and Correlation.

Results and discussion
Several experiments have been conducted to test the proposed algorithm within the parameter ranges in Table 2. Several of these experiments used two important feature selection methods (Information Gain and Gini Index) with different weight values. The results of this study are organised as follows. Matter I: one value lost in each row; the experimental results are exhibited in Tables 3-5. Matter II: two values lost in each row; the experimental results are exhibited in Tables 6-8. Matter III: three values lost in each row; the experimental results are exhibited in Tables 9-11. The original random forest algorithm was also run on the same datasets and matters. Table 12 displays the best results of the proposed work compared with the original random forest. Undoubtedly, the problem of null values is a complex one, for several reasons: i) weakness of the datasets because there are no real associations among their attributes or features; ii) weak association of some null values with the target or with the other completed attributes/features; iii) little completed data compared with the number of null values; and iv) the nature of the dataset, which may, for instance, lack a strong association or relevance between the features and the target, or even among the features. For these reasons, some results are unsatisfactory or do not reach the desired prediction performance. In this study, the best results were obtained with number of trees = 20, W = 0.5, threshold = 0.7, and the two feature selection methods (information gain and Gini index). The performance of the modified random forest increased by 9.5%, 6.5% and 5.25% for 1, 2 and 3 null values, respectively. These results are average values over the five datasets. In addition, the nature of the dataset plays a significant role in the accuracy of null-value estimation.
Imputation with one null value per row gave good results for all five datasets, two null values per row gave lower results than one, and three null values per row gave the lowest. The breast cancer dataset gave the best results compared with the other four. The connectionist bench, phishing websites, and Ionosphere datasets gave inadequate results with two and three null values, and the performance on the COVID-19 dataset was not satisfactory.
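The three experimental matters (one, two, or three values lost in each row) can be reproduced in outline with a small evaluation harness. The sketch below is an assumption about how such an experiment might be set up, not the authors' code: `mask_rows` hides k randomly chosen cells per row, and `imputation_accuracy` compares any imputer's output with the hidden originals at exactly those positions.

```python
import random


def mask_rows(rows, k, rng):
    """Return a copy of rows in which k randomly chosen cells of every
    row are replaced by None, plus the list of masked (row, col) pairs."""
    masked, positions = [], []
    for r, row in enumerate(rows):
        new_row = list(row)
        for c in rng.sample(range(len(row)), k):
            new_row[c] = None
            positions.append((r, c))
        masked.append(new_row)
    return masked, positions


def imputation_accuracy(original, imputed, positions):
    """Fraction of masked cells whose imputed value equals the original."""
    hits = sum(original[r][c] == imputed[r][c] for r, c in positions)
    return hits / len(positions)
```

Running the same masked data through the original and the modified random forest and comparing the two accuracies for k = 1, 2, 3 gives the kind of comparison the text describes; exact equality is only meaningful for categorical or discretised values, so numeric attributes would need a tolerance or a prior discretisation step.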

CONCLUDING REMARKS AND FUTURE DIRECTION
The modified random forest algorithm introduces three modifications to increase the performance of the original algorithm: less redundancy bootstrap, hybrid features selection, and ranked voting. The first two make the random forest algorithm more efficient by selecting diverse samples through the less redundancy bootstrap and by using more than one feature selection method so that the selected features are more relevant to the target. The third bases the voting strategy on ranking the trees. Together, these three modifications gave better results than the original random forest algorithm. The experimental results for the five datasets showed improvements of 9.5%, 6.5% and 5.25% for one, two, and three null values, respectively. In the null values imputation problem, increasing the number of missing values decreases the imputation accuracy. The nature of the dataset also plays a significant role in the imputation; some datasets do not contain relevant relationships among their attributes, which leads to poorly learned rules. Such inadequate rules are not sufficient to estimate the missing values. In the future, other machine learning techniques will be applied to the problem of null values in the same datasets.