Spatial decision tree model for garlic land suitability evaluation

Received Nov 4, 2020; Revised May 19, 2021; Accepted Jun 14, 2021

Predicting land and weather characteristics as indicators of land suitability is very important for increasing the effectiveness of food production. This study aims to evaluate the suitability of land for garlic using a spatial decision tree algorithm. The algorithm improves on the conventional decision tree algorithm by including a spatial join relation when growing the tree. The spatial dataset consists of a target layer that represents garlic land suitability and ten explanatory layers that represent land and weather characteristics in the study areas of Magetan and Solok districts, Indonesia. This study generated the best spatial decision tree for each study area. On the Magetan dataset, the best model has 33 rules with 94.34% accuracy and the relief variable as the root node, whereas on the Solok dataset, the best model has 66 rules with 60.29% accuracy and the soil texture variable as the root node.


INTRODUCTION
Indonesia's difficulty in achieving garlic self-sufficiency is attributed to the lack of garlic farmland, which leaves production unable to meet the consumption needs of the Indonesian people; the government made this a main focus to be resolved in 2019 [1]. Garlic farmland covered only 2,407 hectares (ha) in 2016 [2], a decrease of 6.09 percent from the 2,563 ha recorded in 2015. The lack of significant growth in production land is one of the main reasons that demand cannot be met. This problem can be addressed by increasing the effectiveness of food production through technological advances in land suitability evaluation modeling [3].
Previous research has extensively discussed land suitability for various agricultural commodities. For example, a knowledge-based system for evaluating physical land suitability for 45 cultivated plants based on fuzzy inference was presented in [4]. Land-use suitability mapping and analysis has since become one of the most useful applications of geographic information systems (GIS) for spatial planning and management [5]. The conventional method of overlaying maps has been widely used in land suitability evaluation by integrating multi-criteria decision analysis/making (MCDA/MCDM) methods with GIS technology [6]-[9]. Artificial intelligence methods have played an important role in the development of land suitability evaluation and can address a known problem of multi-index decision-making, namely that different multi-index analyses can produce different evaluation results [10]. The GIS-MCDA/MCDM technique has been combined with the analytic hierarchy process (AHP), which computes the weights, or levels of influence, of the criteria used to evaluate land suitability [10]-[18]. The limitations of those studies are issues inherent in the AHP method, namely the inconsistency of expert judgment [13].

Int J Artif Intell, ISSN: 2252-8938. Spatial decision tree model for garlic land suitability evaluation (Andi Nurkholis)
Previous land suitability studies have not considered the spatial correlation of the data for each variable or factor. When analyzing geographically referenced data, in this case land suitability, it is essential to consider the spatial correlation of the data (i.e., position, distance, and orientation) [19]. Land suitability evaluation can be carried out as a classification task because existing garlic land suitability is available as spatial data, assessed according to the land suitability provisions in [20], namely S1 (highly suitable), S2 (moderately suitable), S3 (marginally suitable), and N (not suitable). Classification is a data extraction technique in which data stored in a database are analyzed to find rules that describe the partition of the database into a particular set of classes [21]. A spatial dataset for a classification task is composed of several explanatory layers, which in this study are the ten garlic planting criteria, and one target layer, which represents the garlic land suitability class. Each layer represents a set of spatial objects characterized by spatial attributes (polygon, line, or point) and non-spatial attributes (labels). One of the non-spatial attributes in an explanatory layer is the explanatory attribute that identifies objects in the layer. The target layer has a target attribute that stores the class labels of the target objects. In a non-spatial dataset, target classes are discrete-valued and unordered (categorical), and explanatory attributes are categorical or numerical. In this study, spatial classification is used to extract rules that split a spatial dataset of classified objects into a number of classes based on non-spatial and spatial properties, as well as on the spatial relations of the classified objects to other objects.
This study developed a classifier for evaluating garlic land suitability using a spatial decision tree algorithm. The algorithm, developed in [19], is an extension of the ID3 algorithm [22] and has been used in [23]-[26] to classify spatial data for predicting the occurrence of fires based on hotspots, with fairly good accuracies of 74.72%, 87.69%, 75.66%, and 71.66%. A recent study [26] compared classification algorithms that involve the spatial factor (spatial decision tree) with those that do not (ID3, C4.5, logistic regression) and showed that the spatial algorithm produces a model with better accuracy. The entropy and information gain formulas in the algorithm were modified to involve two types of spatial relationship, metric and topological, that relate two spatial objects. Two spatial relationships were used in [23]-[27]: 'in', with 'count' as the spatial measure, and 'distance', with 'distance' as the spatial measure. The relation 'in' is used when the target layer is represented by point features and the explanatory layer by polygon features, while the relation 'distance' is used when the target layer is represented by point features and the explanatory layer by point or line features. For example, when a point target layer is related to a polygon explanatory layer, the spatial measure is obtained by counting the number of target points that fall within the explanatory layer. In this study, by contrast, both the target layer and the explanatory layers consist of polygon features, so the spatial relation used differs from that of previous research [23]-[27]. The spatial relationship proposed in this study measures the intersection area between the target layer area and the explanatory layer area.

RESEARCH METHOD
The study areas are Magetan district, East Java province, with an area of 70,143 ha [28], and Solok district, West Sumatra province, with an area of 335,086.53 ha [29]. The two districts are predicted to become centers of garlic production for Indonesia in the future [1]. The data used in this study are ten garlic planting criteria as explanatory layers and garlic land suitability as the target layer for each district. Seven spatial criteria are in vector format and were collected from the Indonesian Center for Agricultural Land Resources Research and Development (BBSDLP). Those criteria are drainage, relief (%), base saturation (%), cation exchange capacity (cmol), soil texture, soil pH, and depth of soil mineral (cm). The three non-spatial criteria are rainfall (mm) and temperature (°C), obtained from the Meteorological, Climatological, and Geophysical Agency (BMKG), and elevation (masl), acquired in raster format from the United States Geological Survey (USGS). The non-spatial criteria need to be pre-processed before they can be integrated with the other spatial data [24]. This study was conducted in several stages, i.e., data preprocessing, spatial decision tree classification, and classification evaluation. The following three data preprocessing steps were carried out in this study:
− The first step is interpolation of the rainfall and temperature data, which produces rainfall and temperature layers in vector format. Interpolation is a mathematical method or function that predicts values at locations where data are not available. A previous study [30] compared rainfall interpolation methods that do and do not take the elevation factor into account (ordinary kriging, kriging with external drift, and ordinary co-kriging). It found that ordinary co-kriging was the best method for estimating the distribution of rainfall values, with the lowest error; therefore, this method is used in this study. In the ordinary co-kriging method, rainfall or temperature is the primary variable and elevation is the secondary variable.
− The second step is extracting the topographic data contained in the digital elevation model (DEM) to produce an elevation layer in vector format. A DEM can be used as a source of elevation data because it is a quantitative, three-dimensional representation of the earth's surface derived from elevation data [30]. A DEM is typically given in one of three formats: the raster-based grid DEM, the vector-based triangular irregular network (TIN), and the contour-based storage structure [31]. The DEM used in this study to create the elevation layer is the raster-based grid DEM acquired from USGS.
− The final step is to verify the validity of each explanatory and target layer. The cause of invalid polygon geometry is self-intersection. A self-intersecting polygon does not meet the requirements of OpenGIS [32] and therefore cannot be included in the spatial decision tree classification. Explanatory and target layers containing invalid geometry were repaired by deleting a small portion of the invalid polygon.
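The validity check in the final preprocessing step can be illustrated with a small sketch. It assumes a polygon is given as a list of (x, y) vertices without a repeated closing vertex, and it only detects edges that properly cross each other; the function names are hypothetical and this is not the repair procedure used in the study.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: > 0 if c lies to the left of the directed line a->b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1p2 and q1q2 cross at a single interior point."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def has_self_intersection(ring: List[Point]) -> bool:
    """Check every pair of non-adjacent edges of the ring for a proper crossing."""
    n = len(ring)
    edges = [(ring[i], ring[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Adjacent edges share a vertex and are skipped.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _segments_cross(edges[i][0], edges[i][1], edges[j][0], edges[j][1]):
                return True
    return False


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]  # two edges cross at (1, 1)
print(has_self_intersection(square), has_self_intersection(bowtie))
```

Touching or overlapping edges are not flagged by this simplified test; a spatial database's own validity function would normally be used instead.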

Spatial relationship, spatial entropy, and spatial information gain
Spatial data mining aims to discover hidden knowledge in spatial databases by combining the spatial and non-spatial properties that accumulate in spatial systems such as geographic information systems [33]. Spatial data mining methods are developments of those used in conventional data mining [34]. Spatial data mining has two functions [35]. The first function describes spatial phenomena by exploring data. For example, this study identifies land suitability by determining the spatial distribution of land and weather characteristics. The second function explains, or even predicts, phenomena by discovering multiple relationships. For example, in this study land suitability can be 'explained' by the land and weather characteristics at a given location.
Spatial data represent real objects based on their geographic position on the earth [26]. The objects are represented by geometries such as points, lines, polygons, and pixels. Objects in spatial data have spatial relationships with their neighbors; the relationship used in this study is topological. Topology is a spatial relation that deals with various geometric shapes, which in this study are polygons. A relation between spatial objects of two different layers is essential in spatial data mining systems [26]. Spatial relationships make it possible to include relations between two spatial objects in a dataset for a classification task. Spatial relations between two layers can produce quantitative values in the form of the distance between points or the area of the intersection of two polygons [23]. The explanatory layers and the target layer used in this study are all represented by polygons, so the proposed spatial relationship is the intersection between the target layer area and the explanatory layer area. We denote this quantitative value, i.e., the area, as the spatial measure of the spatial relationship between two objects. This spatial measure is used in the spatial entropy formula, where it replaces the number of tuples in a partition used in the non-spatial entropy formula. An illustration of the intersection area between an explanatory layer and a target layer can be seen in Figure 1. Let $L$ be a set of layers, and let $L_i$ and $L_j$ be two distinct layers in $L$. A spatial relationship applied to $L_i$ and $L_j$ is denoted $\mathrm{SpatRel}(L_i, L_j)$ and can be a topological relation or a metric relation [23]. For example, let $L_i$ be the explanatory layers and $L_j$ the target layer, with $i \neq j$, $i = 1, 2, \ldots, p$, where $p$ is the number of layers in $L_i$, and $j = 1, 2, \ldots, q$, where $q$ is the number of layers in $L_j$, which in this study is only one. For a feature $r_i$ with $R = \mathrm{SpatRel}(L_i, L_j)$, the spatial measure of $r_i$ is denoted by $\mathrm{SpatMes}(r_i)$. In this study, a new equation is formulated to measure $\mathrm{SpatMes}(r)$ in (1).
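The intersection-area spatial measure described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: it assumes polygons are lists of (x, y) vertices, that each target polygon is convex and listed counter-clockwise (so Sutherland-Hodgman clipping applies), and that the measure for a layer pair is the sum of the pairwise intersection areas. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def polygon_area(poly: Polygon) -> float:
    """Shoelace formula; returns a non-negative area."""
    s = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def _inside(p: Point, a: Point, b: Point) -> bool:
    """True if p is on or to the left of the directed clip edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0


def _line_intersection(p: Point, q: Point, a: Point, b: Point) -> Point:
    """Intersection of the line through p, q with the line through a, b."""
    x1, y1 = p
    x2, y2 = q
    x3, y3 = a
    x4, y4 = b
    den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    c1 = x1 * y2 - y1 * x2
    c2 = x3 * y4 - y3 * x4
    return ((c1 * (x3 - x4) - (x1 - x2) * c2) / den,
            (c1 * (y3 - y4) - (y1 - y2) * c2) / den)


def clip_convex(subject: Polygon, clip: Polygon) -> Polygon:
    """Sutherland-Hodgman clipping of `subject` against a convex, CCW `clip`."""
    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        current, output = output, []
        if not current:
            break
        prev = current[-1]
        for cur in current:
            if _inside(cur, a, b):
                if not _inside(prev, a, b):
                    output.append(_line_intersection(prev, cur, a, b))
                output.append(cur)
            elif _inside(prev, a, b):
                output.append(_line_intersection(prev, cur, a, b))
            prev = cur
    return output


def intersection_area(explanatory: Polygon, target: Polygon) -> float:
    clipped = clip_convex(explanatory, target)
    return polygon_area(clipped) if len(clipped) >= 3 else 0.0


def spatial_measure(explanatory_polys: List[Polygon], target_polys: List[Polygon]) -> float:
    """Sum of intersection areas over all explanatory/target polygon pairs."""
    return sum(intersection_area(e, t) for e in explanatory_polys for t in target_polys)


explanatory = [(0, 0), (4, 0), (4, 4), (0, 4)]
target = [(2, 2), (6, 2), (6, 6), (2, 6)]
print(intersection_area(explanatory, target))
```

In practice the intersection and area would be computed by the spatial database on projected coordinates; the sketch only makes the quantity concrete.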
\[
\mathrm{SpatMes}(r) = f\bigl(\mathrm{SpatMes}(L_{i1} \cap L_{j1}),\ \mathrm{SpatMes}(L_{i2} \cap L_{j2}),\ \ldots,\ \mathrm{SpatMes}(L_{im} \cap L_{jn})\bigr) \tag{1}
\]
where
− $f$: sum function
− $m$: number of polygons in $L_i$
− $n$: number of polygons in $L_j$

In a spatial database, a layer is represented as a relation, and applying a spatial relation between two layers results in a new relation. A spatial relationship applied to $L_i$ and $L_j$ in $L$ results in a new layer $R$. A spatial join relation (SJR) for all features $p$ in $L_i$ and $q$ in $L_j$ is formulated in (2) [23]:
\[
\mathrm{SJR} = \{(p, \mathrm{SpatMes}(r), q) \mid p \text{ in layer } L_i,\ q \text{ in layer } L_j,\ \text{and } r \text{ is a feature in } R \text{ associated with } p \text{ and } q\} \tag{2}
\]
Let a target attribute $C$ in a target layer $S$ have $l$ distinct classes (i.e., $c_1, c_2, \ldots, c_l$). The spatial entropy for $S$ represents the expected information needed to determine the class of tuples in the dataset, which is formulated in (3).

The variable with the highest spatial information gain is selected as the first node of the spatial decision tree, known as the root. The following nodes are filled sequentially with variables of lower gain. The spatial decision tree stops growing when one of the following termination criteria is fulfilled [23]:
− Only one explanatory layer remains in $L$. In this condition, the algorithm returns a leaf node labeled with the majority class in the SJR for the best layer and the explanatory layer.
− The SJR for the best layer and the explanatory layer contains only one class $c$. The algorithm then returns a leaf node labeled with the class $c$.
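The role of the spatial measure in entropy and information gain can be sketched as follows. This is an illustration under a stated assumption: it uses the standard ID3 entropy and gain with tuple counts replaced by intersection areas, as the text describes, and may differ in detail from the formulas in [23]. The row layout (explanatory value, area, target class) and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical SJR row: (explanatory attribute value, intersection area, target class)
SJRRow = Tuple[str, float, str]


def spatial_entropy(rows: List[SJRRow]) -> float:
    """Entropy over target classes, weighting each class by its total area."""
    total = sum(area for _, area, _ in rows)
    if total == 0:
        return 0.0
    area_per_class: Dict[str, float] = defaultdict(float)
    for _, area, cls in rows:
        area_per_class[cls] += area
    return -sum((a / total) * math.log2(a / total)
                for a in area_per_class.values() if a > 0)


def spatial_information_gain(rows: List[SJRRow]) -> float:
    """Entropy reduction obtained by partitioning rows on the explanatory value."""
    total = sum(area for _, area, _ in rows)
    if total == 0:
        return 0.0
    partitions: Dict[str, List[SJRRow]] = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    conditional = sum(
        (sum(area for _, area, _ in part) / total) * spatial_entropy(part)
        for part in partitions.values()
    )
    return spatial_entropy(rows) - conditional


relief_sjr = [("steep", 30.0, "S1"), ("steep", 10.0, "S1"), ("flat", 40.0, "S3")]
print(spatial_entropy(relief_sjr), spatial_information_gain(relief_sjr))
```

The layer whose SJR yields the largest gain would be chosen as the splitting layer at each node.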

Spatial decision tree
The spatial decision tree technique uses the basic concept of a decision tree, which is a tree structure in which each node represents a variable, each branch represents an attribute value, and each leaf node represents a certain class [36]. A spatial decision tree is a rooted tree that meets the following criteria: (1) each internal node is a decision node over a layer, (2) each branch denotes an outcome of the test, and (3) each leaf represents one of the class values [19]. Figure 2 shows our proposed algorithm for generating a spatial decision tree, which builds on the algorithm developed in [23]. The algorithm inputs are divided into two groups: (1) a set of layers containing several explanatory layers and one target layer that holds the class labels for tuples in the dataset, and (2) spatial join relations (SJRs) that store the spatial measures of features resulting from spatial relations between two layers. The algorithm generates a tree by selecting the best layer to separate the dataset into smaller partitions that are as pure as possible, meaning that all tuples in a partition belong to the same class.
The algorithm works on spatial data stored in a spatial database [26]. When the algorithm is applied to the data in the database, new layers are produced as the result of spatial relations between two distinct layers. These new layers are created from the existing explanatory layers and the value $v_i$ of the predictive attribute in the best splitting layer [26]. The value $v_i$ is a selection criterion in the query that relates an explanatory layer to the best layer. The new layers are then used to calculate the spatial information gain at each branch of the root, which produces the internal nodes or leaves that make up the spatial decision tree, as illustrated in Figure 3.

Figure 3. Step by step of spatial decision tree
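The recursive growth of the tree can be sketched as below. This is a simplified illustration, not the algorithm of Figure 2: it assumes the layers have already been overlaid into hypothetical "units", each carrying one value per explanatory layer, a class, and an area, so no new layers are created by spatial queries at each split. The majority class is taken by area, which is also an assumption, and all names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Any, Dict, List, Union

Unit = Dict[str, Any]           # hypothetical record: layer values, "class", "area"
Tree = Union[str, Dict[str, Any]]


def area_by_class(units: List[Unit]) -> Dict[str, float]:
    totals: Dict[str, float] = defaultdict(float)
    for u in units:
        totals[u["class"]] += u["area"]
    return totals


def entropy(units: List[Unit]) -> float:
    totals = area_by_class(units)
    total = sum(totals.values())
    return -sum((a / total) * math.log2(a / total) for a in totals.values() if a > 0)


def partition(units: List[Unit], layer: str) -> Dict[str, List[Unit]]:
    parts: Dict[str, List[Unit]] = defaultdict(list)
    for u in units:
        parts[u[layer]].append(u)
    return parts


def gain(units: List[Unit], layer: str) -> float:
    total = sum(u["area"] for u in units)
    conditional = sum(
        (sum(u["area"] for u in part) / total) * entropy(part)
        for part in partition(units, layer).values()
    )
    return entropy(units) - conditional


def generate_sdt(units: List[Unit], layers: List[str]) -> Tree:
    totals = area_by_class(units)
    # Leaf: the partition is pure, or only one explanatory layer remains.
    if len(totals) == 1 or len(layers) == 1:
        return max(totals, key=totals.get)
    best = max(layers, key=lambda layer: gain(units, layer))
    remaining = [layer for layer in layers if layer != best]
    return {
        "layer": best,
        "branches": {value: generate_sdt(part, remaining)
                     for value, part in partition(units, best).items()},
    }


units = [
    {"relief": "steep", "rainfall": "low", "texture": "fine", "class": "S1", "area": 5.0},
    {"relief": "steep", "rainfall": "high", "texture": "fine", "class": "S1", "area": 3.0},
    {"relief": "flat", "rainfall": "low", "texture": "fine", "class": "S3", "area": 4.0},
    {"relief": "flat", "rainfall": "high", "texture": "fine", "class": "S2", "area": 6.0},
]
print(generate_sdt(units, ["relief", "rainfall", "texture"]))
```

Each root-to-leaf path of the resulting structure corresponds to one IF-THEN rule.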

Confusion matrix
Classification accuracy is evaluated by testing the classification rules on actual data so that the rules can be corrected in subsequent iterations. The higher the accuracy, the lower the classification error on the test data. Accuracy was obtained from the Magetan and Solok data using the confusion matrix in (6) [37].
\[
\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{6}
\]
where
− tp (true positive): number of positive data that are correctly classified
− tn (true negative): number of negative data that are correctly classified
− fp (false positive): number of negative data that are incorrectly classified
− fn (false negative): number of positive data that are incorrectly classified
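As a small worked example of accuracy computed from confusion-matrix counts, the sketch below assumes the usual definition, correct predictions divided by all predictions. The counts are invented for illustration and are not results from this study.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified data among all classified data."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts, for illustration only.
print(f"{accuracy(tp=40, tn=45, fp=5, fn=10):.2%}")
```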

RESULTS AND DISCUSSION
The preprocessing produces ten explanatory layers and one target layer, all of which are ready to be used for spatial decision tree classification. All explanatory layers and the target layer are stored in a spatial database to be executed by the algorithm in Figure 2. The list of layer names, the number of polygons in each layer, and the attribute names in each layer can be seen in Table 1.

Algorithm: Generate_SDT (Spatial Decision Tree)
Input:
  a. Spatial dataset D, which is a set of training tuples and their associated class labels. These tuples are constructed from a set of layers, P, using spatial relations.
  b. A target layer S ∈ P with a target attribute C.
  c. A non-empty set of explanatory layers L ⊆ P, where each layer L_i ∈ L has a predictive attribute V. P = S ∪ L.
  d. Spatial join relation (SJR) on the set of layers P, SJR(P), as defined in (2).
Output: A spatial decision tree
Method:
  Create a node N;
  If only one explanatory layer in L then
    return N as a leaf node labeled with the majority class in D; // majority voting
  endif
  If objects in D are all of the same class c then
    return N as a leaf node labeled with the class c;
  endif
  Apply layer_selection_method(D, L, SJR(P)) to find the "best" splitting layer, L*;
  Label node N with L*;

Spatial decision tree for land suitability
In this study, three models were built for each of the Magetan and Solok datasets with the aim of obtaining the best rules. The models built from the Magetan data are denoted by A, while the models built from the Solok data are denoted by B. The differences between the Magetan and Solok model variations are described in Table 2. The model variations for Magetan can be seen in Table 3, and those for Solok in Table 4.

Table 2.
− A1 and B1: optimization of the A0 and B0 models by adding the condition that the spatial relation result is ≥ 1 ha in the SJR process. This is based on the smallest garlic farmland being 1 ha [38], so an area < 1 ha is assumed not to represent the land suitability class related to the explanatory factor.
− A2 and B2: optimization of the A1 and B1 models by deleting planting areas of < 1 ha in the new layer process. This is based on the assumption that polygons with an area below 1 ha will also produce a spatial relation result below 1 ha.

Table 3. Spatial decision tree model of Magetan

Based on Tables 3 and 4, the following can be concluded: (i) Adding a condition to the SJR process has a positive impact: fewer layers and rules are generated, and the result shows which variable is not involved in the spatial decision tree. A variable that does not appear in the model is considered not so important for garlic land suitability. That variable is drainage; its values are so diverse that the model cannot determine the land suitability class from it. (ii) Deleting the planting area in the new layer process has a positive impact: fewer rules are produced, and no rule is left without a class.
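The ≥ 1 ha condition on the SJR process can be pictured as a simple filter on the spatial measure. The sketch below assumes an SJR row is a tuple of (explanatory value, intersection area in ha, target class); the row layout, constant, and function name are hypothetical.

```python
from typing import List, Tuple

# Hypothetical SJR row: (explanatory value, intersection area in ha, target class)
SJRRow = Tuple[str, float, str]

MIN_AREA_HA = 1.0  # smallest garlic farmland considered, per the text


def filter_sjr(rows: List[SJRRow], min_area_ha: float = MIN_AREA_HA) -> List[SJRRow]:
    """Keep only rows whose intersection area is at least `min_area_ha`."""
    return [row for row in rows if row[1] >= min_area_ha]


rows = [("steep", 12.5, "S1"), ("steep", 0.4, "S2"), ("flat", 1.0, "S3")]
print(filter_sjr(rows))
```

Dropping the small slivers before computing entropy keeps them from creating branches that represent almost no land.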

Spatial decision tree evaluation
The evaluation was carried out on the six models that were produced, using two testing datasets, namely Magetan and Solok. Evaluation was done by applying the confusion matrix in (6) to the results of applying the rules to the test data. The evaluation results can be seen in Table 5.

Table 5 notes: * number of data correctly classified by the rules; ** number of data incorrectly classified by the rules; *** number of data that cannot be classified by the rules.

Based on Table 5, it can be concluded that adding a condition to the SJR process yields higher accuracy, while deleting the planting area in the new layer process does not affect accuracy. Overall, the Magetan and Solok models give good results when tested on data from their own district, that is, when the Magetan model is tested with Magetan data and the Solok model is tested with Solok data. However, when a model is applied to data from the other district, the accuracy decreases. This is probably due to quite significant differences in the characteristics of the two districts, so the rules obtained in one district can only represent that district. This is supported by the following:
− The amount of unclassified data was higher than the amount of classified data when the Solok model was tested on Magetan data and vice versa. This can be seen in the A0 and B0 models: when tested with data from the other district, the unclassified results outnumber the true and false results.
− The variable selected as the root node differs between the Magetan and Solok models. All Magetan models have the relief variable as the root node, while all Solok models have the soil texture variable as the root node.
− Some attribute values of a variable in the Magetan data do not occur in the Solok data and vice versa. The temperature in Magetan only ranges from 23-25°C, while the temperature in Solok ranges from 25-26°C. This means that Solok testing data with a temperature value of 26°C cannot be classified by the Magetan model, which only covers a temperature range of 23-25°C.

Based on the model evaluation, it can be concluded that the best Magetan model is the A2 model, which has higher accuracy than the A0 model and fewer rules than the A1 model. The best Solok model is the B2 model, which has higher accuracy than the B0 model and fewer rules than the B1 model. Some example rules from the A2 model are:
− IF relief = steep AND elevation = slightly low AND soil pH = slightly acid AND depth of soil mineral = deep AND cation exchange capacity = medium THEN garlic land suitability class = S1 (highly suitable)
− IF relief = steep AND elevation = slightly low AND soil pH = slightly acid AND depth of soil mineral = medium THEN garlic land suitability class = S2 (moderately suitable)
− IF relief = slightly flat AND rainfall = slightly low THEN garlic land suitability class = S3 (marginally suitable)
− IF relief = steep AND elevation = slightly high AND temperature = 24°C AND cation exchange capacity = low AND rainfall = slightly high AND depth of soil mineral = deep THEN garlic land suitability class = S1 (highly suitable)
− IF relief = slightly flat AND rainfall = slightly high THEN garlic land suitability class = S2 (moderately suitable)
− IF relief = flat AND rainfall = slightly low AND depth of soil mineral = very steep THEN garlic land suitability class = S3 (marginally suitable)

Visualization is applied to the best spatial decision tree rules, where the A2 model is applied to the Magetan data and the B2 model is applied to the Solok data. The resulting land suitability visualization for Magetan and Solok districts can be seen in Figure 4.
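How such IF-THEN rules classify a record, and why a record can end up unclassified, can be sketched as follows. The three rules are taken from the A2 examples listed above; the dictionary representation and the `classify` function are hypothetical and not the study's implementation.

```python
from typing import Dict, List, Optional, Tuple

# A rule is a set of attribute = value conditions plus the resulting class.
Rule = Tuple[Dict[str, str], str]

RULES: List[Rule] = [
    ({"relief": "steep", "elevation": "slightly low", "soil pH": "slightly acid",
      "depth of soil mineral": "deep", "cation exchange capacity": "medium"}, "S1"),
    ({"relief": "steep", "elevation": "slightly low", "soil pH": "slightly acid",
      "depth of soil mineral": "medium"}, "S2"),
    ({"relief": "slightly flat", "rainfall": "slightly low"}, "S3"),
]


def classify(record: Dict[str, str], rules: List[Rule] = RULES) -> Optional[str]:
    """Return the class of the first matching rule, or None if unclassified."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return None


matched = {"relief": "slightly flat", "rainfall": "slightly low"}
unmatched = {"relief": "slightly flat", "rainfall": "very high"}  # value absent from the rules
print(classify(matched), classify(unmatched))
```

A record whose attribute value never occurs in the training district, such as the 26°C temperature mentioned above, matches no rule and is counted as unclassified.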

CONCLUSION
This work applied the spatial decision tree algorithm to a spatial garlic land suitability dataset for the study areas of Magetan and Solok districts, Indonesia. A spatial dataset is composed of a set of layers divided into two categories, i.e., explanatory layers and a target layer. The explanatory layers are ten garlic planting criteria, i.e., elevation, drainage, relief, base saturation, cation exchange capacity, soil texture, soil acidity, mineral soil depth, rainfall, and temperature. The target layer is garlic land suitability, which has three classes, i.e., highly suitable, moderately suitable, and marginally suitable. The result is the two best spatial decision trees for land suitability evaluation. The Magetan model has 33 rules, an accuracy of 94.34%, and the relief variable as the root node, while the Solok model has 66 rules, an accuracy of 60.29%, and the soil texture variable as the root node. The variable that is not involved in either of the two best spatial decision trees is drainage, meaning that drainage is not so important in determining garlic land suitability. A weakness of the two best spatial decision trees is that their accuracy decreases when they are tested with data from the other district, due to differences in the characteristics of the two districts. Future work is expected to cover: (1) development of a land suitability geographic information system with interactive map visualization, and (2) addition of a land cover factor to obtain land suitability rules with more specific land characteristics, so as not to disturb designated land uses such as protected forests.