Novel approach of association rule mining for tree canopy assessment

Received Oct 30, 2020; Revised May 19, 2021; Accepted Jun 4, 2021

The evolution of technology and the availability of voluminous satellite imagery are changing satellite image classification, and a performance-efficient method for predictive analysis of satellite images for land cover classification needs to be devised. As urban areas are growing at a fast rate, special attention needs to be given to the tree canopy assessment problem. Vegetation indices are calculated from the spectral information of satellite images, and hundreds of such indices are available for detecting vegetation in a satellite image. The contribution of this paper is an improved Apriori algorithm that selects an optimal number of vegetation indices for tree canopy assessment. The proposed computational approach uses a greedy strategy based on the Apriori algorithm to select an optimal combination of vegetation indices and then applies principal component analysis to it. The study emphasizes tree canopy assessment in a GPU-enabled environment for performance-efficient processing. The results achieved are comparable to state-of-the-art techniques, with an accuracy of 96%. The research considers four years of data for the city of Mumbai, India, and is useful to the Green India Mission for assessing the tree canopy of urban regions.


INTRODUCTION
The National Mission for a Green India, also called the Green India Mission (GIM), is a mission of India to enhance the tree canopy of the city and thereby reduce pollution. It aims at highlighting the challenge of environmental change and is one of the eight missions illustrated under the development activity of India. Trees are an important part of the environment and of human life. It is difficult to assess the tree canopy in an urban area like Mumbai because of rapid change and industrial development. Through the Green India Mission, the government is targeting problems such as deforestation and the restoration of tree canopy. This research addresses the problem of tree canopy assessment in an urban area with the help of satellite image classification. Biological evaluation of an urban tree canopy gives significant data to urban planning and management, which can be used to protect and enhance the environmental benefits of the urban tree canopy. The main challenge in this field is the availability of high-resolution multispectral remotely sensed satellite images and of computing platforms. The prime motivation of this research is to assess the tree canopy, which is one of the aims of the Green India Mission. Tree canopy assessment is treated here as a satellite image classification problem, and the goal is to solve it on a high-performance platform with good accuracy. This article is organized as follows. Section 2 presents the literature survey for vegetation analysis, including a summary table of research papers that have used Sentinel 2 as a dataset for vegetation analysis. Section 3 covers the mathematical model, the dataset, preprocessing of the Sentinel-2 dataset, the study area, the computing platform, and the machine learning algorithm. Section 4 presents and discusses the results obtained with the LibSVM classification algorithm.

LITERATURE SURVEY
Tree canopy assessment in an urban area is a challenge because of the networks of roads and buildings. It is treated as a satellite image classification problem [1] and has attracted many researchers because it affects human life. Satellite images available on the internet cannot be used directly to analyse the tree canopy, so this research applies classification to satellite images to assess it. Satellite images are characterized by their spatial, spectral, and temporal resolution [2]. A typical optical satellite image consists of a number of spectral bands. Spectral vegetation indices are ratios of band information, calculated from the spectral information of optical multispectral satellite images. According to Bannari et al. [3], forty such indices are available for analysing the vegetation of a geographical area. Tree canopy assessment and analysis involve phenological analysis using vegetation indices [4]. A fusion of images from Sentinel 1 with the optical features of Sentinel 2 images provided a breakthrough in vegetation phenology analysis for vegetation management. Stendardi et al. [5] and Heckel et al. [6] proposed a novel idea of correlation analysis of Sentinel 1 (VV and VH) data for phenological vegetation analysis of the South Tyrol area. A support vector machine (SVM) approach [7] has been proposed for tree canopy assessment using Sentinel-2 images [3]. That work also focuses on the strength of Sentinel-2 images for tree canopy assessment, with the Knyszyn forest and a forest landscape park in Poland as the study area. Wang et al. [8] studied the calculation of the difference between two images. The resulting difference image is used as the input to supervised classifiers, namely SVM, K-nearest neighbour, ensemble methods, and random forest. The results of these classifiers are combined using an ensemble-based method, and change is detected using a weighted voting method. Spectral vegetation indices calculated from the spectral information of satellite images are one of the prominent tools. Table 1 gives brief information on the techniques that used the Sentinel 2 dataset. Texture or spatial information has been used for classification of Sentinel 2 data [7]. The normalized difference vegetation index (NDVI) is used for vegetation analysis. Wang et al. [14] use the leaf area index (LAI) for vegetation analysis with remotely sensed satellite imagery. Tree canopy analysis is carried out using supervised [15] as well as unsupervised [16] machine learning algorithms. Both object-based and pixel-based techniques are available among the supervised algorithms. Object-based supervised classification is reported to be better than pixel-based classification, with high accuracy in less computation time. Xue et al. discuss a hundred different types of vegetation indices. Each vegetation analysis is associated with one or more applications, according to the vegetation of interest and the environmental conditions, with its own statistical implementation and precision. Vegetation indices are also applied to hyperspectral images and on UAV platforms.
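Since vegetation indices are described above as ratios of band information, a minimal sketch of the most common one, NDVI, may help. It uses the standard definition (NIR - Red) / (NIR + Red); for Sentinel-2 these are bands B8 and B4. The reflectance values and the zero-denominator handling are illustrative assumptions, not taken from the paper.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for a single pixel."""
    denom = nir + red
    if denom == 0:
        # Assumption: an all-zero pixel is given an index of 0.
        return 0.0
    return (nir - red) / denom


# Hypothetical surface reflectances for a vegetated and a built-up pixel.
vegetated = ndvi(nir=0.45, red=0.05)
built_up = ndvi(nir=0.22, red=0.20)
print(round(vegetated, 3), round(built_up, 3))
```

Dense vegetation reflects strongly in the NIR band and absorbs red light, so the vegetated pixel scores much higher than the built-up one.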

PROPOSED METHODOLOGY
The proposed method is based on intelligently selecting an optimal number of vegetation indices for tree canopy assessment. The novelty of the algorithm lies in choosing the optimum number of principal component analysis (PCA) components for the image with selected bands, together with the optimum number of vegetation indices, to achieve better classification accuracy on the Sentinel 2 dataset. Figure 1 shows the steps followed in classifying the Sentinel 2 dataset for tree canopy assessment of Mumbai and its suburban regions. The proposed system works in several stages. It first acquires satellite images of the Mumbai region for a particular time period. Preprocessing is then carried out on this data to remove cloud and noise information.

Algorithm for optimum indices selection
The intelligent module, based on a modified Apriori approach, calculates vegetation indices and selects an optimal number of them [17]. It is a greedy approach: it makes the best available selection at each step as it searches for an effective combination of vegetation indices. Algorithm 1 gives the steps carried out to select the optimum number of indices from the set of indices. The algorithm starts with the set of indices and their corresponding accuracies, which are either calculated for the individual vegetation indices or taken from the literature survey. A minimum threshold is selected by reviewing the literature on vegetation indices. The algorithm has two stages, joining and pruning. In the joining stage, a new candidate set is generated by joining the previous candidate set with itself. Each combination generated then goes through the pruning stage, where its accuracy is compared with the minimum threshold and unwanted combinations of vegetation indices are removed. PCA is used for dimensionality reduction, that is, to reduce the input given to the classifier, by selecting an appropriate number of coefficients. LibSVM is then applied to obtain the classification output. The research uses an SVM, a binary classifier that returns the class of each pixel of a satellite image. The algorithm used for optimal indices selection for classification of the Sentinel 2 dataset of the Mumbai region can be expressed as pseudocode. The intelligent module in the proposed model selects an appropriate number of vegetation indices from the given set of indices. The selection follows the logic of the Apriori algorithm of association rule mining, using the threshold (accuracy) to select an appropriate number of indices.
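The joining and pruning stages described above can be sketched as follows. This is a minimal illustration and not the authors' Algorithm 1. The function name `select_indices`, the scoring callable `score_combo`, the accuracy values, and the toy scoring rule are all hypothetical. In practice the score of a combination would come from the classifier or from the literature.

```python
from statistics import mean

# Hypothetical per-index accuracies (illustrative values only).
ACCURACY = {"NDVI": 0.90, "EVI": 0.88, "SAVI": 0.86, "NDWI": 0.70}


def toy_score(combo):
    """Assumed stand-in score: mean accuracy plus a small bonus per extra index."""
    return mean(ACCURACY[name] for name in combo) + 0.02 * (len(combo) - 1)


def select_indices(accuracy, min_threshold, score_combo):
    """Apriori-style search: join surviving sets, prune those below threshold."""
    # Level 1: single indices that already meet the threshold.
    current = {frozenset([n]) for n, a in accuracy.items() if a >= min_threshold}
    best = max(current, key=score_combo, default=frozenset())
    size = 1
    while current:
        size += 1
        # Joining stage: combine the previous candidate set with itself.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Pruning stage: drop combinations below the minimum threshold.
        current = {c for c in candidates if score_combo(c) >= min_threshold}
        for combo in current:
            if score_combo(combo) > score_combo(best):
                best = combo
    return best


selected = select_indices(ACCURACY, min_threshold=0.85, score_combo=toy_score)
print(sorted(selected))
```

Because pruned sets are never joined again, an index that fails the threshold on its own (NDWI in this toy data) cannot reappear in a larger combination, which keeps the number of candidates small.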
The algorithm removes a drawback of the Apriori algorithm of association rule mining. The Apriori algorithm is used to find frequent itemsets [18]. It is a simple algorithm for finding association rules in a given set of data items, but it has two major drawbacks: it requires multiple scans of the dataset, and it generates too many candidate sets. This research addresses the first issue [19]. Multiple scans of the dataset are avoided by gathering enough information from the literature survey about the accuracy of different vegetation indices; a critical literature survey is also used to set the threshold for the algorithm. Figure 1 shows the functioning of this algorithm to find the optimum number of vegetation indices. PCA transforms images into a group of bands [20]. A dataset has many features, and many of them are correlated. PCA reduces the number of bands in the feature space and thus the computational complexity. It takes fifteen bands of an image, including the vegetation index bands, as input and produces fifteen bands as output. This research determines the number of principal components to retain from the PCA output by measuring the variance ratio of the principal components. This research uses supervised classification [21] with SVM classifiers, which map classes to pixels. The SVM maximizes the separation between classes using the training dataset and annotates pixels by examining their closest class in feature space. This research involves binary classification.
The training dataset is divided into two classes, tree and non-tree regions. It is created with the help of the Google earth engine (GEE) [22]. This research uses LibSVM, one of the libraries available for implementing SVM algorithms [23]. The algorithm works in two stages: in the first stage the model is trained on a training dataset of satellite images, and in the second stage it is tested on a test dataset of images. The training dataset of this research consists of 2200 geometrical objects. We tried different splits between the training and testing phases (85%/15%, 80%/20%, 75%/25%, and 70%/30%) and observed the best accuracies with 75% of the data for training and 25% for testing, so this split is used. The input to the algorithm is the set of thirteen bands from Sentinel 2 images together with the output of the different vegetation indices and PCA. The optimum number of indices is selected from the available indices to obtain better accuracy of vegetation analysis. After experimenting with different kernels and gamma values, the research uses the radial basis function (RBF) kernel with a gamma value of 0.5 and a cost of 20. The accuracy of the algorithm is assessed with the help of Kappa coefficients.
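As an illustration of how overall accuracy and the Kappa coefficient mentioned above follow from a confusion matrix, here is a minimal sketch using the standard definition of Cohen's kappa. The confusion matrix values are hypothetical and are not results from this study.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    # Chance agreement from the row (actual) and column (predicted) totals.
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical tree / non-tree confusion matrix (rows: actual, columns: predicted).
confusion = [[45, 5],
             [5, 45]]
print(overall_accuracy(confusion), cohen_kappa(confusion))
```

Kappa is lower than overall accuracy whenever part of the agreement could arise by chance, which is why the two are usually reported together.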

Mathematical model for tree canopy assessments
Int J Artif Intell, ISSN: 2252-8938 (Nilkamal More)

In this research, tree canopy assessment is treated as a satellite image classification problem. The satellite image is represented by a matrix $S$, where $S$ is an $m \times n$ matrix. For satellite image classification applications, each row of $S$, the $n$-vector $x_i$, contains the values at each wavelength of one spectrum sample, and each column $\lambda_j$ contains all the observations of one attribute. PCA is used to overcome the problems caused by a large number of dimensions; this process is known as dimensionality reduction. PCA transforms the inputs $\lambda_1, \lambda_2, \ldots, \lambda_N$ into another set of column vectors $\mu_1, \mu_2, \ldots, \mu_N$, with the property that the information content of the input data is stored in the first few coefficients, called the principal component scores. The new features are orthogonal to each other, and the dimensions of the output are reduced by discarding some of the components. The input matrix $\lambda$ is specified by a $\beta \times \gamma$ matrix. The scaled value $z$ is calculated with (1):

$$z = \frac{y_i - \mu}{\sigma} \qquad (1)$$

where $y_i$ is the initial value, $\mu$ is the mean, and $\sigma$ is the standard deviation. PCA generates new coefficients which are independent of each other. The covariance between any two variables $\nu_1$ and $\nu_2$ is calculated using (2) for $m$ such components:

$$\operatorname{cov}(\nu_1, \nu_2) = \frac{1}{m-1} \sum_{i=1}^{m} (\nu_{1i} - \bar{\nu}_1)(\nu_{2i} - \bar{\nu}_2) \qquad (2)$$

Then the eigenvectors and eigenvalues are derived. In general, an eigenvector of a matrix $S$ is a vector $\nu$ that satisfies the relationship in (3):

$$S\nu = \gamma\nu \qquad (3)$$

where $\gamma$ is a scalar value called the eigenvalue. The linear transformation is defined by the formula given in (4):

$$(S - \gamma I)\,\nu = 0 \qquad (4)$$

where $I$ is the identity matrix. The next step in PCA is to choose the $\eta$ eigenvectors with the largest eigenvalues: the eigenvectors are sorted in descending order of eigenvalue and the first $\eta$ of them are selected. The value of $\eta$ is the number of dimensions desired in the derived dataset. The data is then mapped to the generated feature space by re-organizing it from the original space to the feature space represented by the principal components, as in (5):

$$\text{Final Changed Data} = \text{Features} \times Z^{T} \qquad (5)$$

where $Z^{T}$ is the transpose of $Z$. This final transformed data is given as input to the satellite image classification algorithm.

Satellite image classification can be modelled mathematically using Cellular Automata (CA). A CA model considers a vast array of cells, each with a predefined number of states, which change at distinct time intervals according to certain transition rules. Along the same lines, a satellite image has a large number of pixels with a finite number of classes, which change at discrete intervals of time. The model of satellite image classification using SVM can therefore be defined as follows. Let the current state of a pixel p of a satellite image at time t be given by (6). In (6), the state is described by a binary variable indicating whether or not the satellite image data is of a particular type, and by a transition indicator function that shows whether or not the pixel under consideration has changed at the current timestamp. This update rule for the transition indicators can be modelled using supervised classification techniques like SVMs.
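The PCA steps described in this section (scaling, covariance, eigenvalues, and choosing the number of components) can be traced on a toy two-band example. This is a minimal sketch under stated assumptions: the band values are invented, the sample standard deviation is used for scaling, the closed-form eigenvalue formula applies only to a symmetric 2×2 matrix, and the 0.9 variance target and the helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardize(values):
    """Scale values to zero mean and unit (sample) standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def covariance(a, b):
    """Sample covariance between two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def eigenvalues_2x2(s11, s12, s22):
    """Eigenvalues of a symmetric 2x2 matrix, i.e. roots of det(S - gamma*I) = 0."""
    root = sqrt((s11 - s22) ** 2 + 4 * s12 * s12)
    return (s11 + s22 + root) / 2, (s11 + s22 - root) / 2


def components_needed(eigenvalues, target):
    """Smallest number of components whose explained variance reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value / total
        if running >= target:
            return count
    return len(ordered)


# Hypothetical pixel values for two correlated bands.
band_a = standardize([2, 4, 4, 4, 5, 5, 7, 9])
band_b = standardize([1, 3, 5, 4, 6, 5, 8, 10])

lam1, lam2 = eigenvalues_2x2(
    covariance(band_a, band_a),
    covariance(band_a, band_b),
    covariance(band_b, band_b),
)
print(components_needed([lam1, lam2], target=0.9))
```

Because the two bands are strongly correlated, one component carries most of the variance. The same variance-ratio rule, applied to fifteen eigenvalues instead of two, is what decides how many components to keep.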

RESULTS AND DISCUSSION
This section first describes the Sentinel 2 dataset for the Mumbai and Navi Mumbai region and the data cleaning carried out on it using cloud removal. It then describes the geographic location used for the study and the computing platforms used in this research, Google earth engine (GEE) and Google Colaboratory. Finally, the results obtained in the empirical study are discussed.

Dataset
Sentinel-2 images are collected by two European satellites. They provide a wide swath (up to 290 km) and a 5-day revisit frequency. The multispectral Sentinel-2 dataset has 13 spectral bands: the visible (red, green, blue) and near-infrared (NIR) bands at 10 m, the red edge and SWIR bands at 20 m, and the atmospheric bands at 60 m spatial resolution. Sentinel-2 is useful in many applications, such as vegetation change detection, water body detection, and soil texture analysis, in coastal as well as urban areas. The Sentinel-2 images are downloaded from the Google earth engine (GEE) or Scihub. The capability of an instrument to distinguish differences in light intensity and reflectance is called the radiometric resolution of a satellite image. The greater the radiometric resolution, the more accurately the image is sensed. Radiometric resolution is expressed in bits, typically in the range of eight to sixteen bits.

Data cleaning using cloud removal
A remotely sensed satellite image has to be cleaned before classification algorithms are applied, as noisy images often lead to ambiguous results. The data cleaning approach used here reduces annotation unpredictability and salt-and-pepper noise. In this research, a cloud masking technique is applied prior to the classification process. It uses the metadata available with the satellite image dataset on Google earth engine. Data cleaning involves the removal of noise from the satellite images of the Sentinel-2 dataset. The algorithm for pre-processing Sentinel 2 imagery is explained in algorithm 3.
4. m_k = Band_qa.bitwiseAnd(cloudRemovalBitMask).eq(0).and(qa.bitwiseAnd(cirrusRemovalBitMask).eq(0));
5. return (Updated_Data.updateMask(m_k).divide(10000)); }
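The bitwise test in steps 4 and 5 can be illustrated on single pixel values. The sketch below is plain Python and does not use the Earth Engine API. It assumes the cloud and cirrus flags are bits 10 and 11 of the Sentinel-2 QA60 band, which the text above does not state, and the function names are hypothetical.

```python
CLOUD_BIT = 1 << 10   # assumption: opaque-cloud flag in the QA60 band
CIRRUS_BIT = 1 << 11  # assumption: cirrus flag in the QA60 band
SCALE = 10000         # divisor applied in step 5 of the algorithm


def is_clear(qa):
    """True when neither the cloud bit nor the cirrus bit is set."""
    return (qa & CLOUD_BIT) == 0 and (qa & CIRRUS_BIT) == 0


def clean_pixel(value, qa):
    """Return the scaled value for a clear pixel, or None for a masked one."""
    return value / SCALE if is_clear(qa) else None


# Hypothetical (band value, QA value) pairs: clear, cloudy, and cirrus pixels.
pixels = [(2500, 0), (3100, 1024), (2800, 2048)]
print([clean_pixel(value, qa) for value, qa in pixels])
```

Only pixels with both flags clear survive the mask; the others are dropped before vegetation indices are computed.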

Compute platform
Google earth engine (GEE) (https://earthengine.google.com/) is a GPU/TPU-enabled, high-performance cloud computing platform for geospatial analysis of satellite image datasets. It assigns resources dynamically to cater to computation-intensive tasks and is available for research, academic, and non-commercial purposes. It provides an efficient way to handle the computationally intensive tasks of advanced image processing. Development is supported through the uncomplicated online interface of the GEE code editor, which enables users to develop, train, and test algorithms interactively and provides better visualization of analysis results. Various distributed technologies are also available to process this geospatial data [24].

Geographic area for study
In this work, Mumbai and Navi Mumbai, which are among India's megacities and form the business capital of Maharashtra, are considered for study. The study area covers 964 sq. km. Mumbai is located at a latitude of 19.076090 N and a longitude of 72.877426 E, on the west coast of India. It is a densely populated city with a population of approximately 12.5 million. Because Mumbai is densely populated and has a large industrial area, it is difficult to assess its tree canopy. For this reason, this research chooses Mumbai as the study area.

Experimental results
The study area measures approximately 963.78 km², which includes the Mumbai and Navi Mumbai area. Figure 2 shows the typical input and output of the system. The left block shows an input image with boundaries marked for the area of interest, Mumbai, and the right block shows the classified image, with green showing the trees in the area of interest and yellow showing the deforested area. The government can use this tool to assess the tree canopy of a particular area, so that areas can be selected for plantation to reduce pollution. The plot in Figure 3 shows the fifteen PCA coefficients on the x-axis and the percentage of explained variance on the y-axis. After the eigenvectors are found, the eigenvalues are sorted in descending order, so that the eigenvectors give the components in order of their significance. This plot helps decide the number of components useful for classification; it shows that the first three components are useful for classification. The average thematic accuracy for the given dataset ranges from 91.10% to 96.49%. This is compared with the accuracy of a research paper that uses an image fusion approach and also uses the GPU-enabled environment of GEE [25]. Figure 4 shows the Kappa coefficient and the overall accuracy, obtained using the confusion matrix method, for the method of optimal indices selection for tree canopy assessment. The seasons are selected so as to give efficient vegetation analysis and in line with the guidelines given on the GIM website [26]. The data is selected for three seasons, January to March, April to June, and October to December, and the analysis covers the four years from 2016 to 2019. The accuracy plot takes seasons on the x-axis and accuracy on the y-axis. Figure 4 shows that the overall accuracy ranges from 96.67% to 98.94%, which is attributed to the optimal number of indices selected and to PCA. Figure 5 shows the results of the classification process in assessing tree canopy. It is found that the combination of all the vegetation indices along with the PCA transformation gave higher classification accuracy.

CONCLUSION
This research shows experimentally that tree canopy detection in the Mumbai and Navi Mumbai area, based on an optimal combination of vegetation indices computed from the high spectral resolution images of the Sentinel 2 dataset, gives higher thematic accuracy. The study area was observed from 2016 to 2019. Season-wise analysis of the tree canopy plays an important role because environmental conditions are similar throughout a season, and it gives a higher overall accuracy. The thematic accuracy observed with the kappa coefficient for this technique is 96%, which is higher than the 90% thematic accuracy reported in recent literature. This study can be extended to produce time-efficient tree canopy assessments of the city. More advanced machine learning and deep learning techniques can be used to further improve the performance of the system. GIM can use this technology to assess the tree canopy in the city and decide on areas for plantation.