Massively scalable density based clustering (DBSCAN) on the HPCC Systems big data platform

Int J Artif Intell, Vol. 10, No. 1, March 2021: 207-214. ISSN: 2252-8938
Received Aug 1, 2020; Revised Jan 9, 2021; Accepted Feb 7, 2021

Dealing with large samples of unlabeled data is a key challenge in today's world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN, or density-based spatial clustering of applications with noise, is a well-known density-based clustering algorithm. Its key strengths lie in its ability to detect outliers and handle arbitrarily shaped clusters. However, the algorithm is fundamentally sequential, which makes it expensive and time consuming on very large datasets. This paper therefore presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The implementation seeks to fully parallelize DBSCAN by making use of the distributed architecture of HPCC Systems and by performing a tree-based union to merge local clusters. The proposed approach was tested on both synthetic and standard datasets (MFCCs Data Set) and was found to be completely accurate. Additionally, when compared against a single-node setup, a significant decrease in computation time was observed with no impact on accuracy. The parallelized algorithm performed eight times better for a higher number of data points and takes considerably less time than the serial version as the number of data points increases.


INTRODUCTION
Clustering, or the grouping of data into clusters, is one of the most fundamental techniques for dealing with large amounts of unlabeled data. It involves grouping the data into meaningful subclasses such that the inter-class distances are maximized and the intra-class distances are minimized. There are four major families of clustering algorithms, namely hierarchy-based, partitioning-based, density-based and grid-based [1]. DBSCAN, or density-based spatial clustering of applications with noise, falls under the category of density-based clustering algorithms. The algorithm is based on the premise that, for every data point in a cluster, its neighborhood within a given radius (eps) has to contain a minimum number of points (minpts). In other words, the density of the neighborhood of every such point should exceed a given threshold. When dealing with large samples of data, performing this clustering sequentially is time consuming and often expensive. This paper therefore presents a distributed parallel DBSCAN algorithm that overcomes these performance bottlenecks without any impact on accuracy. The proposed algorithm is implemented on the HPCC Systems platform [2].
HPCC Systems is an open-source, lightweight and powerful big data management platform, which serves as an end-to-end data lake management solution. Its key advantages arise from its scalability, performance and usability. The platform serves as an alternative to existing big data platforms such as Apache Hadoop, Apache Spark and DataTorrent RTS [3]. The platform is supported by its underlying programming language, ECL, which is implicitly parallel and declarative in nature and provides several constructs to simplify parallel compute operations. The general HPCC Systems architecture is shown in Figure 1. HPCC Systems is further supported by a pair of powerful data engines: Thor and Roxie. Thor serves as the data refinery engine and gives the user control over data transformations. It also facilitates optimal operational capacity on mixed-schema data. Roxie serves as the search engine that facilitates high-speed real-time queries through interfaces such as REST, SOAP and XML. It is responsible for greatly reducing the latency associated with querying.
The remainder of the paper is organized as follows. Section 2 analyses the existing state of the art for distributed parallel clustering algorithms. Section 3 details both the traditional DBSCAN algorithm and the novel parallel DBSCAN algorithm. Section 4 discusses the experiments carried out and the results obtained. Finally, section 5 provides a conclusion to the work as well as proposals for future work.

LITERATURE REVIEW
A study of the state of the art reveals several existing methods that seek to improve the performance of the DBSCAN algorithm. The authors of [4][5] showed that DBSCAN can be solved in O(n log n) time for Euclidean 2D spaces. However, this method cannot be directly applied to massive, higher-dimensional datasets distributed across multiple nodes. The authors of [6] presented an approximate DBSCAN that runs in O(n log n) time, and the authors of [7] achieved an approximate DBSCAN in O(n) time. However, none of these approaches address the parallelization required in big data applications.
Some other techniques make use of GPUs and various forms of parallelism to achieve the requisite level of performance. Other proposals to improve performance include the use of the R*-tree structure [8], the K-D tree algorithm [9], special data partitioning algorithms and the disjoint set data structure [10][11][12][13]. These techniques, however, are implemented on a single physical system. Because the number of cores and the memory of a single system are limited, the scalability and performance boost that can be achieved with them are limited as well.
When working with large real-world datasets, the use of a distributed system for clustering is more appropriate [14][15][16][17]. A popular strategy for implementing a distributed DBSCAN algorithm is the use of the disjoint set data structure. This approach was proposed in the PDSDBSCAN algorithm by Patwary et al. [18][19]. The algorithm uses a bottom-up structure to construct the clusters as a collection of hierarchical trees. Clustering is performed in two phases. In the first phase, local clustering is performed with the regular DBSCAN algorithm on each node. In the second phase, the global union of clusters across nodes is performed using the union-find operation. PDSDBSCAN was shown to significantly outperform the previous approaches used to parallelize DBSCAN. Speedups of up to a factor of 25.97 were achieved when using 40 cores on datasets containing several hundred million high-dimensional points. PDSDBSCAN is implemented using both OpenMP and MPI. An improvement to this algorithm was proposed by Hu et al. [20], wherein the authors made use of the distributed file system of the Kunpeng system. In this algorithm, data was distributed across different workers and a message passing interface (MPI) communication pattern was used to communicate the data between the workers. Further, the large overhead associated with MPI was overcome using a novel parameter server framework. PS-DBSCAN was shown to outperform the MPI-based PDSDBSCAN-D with a 2-10 times speedup in communication efficiency on both real-world and synthetic datasets, and the speedup was found to increase with the number of processor cores and the dataset scale.
In RP-DBSCAN [21] the authors made use of pseudo random partitioning together with a two-level cell dictionary. RP-DBSCAN was implemented using 48 cores (12 Azure machines) on an Apache Spark system. It was found to achieve an almost perfect load balance among data splits in local clustering and did not duplicate points among them. Hence RP-DBSCAN dramatically outperformed the state-of-the-art parallel DBSCAN algorithms by up to 180 times. Furthermore, only RP-DBSCAN was able to handle large data sets (362 GB), whereas the other algorithms could not.
Although the disjoint set data structure is popular for DBSCAN in distributed systems, several other methods, such as the convex-hull method, are used as well [22]. In the convex-hull method, a convex hull enclosing the points within a cluster is created locally on each node. In the merge operation, the convex hulls of the local clusters are merged based on the amount of overlap. However, this algorithm is not capable of handling concave shapes and hence does not correctly capture the essence of the underlying points. The authors of [23] propose a grid-based disjoint set algorithm for solving the problem efficiently. However, because the distribution of data among nodes by Thor is random, it is not possible to apply that algorithm directly. These existing algorithms are not a straightforward fit for the HPCC Systems big data platform and require modifications. Hence there is a need for a better algorithm that can exploit HPCC components such as Thor and Roxie in order to gain massive performance improvements during parallel execution of the DBSCAN algorithm.

METHODOLOGY
3.1. DBSCAN algorithm
The DBSCAN algorithm [24] is a non-parametric spatial clustering algorithm. The main idea is that, for each point of a cluster, the number of points within distance eps must be at least a certain minimum, i.e., a threshold density. The algorithm starts with a point x ∈ X and retrieves the points in its eps-neighborhood. If the retrieved neighborhood contains at least minpts points, a new cluster C is created. The algorithm then finds all points in X that are reachable from x through neighboring points and adds them to the cluster C. If the eps-neighborhood of x contains fewer than minpts points, x is marked as noise. The pseudo code is summarized in Algorithm 1.

Algorithm 1. DBSCAN algorithm
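As a minimal illustration of the sequential procedure described above, the following Python sketch clusters a small list of points. It is not the paper's Algorithm 1 listing: the function names `region_query` and `dbscan`, the use of label 0 for noise, and the brute-force Euclidean neighborhood search are assumptions made for this example.

```python
from math import dist

NOISE = 0  # label used for noise points in this sketch (an assumption)


def region_query(points, i, eps):
    """Return the indices of all points within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, minpts):
    """Label every point with a cluster number (1, 2, ...) or NOISE."""
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < minpts:
            labels[i] = NOISE              # may be relabeled later as a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # noise reached from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= minpts: # only core points expand the cluster
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, minpts=3)
```

Each call to `region_query` scans all points, which is what produces the quadratic cost of the unindexed algorithm.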
The time complexity of Algorithm 1 is O(n^2), where n is the number of points in X. If R*-trees or K-D trees are used to obtain the nearest neighbors, the time complexity is reduced to O(n log n). DBSCAN is a highly sequential algorithm, which makes it computationally inefficient when applied to large amounts of data, especially on big data platforms. Hence there is a need to parallelize the algorithm to achieve better efficiency in such big data systems. The following section gives the overall procedure for the same.

Parallel DBSCAN algorithm
The proposed methodology provides a scalable solution for DBSCAN clustering on the HPCC Systems big data platform. This is achieved by decomposing the clustering algorithm into three stages, namely spraying of data, local clustering, and global merging of results. Each of these stages is explained in the following sections.

Spraying of data
In this stage, the data points are assigned globally unique ids and distributed across the nodes of the cluster by the Thor engine. The distribution by the Thor engine ensures that the data points are spread evenly across all the nodes. Each node then sorts its data points by their unique ids and sends the data to the local clustering stage. These unique ids identify each data point and do not change until the end of all the stages. Note that the algorithm does not depend on how the data is distributed across the nodes.
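The spraying stage can be pictured with a minimal Python sketch. It assumes a simple round-robin placement as a stand-in for the distribution performed by Thor, whose actual placement strategy is not described here; the function name `spray` and the record layout `(id, point)` are illustrative only.

```python
def spray(points, num_nodes):
    """Assign global ids (1, 2, ...) and spread the records evenly over num_nodes partitions."""
    partitions = [[] for _ in range(num_nodes)]
    for gid, point in enumerate(points, start=1):
        # round-robin placement: an assumption standing in for Thor's distribution
        partitions[(gid - 1) % num_nodes].append((gid, point))
    for part in partitions:
        part.sort(key=lambda record: record[0])   # each node sorts its records by id
    return partitions


parts = spray([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)], num_nodes=2)
sizes = [len(p) for p in parts]
```

Because the later stages only rely on the ids being unique and stable, any other even placement could be substituted for the round-robin rule.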

Local clustering
In this phase, each node in the HPCC cluster performs DBSCAN clustering on the data points that are local to it. The local DBSCAN algorithm uses the disjoint set operations UNION and FIND. Each set (final cluster) is uniquely represented by its highest core point. The FIND operation identifies the parent, i.e., the highest core point, for each point (node) in the tree, and is summarized in Algorithm 2. The UNION operation merges two trees into a bigger tree based on Rem's algorithm. It is performed on a core point and another point, which may be a core or a non-core point, and is summarized in Algorithm 3.

Algorithm 2. Find operation
Algorithm 3. Union operation
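The two disjoint set operations can be sketched in Python as follows. This is a common formulation of Rem's algorithm with splicing, written so that the element with the highest id ends up as the root; whether the HPCC implementation matches it in detail is an assumption. The sketch also ignores the distinction between core and non-core points, and the `parent` dictionary is an illustrative data layout.

```python
def find(parent, x):
    """Follow parent pointers from x until the root of its tree is reached."""
    while parent[x] != x:
        x = parent[x]
    return x


def union(parent, x, y):
    """Merge the trees containing x and y (Rem's algorithm with splicing).

    Invariant: parent[v] >= v for every v, so each root carries the highest id.
    """
    rx, ry = x, y
    while parent[rx] != parent[ry]:
        if parent[rx] < parent[ry]:
            if rx == parent[rx]:          # rx is a root: hang it under ry's parent
                parent[rx] = parent[ry]
                return
            nxt = parent[rx]
            parent[rx] = parent[ry]       # splice rx's subtree into the other tree
            rx = nxt
        else:
            if ry == parent[ry]:          # ry is a root: hang it under rx's parent
                parent[ry] = parent[rx]
                return
            nxt = parent[ry]
            parent[ry] = parent[rx]
            ry = nxt


parent = {i: i for i in range(1, 6)}      # every point starts as its own parent
union(parent, 1, 2)
union(parent, 3, 4)
union(parent, 1, 3)
union(parent, 5, 1)
roots = {i: find(parent, i) for i in parent}
```

After the four unions all five points share the root 5, the highest id in the merged set.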
Initially, each data point is represented as a node containing its fields and a parent id. The parent of each data point is made to point to itself, i.e., p(x) = x, and all points are marked as non-core points. Each cluster is uniquely represented by a parent id (the highest core point of that cluster). The pseudo code for local DBSCAN clustering is given in Algorithm 4. Let X denote the set of data points on the local node and Y the set of data points on remote nodes. For each x ∈ X, the points of X ∪ Y that lie in the eps-neighborhood of x are found. The neighborhood can be computed with various distance metrics, such as the Euclidean, Manhattan and Chebyshev distance functions. If the number of points in the neighborhood is greater than the minimum number of points, x is marked as a core point. Each point y in the neighborhood of x can be either local to the node or on a remote node.
If y is on the local node, there are two cases: (a) y is a core point, or (b) y is a non-core point, i.e., a border point. In case (a), Union(x, y) is performed. In case (b), y is marked as visited and Union(x, y) is performed. If y is a remote point, it can likewise be either a core point or a non-core point. If y is a core point, Union(x, y) is performed and the process continues with the other points. If y is not yet known to be a core point, its neighborhood is found: if the neighborhood does not contain the minimum number of points, y is a border point; otherwise y is marked as a core point and Union(x, y) is performed.
At the end of the algorithm, noise points form singleton trees whose parent points to itself. A border point is assigned to the highest core point of the cluster of its nearest neighbor. A core point is assigned by the Union operation to the parent with the highest core point.
The time complexity of the local DBSCAN operation is O(n^2). The algorithm merges points into their clusters (trees) without requiring any particular ordering of the data points. The output mapping of data points to parent ids is passed to the next stage as a global stream of data points, so that each data point is returned one by one from Thor.
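The local stage can be approximated by the following Python sketch, which is not the paper's Algorithm 4. It makes several simplifying assumptions: the node can look up the coordinates of remote points through a shared `points` dictionary, a point is core when its neighborhood holds at least `minpts` points, the union simply links the lower root under the higher one, and a border point is attached to the first core point that reaches it rather than to its nearest neighbor. All names are illustrative.

```python
from math import dist

# distance functions named in the text
METRICS = {
    "euclidean": dist,
    "manhattan": lambda a, b: sum(abs(u - v) for u, v in zip(a, b)),
    "chebyshev": lambda a, b: max(abs(u - v) for u, v in zip(a, b)),
}


def find(parent, x):
    while parent[x] != x:
        x = parent[x]
    return x


def union(parent, x, y):
    """Link the two roots so that the higher id becomes the representative."""
    rx, ry = find(parent, x), find(parent, y)
    if rx != ry:
        parent[min(rx, ry)] = max(rx, ry)


def local_dbscan(local_ids, points, eps, minpts, metric="euclidean"):
    """Cluster the points owned by one node and return {point id: parent id}.

    points maps id -> coordinates for local and remote points (X and Y);
    local_ids is the subset of ids owned by this node.
    """
    d = METRICS[metric]
    cache = {}

    def neighbors(i):
        if i not in cache:                 # brute-force search, hence O(n^2) overall
            cache[i] = [j for j in points if d(points[i], points[j]) <= eps]
        return cache[i]

    parent = {i: i for i in points}        # p(x) = x initially
    for x in sorted(local_ids):
        if len(neighbors(x)) < minpts:
            continue                       # x is not core; it may still become a border point
        for y in neighbors(x):
            if len(neighbors(y)) >= minpts:
                union(parent, x, y)        # core to core: merge the two trees
            elif parent[y] == y:
                parent[y] = find(parent, x)  # unassigned border point becomes a leaf
    return {i: find(parent, i) for i in points}


points = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (10, 10),
          5: (10, 11), 6: (11, 10), 7: (50, 50)}
euclid = local_dbscan(set(points), points, eps=1.5, minpts=3)
manhattan = local_dbscan(set(points), points, eps=1.5, minpts=3, metric="manhattan")
```

With the Manhattan metric only points 1 and 4 are core, so they represent their clusters even though the attached border points have higher ids; the noise point 7 stays a singleton whose parent is itself.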

Global merge
In this phase, the trees formed in the previous stage are merged to obtain the global clusters. A merge of trees across nodes happens when a point belongs to more than one tree on different nodes. After the trees are merged, the final clusters are obtained, each represented by its highest core point across all nodes. This is achieved with the same union and find operations of the disjoint set structure described before. The pseudo code is elucidated in Algorithm 5. Initially, the output mapping of stage 2 (only the core points and the points local to each node), together with the non-outlier points, is distributed back among the nodes in order to find the ultimate id. Each data point has two attributes: a local parent id (local to the node) and a global parent id (the global or ultimate id). On each node, the parents of data points that reside on remote nodes are found using a local lookup table. When the highest core point of a cluster resides on another node, the trees are merged and the union operation updates the parent ids so that the cluster is represented by that highest core point. Hence, at the end of the algorithm, each cluster is represented by its highest core point as the cluster id. Since the entire computation is done in parallel, an increased speedup is achieved. After this, the clusters are renumbered by assigning 0 to the outliers found in stage two and 1-based indices to the remaining points, based on the order of their occurrence in the dataset. The final result is returned to ECL Watch for the user as
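The merging and renumbering steps of this stage can be illustrated with a short, single-process Python sketch; it is not the paper's Algorithm 5 and does not model the distribution of work across nodes. It assumes that each node reports a mapping from point id to local parent id and that the set of core point ids is known. Only core points are allowed to connect trees, so a border point shared by two clusters does not merge them; this rule and all names are assumptions of the example.

```python
def find(parent, x):
    while parent[x] != x:
        x = parent[x]
    return x


def union(parent, x, y):
    """Link the two roots so that the higher id becomes the representative."""
    rx, ry = find(parent, x), find(parent, y)
    if rx != ry:
        parent[min(rx, ry)] = max(rx, ry)


def global_merge(local_results, core_ids):
    """Merge per-node mappings {point id: local parent id} into final labels.

    Returns {point id: cluster number}, with 0 reserved for outliers.
    """
    ids = sorted({i for m in local_results for pair in m.items() for i in pair})
    parent = {i: i for i in ids}
    # a core point that appears in trees on different nodes connects those trees
    for mapping in local_results:
        for i, local_parent in mapping.items():
            if i in core_ids and local_parent != i:
                union(parent, i, local_parent)
    # a border point follows the first tree that claimed it, without merging trees
    for mapping in local_results:
        for i, local_parent in mapping.items():
            if i not in core_ids and local_parent != i and parent[i] == i:
                parent[i] = local_parent
    # renumber: 0 for outliers, then 1, 2, ... in order of first occurrence by id
    labels, numbering = {}, {}
    for i in ids:
        root = find(parent, i)
        if root == i and i not in core_ids:
            labels[i] = 0
        else:
            labels[i] = numbering.setdefault(root, len(numbering) + 1)
    return labels


node_a = {1: 3, 2: 3, 3: 3, 7: 7}
node_b = {3: 5, 4: 5, 5: 5, 6: 5, 8: 8}
merged = global_merge([node_a, node_b], core_ids={1, 2, 3, 4, 5})
```

In the example, core point 3 appears in a tree on both nodes, so the two local trees are merged into one cluster, point 6 joins it as a border point, and points 7 and 8 are labeled 0.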

EXPERIMENTS AND RESULTS
The parallel DBSCAN algorithm was implemented in Enterprise Control Language (ECL) on HPCC Systems. Each node in the cluster had 6 GB of RAM and a 128 GB hard disk. The processor used was an Intel Xeon with a 2.4 GHz clock frequency. Table 1 elucidates the processor specifications. In order to measure the performance of the parallel DBSCAN algorithm, single-node and multi-node clusters (consisting of two nodes and three nodes) were compared. Each dataset was run with different values of eps and minpts. A higher value of minpts increases the number of points classified as noise. The results are tabulated in Table 2.
The parallel DBSCAN implementation and the serial DBSCAN implementation were tested separately, comparing conventional single-node processing with a two-node HPCC cluster. The results were compared with the output of the sklearn implementation of DBSCAN to verify cluster accuracy. The cluster densities in both cases were found to be identical. The test set consisted of 6 test cases, including both synthetic datasets and a real dataset (frogs' MFCC dataset [25]). The test sets had up to 20 dimensions. Note also that the data is distributed uniformly among the nodes. The data points in a disjoint set (representing a cluster) may or may not reside on the same node. Hence the experiments do not assume any particular distribution of data points among the worker nodes, and no further redistribution is required at any stage of the algorithm. The results indicate a substantial improvement in the execution time of the parallel algorithm over the serial (single-node) version.
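Comparing the output against a reference implementation requires a check that is insensitive to how clusters are numbered. The following Python sketch shows one way to do this; it is an illustration rather than the procedure used in the experiments. It assumes label 0 marks outliers on one side, as in the renumbering described earlier, and -1 marks noise on the other side, which is the convention of sklearn's DBSCAN. The function name is hypothetical.

```python
def same_clustering(labels_a, labels_b, noise_a=0, noise_b=-1):
    """Return True if both label lists describe the same clusters up to renaming."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        if (a == noise_a) != (b == noise_b):
            return False                   # only one side treats the point as noise
        if a == noise_a:
            continue
        # cluster ids must map one-to-one in both directions
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


ours = [1, 1, 2, 2, 0]
reference = [0, 0, 1, 1, -1]
match = same_clustering(ours, reference)
```

A border point within reach of two clusters may legitimately be assigned to either one depending on processing order, so a strict check like this can report a mismatch for such points even when both results are valid.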
The tabulated results are depicted graphically in Figure 2. For a small number of points, adding nodes yields no significant improvement in execution time. The execution time is in fact slightly higher, because the time required to distribute the data points is not offset by the time gained through parallelization. For a larger number of data points, however, a substantial speedup is obtained as the number of nodes increases, and the multi-node setup outperforms the single-node setup. As the number of data points increases, the parallel algorithm performs considerably better than its serial counterpart (4x better for 30000 points in the comparison with the two-node cluster). The code was merged into the HPCC Systems repository and can be found on GitHub.
Figure 2. Comparison between serial DBSCAN and parallel DBSCAN on HPCC Systems

CONCLUSION
As the number of big data applications has increased in recent years, there is a need for big data platforms that enable faster data processing, and the HPCC Systems platform is a suitable alternative. The platform supports cross-platform development in languages such as C++ and Python, which makes it possible to develop applications at a faster pace. Its Thor and Roxie components enable faster data ingestion and data querying across multiple nodes, which makes the platform efficient for implementing machine learning algorithms. The platform also parallelizes sequential algorithms across multiple nodes efficiently. This makes it possible for highly sequential and computation-intensive algorithms like DBSCAN to be implemented at scale. The experiments highlight the fact that the platform can divide the computation across nodes efficiently. In the future, the DBSCAN functionality on HPCC Systems can be used in various data analytics applications, especially in areas such as geospatial clustering.