A fuzzy neighborhood rough set method for anomaly detection in large scale data

ABSTRACT


INTRODUCTION
Outliers are unusual, unexpected patterns in the observed world. They exist extensively in real-world data and arise from different sources, such as a heavy-tailed distribution or errors in data entry. While there is no single, generally accepted, formal definition of an outlier, Hawkins' definition captures the spirit: "an outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [1]. Anomaly detection is an important problem that has been studied in diverse research areas and application domains such as fraud detection [2], intrusion detection [3], video surveillance, pharmaceutical testing, and weather prediction. Several surveys cover classical outlier and anomaly detection; the approaches vary between density-based [3], statistical [4], distance-based [5], and neural network and machine learning techniques.
Recent research on outlier detection has focused on examining the nearest-neighbor structure of a data object to measure its degree of outlierness [6][7]. Such techniques rest on the key assumption that normal data instances occur in dense neighborhoods, while outliers occur far away from their closest neighbors [8]. Popular outlier detection methods require the pairwise comparison of objects to compute nearest neighbors. This quadratic cost does not scale to large data sets, so outlier detection for large-scale data remains an open challenge. This paper proposes a fast outlier detection method for large-scale datasets that consists of two steps: a granulation of the universe into parts with the same properties, then the computation of a degree of outlierness, called the Fuzzy neighborhood rough set outlier factor (FNROF), for each granule formed. Granulation of the observable universe involves grouping similar elements into granules. With granulated views, we deal with approximations of concepts, represented by subsets of the universe, in terms of granules [9]. The remainder of this paper is organized as follows. In the next section, we present the preliminaries of rough set theory relevant to this paper and discuss the granularity of knowledge in connection with rough and fuzzy sets. In Section 3, we propose an efficient parallel computing scheme based on MapReduce to improve computation speed, together with the proposed algorithm for more complex outlier detection problems on large-scale data.

ROUGH SETS (RST)
Rough set theory (RST) [10][11] is a mathematical approach to imperfect knowledge. The theory has attracted the attention of many researchers and practitioners all over the world, who have contributed essentially to its development and applications. The main advantage of rough set theory in data analysis is that it does not need any preliminary or additional information about the data. Rough set theory is a popular and powerful machine learning tool, especially suitable for dealing with information systems that exhibit inconsistencies. In rough set theory, an information table is defined as a tuple T = (U, A), where U and A are two finite, non-empty sets, with U the universe of primitive objects and A the set of attributes. Each attribute or feature a ∈ A is associated with a set Va of its values, called the domain of a. We may partition the attribute set A into two subsets C and D, called condition and decision attributes, respectively. Let P ⊆ A be a subset of attributes. The indiscernibility relation is denoted by: IND(P) = {(x, y) ∈ U × U : ∀a ∈ P, a(x) = a(y)}, where a(x) denotes the value of feature a for object x. If (x, y) ∈ IND(P), x and y are said to be indiscernible with respect to P. The family of all equivalence classes of IND(P), referring to the partition of U determined by P, is denoted by U/IND(P); each element of U/IND(P) is a set of objects indiscernible with respect to P. For any concept X ⊆ U, X can be approximated by its P-lower and P-upper approximations using the knowledge in P. The lower approximation of X is the set of objects of U that are surely in X: P(X) = {x ∈ U : [x]P ⊆ X}. The upper approximation of X is the set of objects of U that are possibly in X, defined as: P̄(X) = {x ∈ U : [x]P ∩ X ≠ ∅}. The set of objects that can possibly, but not certainly, be classified in a specific way is called the boundary region, which is defined as: BNP(X) = P̄(X) − P(X), as shown in Figure 1.
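The lower and upper approximations above can be computed directly from the equivalence classes of IND(P). The following is a minimal Python sketch; the toy information table and the `value` accessor are invented for illustration, not taken from the paper:

```python
from collections import defaultdict

def partition(U, P, value):
    """Group objects of U into equivalence classes of IND(P):
    two objects are indiscernible if they agree on every attribute in P."""
    classes = defaultdict(list)
    for x in U:
        key = tuple(value(x, a) for a in P)
        classes[key].append(x)
    return list(classes.values())

def approximations(U, P, value, X):
    """Return the P-lower and P-upper approximations of a concept X subset of U."""
    X = set(X)
    lower, upper = set(), set()
    for cls in partition(U, P, value):
        cls_set = set(cls)
        if cls_set <= X:      # class lies entirely inside X -> certainly in X
            lower |= cls_set
        if cls_set & X:       # class intersects X -> possibly in X
            upper |= cls_set
    return lower, upper

# Toy information table: four objects described by a single attribute 'f'.
U = [0, 1, 2, 3]
vals = {0: 'a', 1: 'a', 2: 'b', 3: 'c'}
value = lambda x, a: vals[x]
lo, up = approximations(U, ['f'], value, X={0, 2})
# lower = {2}: only the class {2} lies entirely inside X
# upper = {0, 1, 2}: the classes {0, 1} and {2} both intersect X
```

The boundary region BNP(X) is then simply `up - lo`.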

Rough set and fuzzy discretization
The extraction of knowledge from a huge volume of data using rough set methods requires the transformation of continuous-valued attributes into discrete intervals, in order to form a grid structure and then form clusters from the cells of the grid. Clusters correspond to regions that are denser in data points than their surroundings. The great advantage of grid-based clustering is a significant reduction in time complexity, especially for very large data sets. It is well known that one of the premises of classical rough set theory is that the information, or the data, be discrete. Discretization can be viewed as a data reduction technique that reduces the range of values of a continuous-valued attribute to a minimum number of discrete intervals. The number of cut-points determines the level of data reduction: the fewer the cut-points, the more the data will be reduced, and hence a more general classifier becomes possible. The term "cut-point" (also known as split-point) refers to a real value within the range of continuous values that divides the range into intervals. However, if the discretization is too coarse, much useful information may be lost; if it is too fine, the time complexity becomes high. The disadvantages of classical rough sets are therefore a strong dependence on the quality of the discretization method and a limited application domain. Let D = (x1, x2, ..., xn) be a dataset with n objects and A attributes, and let v_min,j = min(x_ij) and v_max,j = max(x_ij) be the minimum and maximum values of attribute j. The range [v_min,j, v_max,j] of each attribute j is equally divided into M intervals of width w_j = (v_max,j − v_min,j)/M. The set of all initial intervals of attribute j is then: {[v_min,j + (k − 1)·w_j, v_min,j + k·w_j] : k = 1, ..., M}.
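The equal-width discretization just described can be sketched in a few lines of Python; the function names and the sample temperature values are illustrative assumptions:

```python
def equal_width_cuts(values, M):
    """Cut-points dividing [v_min, v_max] of a continuous attribute
    into M intervals of equal width w = (v_max - v_min) / M."""
    v_min, v_max = min(values), max(values)
    w = (v_max - v_min) / M
    return [v_min + i * w for i in range(1, M)]

def assign_interval(v, values, M):
    """Index (0..M-1) of the interval that value v falls into."""
    v_min, v_max = min(values), max(values)
    if v >= v_max:               # the right edge belongs to the last interval
        return M - 1
    return int((v - v_min) * M / (v_max - v_min))

# Illustrative attribute values and M = 4 intervals.
temps = [10.0, 12.5, 15.0, 20.0, 30.0]
cuts = equal_width_cuts(temps, M=4)   # [15.0, 20.0, 25.0]
```

Each object is then mapped to the tuple of its interval indices, which defines the grid cell (granule) it belongs to.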

Fuzzy rough sets
Fuzzy rough set theory extends rough set theory to data with continuous attributes and detects degrees of inconsistency in the data. Key to this is turning the indiscernibility relation into a gradual relation. The fuzzy set is fundamentally broader than the classical, or crisp, set: the classical set admits only two degrees of membership, '0' or '1', whereas a fuzzy set admits a whole range of membership degrees, as shown in Figure 2.
Definition 1 (Fuzzy set): A fuzzy set F defined over a universe X is characterized by a function μF : X → [0, 1]. The function μF(x) is called the membership function, which maps each object x to the membership space. The rough membership function expresses the conditional probability that x belongs to X given P, and can be interpreted as the degree to which x belongs to X. One of the most important concepts in fuzzy set theory and its applications is the α-cut decomposition theorem, developed by Zadeh in 1971 under the name resolution identity. An α-cut Fα = {x ∈ X : μF(x) ≥ α} is a crisp set associated with a level α that represents a distinct grade of membership.
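The α-cut can be illustrated with a short Python sketch; the fuzzy set `warm` and its membership grades below are invented for illustration:

```python
def alpha_cut(fuzzy_set, alpha, strict=False):
    """Crisp set of elements whose membership grade reaches level alpha.
    fuzzy_set maps each element x to its membership mu(x) in [0, 1]."""
    if strict:   # strong alpha-cut: mu(x) > alpha
        return {x for x, mu in fuzzy_set.items() if mu > alpha}
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}

# Hypothetical fuzzy set "warm" over a few temperature readings.
warm = {15: 0.2, 20: 0.6, 25: 0.9, 30: 1.0}
mid = alpha_cut(warm, 0.5)    # {20, 25, 30}
core = alpha_cut(warm, 1.0)   # {30}
```

Raising α shrinks the crisp set toward the core of the fuzzy set, which is exactly the family of nested crisp sets the decomposition theorem works with.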

Rough sets: neighborhood systems
The concept of information granulation was first introduced by Zadeh in the context of fuzzy sets in 1979 [12]. The basic ideas of information granulation have appeared in many fields, such as interval analysis, quantization, and rough set theory. There is fast-growing and renewed interest in the study of information granulation and computation under the term Granular Computing (GrC) [13]. Granulation of a universe involves the grouping of similar elements into parts, or of individual elements or objects into a family of disjoint subsets, based on available information and knowledge. The combination of topological spaces and rough sets and the properties of topological rough spaces are discussed in [14], which used neighborhood systems and topological concepts in the study of approximations. A neighborhood system is a mathematical structure of granular computing used to model granules, and it can be used to compute the structure of granules and/or the relations between granules. A neighborhood system at a point is a framework to capture the concept of "near" objects, and any subset of objects can be approximated by a set of neighborhoods. A neighborhood system defines a set of binary relations, and a set of binary relations can be used to define a neighborhood system. Definition 3 (neighborhood of an object xi): Given an arbitrary xi ∈ U and P ⊆ A, the neighborhood δ(xi) of xi in the feature space P is defined as: δ(xi) = {x ∈ U : ΔP(x, xi) ≤ δ}, where ΔP : U × U → R+ is a distance (similarity) function and R+ is the set of non-negative real numbers. The neighborhood information granule includes the object xi, and the size of the neighborhood depends on the threshold δ.
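Definition 3 can be sketched directly; the following minimal Python illustration assumes the Euclidean metric for ΔP, and the point set is invented:

```python
import math

def neighborhood(xi, U, delta, dist=math.dist):
    """delta-neighborhood of xi: all objects within distance delta of xi.
    xi itself is always included, since dist(xi, xi) = 0 <= delta."""
    return [x for x in U if dist(xi, x) <= delta]

# Illustrative 2-D points.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
near = neighborhood((0.0, 0.0), points, delta=1.0)
# the first two points fall inside the neighborhood; (3, 4) is too far
```

Growing δ enlarges the granule; shrinking δ toward 0 collapses it to the classical equivalence class of indiscernible objects.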
For each value of δ ∈ R+, we propose the neighborhood system Ns as the collection of all neighborhoods of x ∈ U: Ns = {δ(x) : x ∈ U}, where s is a sliding window for overlapping computation, with s < M. Theorem 1: For any feature subsets P1 ⊆ A and P2 ⊆ A, δP(·) is a neighborhood relation induced in the feature subspace P.
We have the following. Given a set of objects U and a neighborhood system Ns over U, we call <U, Ns> a neighborhood approximation space. The lower and upper approximations (NX, N̄X) of X in <U, Ns> are defined as: NX = {xi ∈ U : δ(xi) ⊆ X} and N̄X = {xi ∈ U : δ(xi) ∩ X ≠ ∅}. Obviously, NX ⊆ X ⊆ N̄X. The boundary region of X in the approximation space is defined as: BN(X) = N̄X − NX. The size of the boundary region reflects the degree of roughness of the set X in the approximation space <U, Ns>. Assuming X is a sample subset with a decision label, we generally hope the boundary region of the decision is as small as possible, to decrease the uncertainty of the decision. The size of the boundary region depends on X and on the attributes used to describe U. For a fixed pair of numbers (α0, α1) ∈ [0, 1] × [0, 1], we obtain a submodel in which a crisp set is approximated in a crisp approximation space ℜα0 = (U, ℜα0); the result is a rough set (ℜα0(X), ℜ̄α0(X)) with the reference set X. Each granule in a fuzzy set is a neighborhood of an element of the universe. The approximation is shown in Figure 3: in this case, the subset X1 (lower approximation) contains two clusters, C1 (grid 2) and C2 (grid 3).
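The neighborhood lower and upper approximations can be sketched as a toy Python illustration, again assuming the Euclidean metric; the points and the concept X are invented:

```python
import math

def nbhd(xi, U, delta):
    """delta-neighborhood of xi under the Euclidean distance."""
    return {x for x in U if math.dist(xi, x) <= delta}

def nbhd_approximations(U, X, delta):
    """Lower/upper approximations of X in the neighborhood space <U, Ns>."""
    X = set(X)
    lower = {x for x in U if nbhd(x, U, delta) <= X}  # whole neighborhood inside X
    upper = {x for x in U if nbhd(x, U, delta) & X}   # neighborhood touches X
    return lower, upper

# Three points on a line; X holds the two leftmost ones.
U = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
X = {(0.0, 0.0), (1.0, 0.0)}
lower, upper = nbhd_approximations(U, X, delta=1.0)
# lower = {(0, 0)}; upper = all three points; BN(X) = upper - lower
```

The middle point lands in the boundary region: its neighborhood overlaps X without being contained in it, which is exactly the roughness the boundary measures.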

Figure 3. Fuzzy rough set approximation
The root grid Grid0, with the coarsest granularity, covers the entire dataset; it contains one sub-grid, Grid1, at level 1, which in turn contains two sub-grids at level 2.

Fuzzy neighborhood rough set outlier factor (FNROF)
In this paper, a new method for ranking outliers is proposed based on fuzzy rough sets, denoted the "Fuzzy neighborhood rough set outlier factor" (FNROF). After dividing each dimension into M intervals of equal length, the density distribution of each cell (information granule) can be defined as the ratio of its density to the average density of its k neighboring cells.
Formally, FNROF(di) = dens(di) / ((1/k) Σ_{c ∈ Nk(di)} dens(c)), where Nk(di) is the set of k neighboring cells of di. A normalized score of FNROF is derived from this ratio; it is viewed as the relative density measure of di with respect to the density of its surrounding neighbor cells. When the probability is uniformly distributed, we are most uncertain about the outcome, and the entropy (score) is highest. On the other hand, when the data points have a highly concentrated probability mass function, the variable is likely to fall within a small set of outcomes, so the uncertainty and the entropy (score) are low. The size of the interval must be carefully selected. If the interval size is too small, there will be many cells, and the average number of points in each cell may be too small. On the other hand, if the interval size is too large, we may not be able to capture the differences in density in different regions of the space. Unfortunately, without knowing the distribution of the data set, it is difficult to estimate the minimal average number of points required in each cell to obtain a correct result. Definition 4: Density-reachable. A cell p is density-reachable from a cell q; that is, q is a core cell and p is in its neighborhood.
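The density-ratio score described above can be sketched as follows. The grid, the per-cell densities, and the neighbor lists below are invented for illustration, and the normalization step is omitted:

```python
def fnrof_scores(cell_density, neighbors):
    """Outlierness score per cell: ratio of the cell's density to the
    average density of its neighboring cells (low ratio -> more outlying)."""
    scores = {}
    for c, dens in cell_density.items():
        nb = neighbors[c]
        avg = sum(cell_density[n] for n in nb) / len(nb)
        scores[c] = dens / avg if avg > 0 else float('inf')
    return scores

# Toy 1-D grid of 4 cells; each cell's neighbors are its adjacent cells.
density = {0: 10, 1: 9, 2: 1, 3: 10}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = fnrof_scores(density, neighbors)
# cell 2 gets the lowest score, flagging it as the most outlying granule
```

Every object then inherits the score of the cell it falls into, which is what makes the method granule-level rather than pairwise and avoids the quadratic neighbor search.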
Definition 5: Density-connected. A cell p is density-connected to a cell q if there is a cell o such that both p and q are density-reachable from o, as shown in Figure 4.
Figure 4. The concept of density-reachability and density-connectivity used to form clusters as contiguous dense regions in the lower approximation
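Forming clusters as contiguous dense regions, as in Figure 4, amounts to grouping density-connected cells. A minimal sketch follows; the set of dense cells and the cell adjacency are hypothetical:

```python
from collections import deque

def density_connected_clusters(dense_cells, neighbors):
    """Group density-connected dense cells into clusters: starting from a
    dense cell, repeatedly absorb dense neighbors (transitive reachability)."""
    dense = set(dense_cells)
    seen, clusters = set(), []
    for c in dense:
        if c in seen:
            continue
        cluster, queue = set(), deque([c])
        while queue:
            cur = queue.popleft()
            if cur in seen:
                continue
            seen.add(cur)
            cluster.add(cur)
            queue.extend(n for n in neighbors[cur] if n in dense)
        clusters.append(cluster)
    return clusters

# Toy 1-D chain of cells 0..5; cell 2 and cell 5 are not dense,
# so the dense cells split into two density-connected clusters.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clusters = density_connected_clusters({0, 1, 3, 4}, adjacency)
```

The non-dense cell 2 breaks the chain, so {0, 1} and {3, 4} come out as separate contiguous dense regions.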

A novel approach: A high-performance parallel and distributed computation using mapreduce
To compute an optimal set of cut-points, most discretization algorithms perform an iterative search in the space of candidate discretizations, using various scoring functions to evaluate a discretization, which takes a great deal of time. In this paper, we propose a parallel discretization process based on MapReduce using a sliding grid. A sliding grid is specified by defining its range M and its slide s: the range M is the discretization interval, while the slide s specifies the portion of the grid that is moved forward. A sliding window is thus specified as a tuple (M, s). A smooth sliding specification, where the slide s is small relative to the range M (s < M), is highly desirable. The proposed MapReduce-based algorithm, computed for each node i, is a parallel process consisting of three steps, map, shuffle, and reduce, as shown in Figure 5. In the map phase, for each grid given by the tuple (M, s), the algorithm generates a list of pairs (key = cell identifier, value = score of the cell). In the shuffle phase, the output pairs are partitioned and transferred to the reducers. In the reduce phase, pairs with the same key are grouped together as (key, list(values)), as shown in Figure 6.
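The three phases can be sketched on a single machine as follows. The cell identifiers, the toy `score` function, and the mean-combining reducer are assumptions for illustration, not the paper's exact implementation:

```python
from collections import defaultdict

def map_phase(grids, score):
    """Map: emit (cell_id, score) pairs for the cells of each grid window."""
    pairs = []
    for grid in grids:
        for cell_id, points in grid.items():
            pairs.append((cell_id, score(points)))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the scores of each cell (here, the mean over windows)."""
    return {k: sum(v) / len(v) for k, v in grouped.items()}

# Two overlapping windows (slide s < M) score the shared cell 'c1' twice.
score = len                      # toy score: number of points in the cell
grids = [{'c0': [1, 2], 'c1': [3]}, {'c1': [3, 4], 'c2': [5]}]
out = reduce_phase(shuffle_phase(map_phase(grids, score)))
# out == {'c0': 2.0, 'c1': 1.5, 'c2': 1.0}
```

Because cells scored by overlapping windows share a key, the reducer is where the s < M overlap is reconciled; on a real cluster each phase runs in parallel across worker nodes.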

EXPERIMENTS AND RESULTS
The proposed algorithm is tested with synthetic data and with real data collected from the NOAA center. This work was implemented in R using RStudio. NOAA dataset [15]: The National Climatic Data Center (NOAA) collects a wide range of data, including sensor streams with temporal information, sensor spatial information, temperature, etc.

Improvement in search time efficiency
The purpose of this experiment was to compare the performance of the proposed algorithm, MR-FNROF, with the original LOF algorithm in terms of matching detected outliers and execution time. Comparing the performance of the two methods shows that our method has a very fast processing time with an acceptable trade-off in errors, as shown in Table 1.

Performance of MR-FNROF according to the number of worker nodes
The second experiment shows that the risk of Type I and Type II errors is reduced by increasing the number of worker nodes, as shown in Figure 7. With a high number of worker nodes, more outliers are detected in the upper approximation rough set (fewer Type II errors).

CONCLUSION
The aim of this paper is to propose a new outlier detection algorithm that reduces the required computation time by using granular computing and fuzzy rough set theory. The MR-FNROF algorithm divides the universe into a small number of granules and calculates the outlierness factor of each granule. To examine the effectiveness of the proposed method, several experiments with different parameters were conducted. The proposed method, MR-FNROF, demonstrated a significant reduction in computation time. Moreover, it can also be used effectively for real-time outlier detection.