Parallel processing using big data and machine learning techniques for intrusion detection

Received May 1, 2020; Revised Jun 22, 2020; Accepted Jul 9, 2020

Information technology is now used in every domain of life. Many devices produce data and transfer it across the network. These transfers are not always secured and can carry new threats and attacks that are invisible to current security tools. Moreover, the volume and variety of the exchanged data make intrusions harder to identify within an acceptable detection time. To address these issues, this paper proposes a new approach that stores the large volume and variety of network traffic data using big data techniques and analyzes it with machine learning algorithms in a distributed and parallel way, in order to detect new hidden intrusions with less processing time. In our experiments, the detection accuracy of the machine learning methods reaches up to 99.9%, and their processing time is reduced considerably when they are applied in a parallel and distributed way, which indicates that the proposed model is effective for detecting new hidden intrusions.


INTRODUCTION
Nowadays, information technology is employed in all areas of life (finance, education, weather, etc.). Various equipment, namely computers, servers, tablets, and other devices, continuously produces data and exchanges it through the network. However, these exchanges are not always secured, and they can contain new hidden attacks. Because existing security tools and strategies rely on predefined methods and algorithms to identify intrusions, they are unable to detect new threats. This motivates the search for new methods and techniques that can evolve to uncover new threats.
In addition, the data passing through the network is very large and can be of several types, which makes it difficult for current security devices to detect threats in time. With the fast growth of generated data such as videos, sounds, and emails in all sectors, older data management tools have become obsolete: they can neither store nor manage this volume of data. As a consequence, a new concept called Big Data was conceived to define new rules for the management and storage of this mass of data. A collection of classification methods from the machine learning domain has also appeared recently and has been associated with Big Data in order to extract hidden information from it.

Collector
The collector is a traffic listener: software installed on a machine in the network that listens to, captures, and saves the traffic passing through the network. The captured traffic is stored on that same machine so that the ETL can load it into the big data cluster.

Extract transform load (ETL)
An ETL is software that extracts data from a source, transforms it, and then loads it into a destination [13]. It is installed on the same machine as the collector and is responsible for loading the traffic captured by the collector into the big data cluster.
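The extract, transform, and load stages described above can be illustrated with a minimal sketch. It assumes the collector writes comma-separated records to a text stream; the function names, the sample records, and the in-memory list standing in for the big data cluster are all hypothetical and are not taken from the paper's implementation.

```python
import csv
import io


def extract(source):
    """Yield raw rows from a CSV-like text stream written by the collector."""
    yield from csv.reader(source)


def _to_number(field):
    """Return the field as a float when it looks numeric, else unchanged."""
    try:
        return float(field)
    except ValueError:
        return field


def transform(rows):
    """Skip empty rows and cast numeric-looking fields to float."""
    for row in rows:
        if not row:
            continue
        yield [_to_number(field.strip()) for field in row]


def load(records, destination):
    """Append records to the destination and return how many were loaded.

    The destination is a plain list here (assumption); in the paper's
    architecture it would be the big data cluster.
    """
    count = 0
    for record in records:
        destination.append(record)
        count += 1
    return count


# Hypothetical captured traffic: duration, protocol, service, src_bytes.
captured = io.StringIO("0,tcp,http,181\n\n2,udp,domain_u,45\n")
cluster_store = []
loaded = load(transform(extract(captured)), cluster_store)
print(loaded, cluster_store)
```

Chaining generators keeps only one record in memory at a time, which is one plausible design for a loader that must keep up with continuous traffic.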

Big data cluster
Because of the large volume and variety of traffic data exchanged continuously between the local network and the Internet, we set up a big data cluster. The two most widely used big data management frameworks are Hadoop [14] and Spark [8]. Hadoop is composed of two components: the Hadoop distributed file system (HDFS), which stores the data, and MapReduce [15], which handles the distributed processing of the data. We used Hadoop because it offers stronger data security than Spark [16].
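To make the MapReduce processing model concrete, the following single-process sketch counts traffic records per class label. It only imitates the map, shuffle, and reduce phases in plain Python; it does not use the Hadoop API, and the record layout (label as the last field) and the nested lists standing in for HDFS blocks are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Emit a (label, 1) pair for one record; the label is the last field."""
    return (record[-1], 1)


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts emitted for one key."""
    return key, reduce(lambda a, b: a + b, values)


def count_labels(blocks):
    """Count records per label.

    blocks is a list of record lists; each inner list stands in for one
    HDFS block that a separate mapper would process (assumption).
    """
    pairs = [map_phase(record) for block in blocks for record in block]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


blocks = [
    [("tcp", "normal"), ("tcp", "neptune")],
    [("udp", "normal")],
]
print(count_labels(blocks))
```

In a real cluster the mappers run on the nodes that hold each block, so adding nodes increases the number of blocks processed at the same time.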

Analysis machine
Due to the large volume and variety of data that can be collected across the network, it has become difficult to process it with older security analysis methods and tools [17]. In contrast, machine learning methods are able to extract information hidden in this volume and variety of data [18], which is why we use them to process network traffic. The analysis machine is therefore another machine on the local network, on which we installed software that launches machine learning algorithms in order to process the data already stored in the big data cluster.
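The idea of applying a classifier to stored data in parallel can be sketched as follows. This is only an illustration under stated assumptions: a thread pool on one machine stands in for the cluster nodes, and a hypothetical byte-count threshold stands in for a trained machine learning model. Neither is the paper's actual setup.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_record(record):
    """Toy stand-in for a trained classifier (hypothetical threshold rule)."""
    return "attack" if record["src_bytes"] > 1000 else "normal"


def classify_partition(partition):
    """Classify every record of one partition."""
    return [classify_record(record) for record in partition]


def split(records, n_parts):
    """Split records into n_parts contiguous partitions of near-equal size."""
    size, extra = divmod(len(records), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def classify_parallel(records, n_workers=4):
    """Classify partitions concurrently and return labels in input order."""
    partitions = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves the order of the partitions.
        results = list(pool.map(classify_partition, partitions))
    return [label for part in results for label in part]


records = [{"src_bytes": b} for b in (10, 5000, 20, 3000, 1)]
print(classify_parallel(records, n_workers=2))
```

Because each partition is classified independently, the result is identical to a sequential pass; only the elapsed time depends on the number of workers.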

EXPERIMENTAL ENVIRONMENT
In this section, we present the analysis methods, the data chosen for the experiments, the validation method, the evaluation metrics, and the work environment.

Analysis methods
Many machine learning methods exist, and testing them all is impractical, so we tested only the best known and most widely used: support vector machine (SVM) [19], K-nearest neighbors (KNN) [20], and decision tree [21].
- Support vector machine (SVM): a margin-based machine learning method intended for binary and multi-class classification problems; it requires few samples and achieves good results [22].
- K-nearest neighbors (KNN): an effective machine learning method applied to classification and regression problems. To estimate the output associated with a new input X, KNN takes into account the K training samples whose inputs are closest to X [23].
- Decision tree: a decision-making and classification method in which the possible decisions are located at the terminal nodes (the leaves of the tree) and are obtained according to the decisions reached at each stage [24].
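As an illustration of the KNN rule described above, the following minimal sketch predicts a label by majority vote among the K nearest training samples. Euclidean distance, K = 3, and the two-feature toy data are assumptions chosen for the example; the paper does not state the distance or the value of K it used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest samples.

    Uses Euclidean distance (assumption); requires Python 3.8+ for math.dist.
    """
    distances = sorted(
        (math.dist(sample, x), label)
        for sample, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two well-separated groups.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```

Every prediction scans the whole training set, which is one reason KNN is comparatively expensive on large datasets.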

Dataset
To evaluate our approach, we chose the well-known NSL KDD dataset [4], which is an improved version of KDD Cup 99 [3]. NSL KDD gathers, without redundancy, network traffic data from a military environment. It is composed of normal records and attack records of the following types:
- DoS (Denial-of-Service): makes a service unavailable.
- Probe: tries to disclose information about a network and find system vulnerabilities.
- U2R (User to Root): exploits vulnerabilities in the system to obtain super user privileges.
- R2L (Remote to Local): attacks a machine remotely and exploits vulnerabilities to obtain protected information.
Tables 1-3 give the number of records of each type. Table 1 shows the distribution of the dataset in two classes, Table 2 shows the distribution in five classes, and Table 3 shows the distribution in twenty-three classes.
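The relation between the twenty-three-class, five-class, and two-class views of the dataset can be sketched as a label mapping. The grouping below is the one commonly used for the NSL KDD training set and is given here as an assumption, since the paper's Tables 1-3 are not reproduced in this text; the function names are hypothetical.

```python
# Commonly used grouping of the 22 NSL KDD training-set attack labels
# into four categories (assumption; verify against the dataset in use).
ATTACK_CATEGORY = {
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "nmap": "Probe",
    "portsweep": "Probe", "satan": "Probe",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L",
    "multihop": "R2L", "phf": "R2L", "spy": "R2L",
    "warezclient": "R2L", "warezmaster": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "perl": "U2R", "rootkit": "U2R",
}


def to_five_class(label):
    """Map a 23-class label to normal, DoS, Probe, R2L, or U2R."""
    return "normal" if label == "normal" else ATTACK_CATEGORY[label]


def to_two_class(label):
    """Map a 23-class label to normal or attack."""
    return "normal" if label == "normal" else "attack"


for label in ("normal", "neptune", "rootkit"):
    print(label, to_five_class(label), to_two_class(label))
```

An unknown label raises a KeyError in to_five_class, which makes it obvious when a record carries an attack name that the mapping does not cover.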

Validation method
To assess our model, we chose the cross-validation method, a technique that assesses the detection capacity of a classifier by dividing the dataset into two subsets: the training subset and the test subset. First, the classifier is trained on the training subset; second, it is applied to the test subset in order to measure its degree of success. The process is repeated N times independently, and the average of the N performances is returned. The strength of this technique is that all the data is used for both training and testing, which makes the assessment more precise. We employed 5-fold cross-validation to assess our approach: with a larger N, each subset would contain fewer records of rare attack types such as R2L and U2R, and these might be neglected during processing [25].
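The cross-validation procedure described above can be sketched as follows. This is a plain (non-stratified) N-fold split under stated assumptions: the helper names, the fixed random seed, and the trivial majority-class model used in the usage example are hypothetical and are not the paper's classifiers.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(X, y, fit, score, k=5):
    """Train and score on each fold, then return the average score."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / len(scores)


# Usage with a trivial majority-class "model" (for illustration only).
def fit_majority(X, y):
    return max(set(y), key=y.count)


def accuracy(model, X, y):
    return sum(1 for label in y if label == model) / len(y)


X = [[i] for i in range(10)]
y = ["normal"] * 10
print(cross_validate(X, y, fit_majority, accuracy, k=5))
```

Each record appears in exactly one test fold, so all the data is used for both training and testing across the N rounds. With rare classes such as U2R, a stratified split that preserves class proportions in every fold would be a natural refinement.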

Evaluation metrics
To assess the detection efficiency of the proposed algorithms, we chose the following metrics: accuracy, sensitivity, specificity, false positive rate (FPR), and area under the curve (AUC). They are defined as follows:
- Accuracy: the fraction of all data instances that are correctly identified. Accuracy = (TP + TN) / (TP + TN + FP + FN).
- Sensitivity: also called true positive rate (TPR), it measures the ratio of positive instances that are correctly classified. Sensitivity = TP / (TP + FN).
- Specificity: it measures the ratio of negative instances that are correctly classified. Specificity = TN / (TN + FP).
- FPR: it represents the probability of falsely rejecting the null hypothesis. FPR = FP / (FP + TN).
- AUC: the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance. It is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR).
TP, TN, FP, and FN are extracted from the confusion matrix after the classification operation; they denote, respectively, True Positive, True Negative, False Positive, and False Negative.

Work environment
Table 4 summarizes our work environment; it lists the hardware and the software together with their configuration or version.
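The metrics defined in the Evaluation metrics subsection can be computed from confusion-matrix counts as in the following minimal sketch. It uses the standard definitions of accuracy, sensitivity, specificity, and FPR, and computes AUC directly as the pairwise ranking probability stated above. The function names and the sample counts and scores are hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, and FPR from the counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr": fp / (fp + tn),           # equals 1 - specificity
    }


def auc(positive_scores, negative_scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half. This pairwise form is quadratic in the number
    of samples and is meant for illustration only.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
print(auc([0.9, 0.8, 0.4], [0.5, 0.3]))
```

For the multi-class experiments, one common option is to compute these quantities per class in a one-versus-rest fashion and average them; whether the paper does so is not stated here.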

RESULTS AND ANALYSIS
This section presents and discusses the results obtained. We assessed the approach on the three distributions of the NSL KDD dataset described above, and therefore carried out three types of classification: two-class, five-class, and twenty-three-class classification. For two-class classification, the dataset is divided into two subsets: normal data and attack data. For five-class classification, the dataset is divided into five subsets: one subset of normal data and four subsets of attack data, namely U2R, R2L, Probe, and DoS. For twenty-three-class classification, the dataset is divided into twenty-three subsets: one subset of normal data and twenty-two subsets of attack data, which are also derived from the U2R, R2L, Probe, and DoS attacks.
Our experiments were carried out in several steps. At each step, we increase the number of nodes in the big data cluster, store our data there, and apply the machine learning classifiers in a distributed and parallel way on the cluster. We then calculate accuracy, specificity, sensitivity, false positive rate (FPR), area under the curve (AUC), and processing time in order to evaluate performance. Figure 2 shows the classification accuracy for two, five, and twenty-three classes. Table 5 shows the other classification metrics (sensitivity, specificity, AUC, and FPR) for two, five, and twenty-three classes, and Figure 3 shows the processing time as a function of the number of nodes in the cluster.

As illustrated by Figure 2, the accuracy values reached by all the machine learning classifiers are generally very high. The KNN algorithm is the most accurate, with accuracy values of up to 99.9% for two-class classification, 99.9% for five-class classification, and 99.8% for twenty-three-class classification, which means that KNN is more powerful than SVM and decision tree at identifying each type of data, whatever the data distribution.

For the two-class and five-class distributions, where the data is spread over few classes, the decision tree is more accurate than SVM, with accuracies of 99.8% and 99.6% respectively, which suggests that the decision tree is efficient when the data is spread over few classes. For the twenty-three-class distribution, where the data is spread over many more classes, SVM reaches an accuracy of 99.4%, which suggests that SVM is more precise when the data is spread over many classes.
As shown in Table 5, the highest sensitivity values are those of the KNN algorithm: 100% for two classes, 92.4% for five classes, and 77.7% for twenty-three classes, which means that KNN identifies the nature of the data correctly more often than SVM and decision tree. The second highest sensitivity values are those of SVM, with 98.5% for two classes, 88.6% for five classes, and 66.3% for twenty-three classes, which indicates that SVM detects the type of data better than the decision tree. Indeed, the false positive rate (FPR) of KNN is zero for every type of classification, so KNN raises no false alarms where the other classifiers do. The FPR values of SVM are only 1.5% for two classes, 0.2% for five classes, and zero for twenty-three classes, which means that SVM detects with less error than the decision tree. We also notice that the specificity of KNN reaches 100% for two, five, and twenty-three classes, which means that KNN recognizes negative instances better than the other methods.
As shown in Figure 3(a-c), the processing time of the algorithms decreases as the number of nodes in the cluster increases. With a single-node cluster, the training and validation time taken by KNN is 1826 s for two-class classification, 1856.5 s for five-class classification, and 1792.8 s for twenty-three-class classification. These values decrease as the number of nodes increases, reaching, with a five-node cluster, 1667.6 s for two classes, 1611.9 s for five classes, and 1659.4 s for twenty-three classes. We observe the same trend for the other two methods, SVM and decision tree, which confirms that parallel and distributed processing reduces time consumption.

The experiments show that machine learning algorithms are very effective at detecting new hidden attacks and intrusions, and that applying them in a parallel way in a distributed environment improves time consumption.
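The time reduction reported for KNN can be quantified with a short calculation. The timings below are the one-node and five-node values quoted above; the speedup and parallel efficiency formulas are the standard ones and are an addition for illustration, not metrics reported in the paper. With these figures, the reduction works out to roughly 7 to 13 percent depending on the number of classes.

```python
# KNN training and validation time in seconds: (one node, five nodes),
# as quoted in the text above.
KNN_TIMES = {
    "2 classes": (1826.0, 1667.6),
    "5 classes": (1856.5, 1611.9),
    "23 classes": (1792.8, 1659.4),
}


def speedup(t_one, t_n):
    """Ratio of single-node time to n-node time."""
    return t_one / t_n


def reduction_percent(t_one, t_n):
    """Percentage of processing time saved relative to a single node."""
    return 100.0 * (t_one - t_n) / t_one


def efficiency(t_one, t_n, n_nodes):
    """Speedup divided by the number of nodes (1.0 means ideal scaling)."""
    return speedup(t_one, t_n) / n_nodes


for name, (t1, t5) in KNN_TIMES.items():
    print(
        f"{name}: speedup {speedup(t1, t5):.3f}, "
        f"reduction {reduction_percent(t1, t5):.1f}%, "
        f"efficiency {efficiency(t1, t5, 5):.2f}"
    )
```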

CONCLUSION AND FUTURE WORK
In this study, we proposed a new approach based on storing the large volume and variety of network traffic data using big data techniques and analyzing this data with machine learning algorithms in a distributed and parallel way, in order to detect new hidden intrusions with less time consumption. To validate the approach, we set up a big data cluster and chose the popular NSL KDD dataset for the evaluation. The assessment was carried out in several steps: at each step, the number of nodes in the big data cluster is increased, NSL KDD is stored in the cluster, the machine learning algorithms are applied for the analysis, and the evaluation metrics are calculated. The experimental results showed that the machine learning methods are very effective at detecting intrusions and that applying them in a parallel and distributed way reduces time consumption considerably. In future work, we will try to implement a real intrusion detection system (IDS) based on our distributed approach.