Feature selection for DDoS detection using classification machine learning techniques

ABSTRACT


INTRODUCTION
Computer security is a factor that must be considered in the era of Industrial Revolution 4.0: systems need to prevent various threats and to detect and repair any damage that occurs. According to [1], threats to information systems can be broadly divided into two types, active threats and passive threats. Active threats include fraud and crimes against computers, while passive threats include system failure, human error, and natural disasters. System failure refers to the failure of component equipment such as a hard disk or the computer network itself. Consequently, computer-based systems and networks can be vulnerable to fraud and data theft. One type of attack that persists and is difficult to stop is the Distributed Denial of Service (DDoS) attack. This attack is carried out by sending many requests to a site or web server so that the system stalls and cannot function at all. Another very dangerous attack is the sniffer attack. This technique is implemented by creating a program that tracks someone's data packets as they cross the internet and captures passwords or other contents. No less important is the spoofing attack, which falsifies e-mail or web addresses in order to trap users into entering important information such as passwords or credit card numbers. Of these various types of attack, this study focuses on DDoS attacks.
DDoS attacks in Indonesia have been increasing recently. Data showed that 79% of total DDoS attacks in the fourth quarter of 2017 targeted game applications, a figure three percent lower than in the previous quarter. Attacks on telecommunication and internet applications increased to 6% from only 3%, and attacks on financial services applications rose 2% to 4% in the last quarter of 2017. DDoS is a form of DoS attack in which an attacker makes the network inaccessible (slowing it down or losing data) by attacking from more than one Internet Protocol (IP) address. This causes a flood of traffic and makes it difficult to identify the attacker. DDoS attacks are very detrimental both operationally and financially. The B2B International survey conducted in collaboration with Kaspersky Lab, entitled Global Corporate IT Security Risks 2015, found that a DDoS attack on an online resource can cause financial losses of US$ 53-417 thousand.
To anticipate attacks on network security, researchers continue to look for the best techniques for detecting DDoS attacks. The research in [2] developed a statistical DDoS detection system using Multivariate Correlative Analysis (MCA). MCA uses the Triangle-Area-Map (TAM) representation to describe the relationship between traffic features by calculating the distance from one feature value to another for each extracted feature. The MCA output was analyzed using Mahalanobis distance to serve as reference (observation) data. Detection compared the observed data against the reference data using a threshold, and anomaly classification used Mahalanobis distance and cosine distance to calculate the distance between the TAM of the observed traffic and the TAM of the reference traffic. The system was tested by measuring the algorithm's Detection Rate (DR), False Positive Rate (FPR), and Accuracy (ACC).
The research in [3] developed a detection method that examines DDoS attack patterns through network packet analysis and uses machine learning techniques to learn those patterns. It analyzed a large number of network packets provided by the Center for Applied Internet Data Analysis (CAIDA) and implemented a detection system using a Support Vector Machine (SVM) with a radial basis function (Gaussian) kernel. The resulting system detected DDoS attacks accurately. The study in [4] explained that attackers can launch larger DoS attacks by using zombie hosts (computers that have been injected with a remote-control script, forming a botnet) against a target in a distributed and simultaneous manner, which can knock out the target quickly. Based on a number of studies, the CUSUM algorithm is recognized as being quite reliable in detecting the DDoS attacks that commonly occur today. UDP flood attacks have also dominated several major attacks worldwide. Because UDP floods dominate current attacks, the author of that study set out to create an Intrusion Detection System (IDS) using the CUSUM algorithm, expecting it to detect UDP flood attacks with high accuracy and a fast detection time. The research in [5] aimed to develop a new approach to detecting DDoS attacks based on statistically analyzed network logs, with a neural network as the detection method. Training and testing data were taken from the CAIDA DDoS Attack 2007 dataset and from independent simulations. Testing this method yielded an average recognition rate of 90.52% across three network conditions (normal, slow DDoS, and DDoS). The new approach was expected to complement Intrusion Detection Systems (IDS) in predicting DDoS attacks.
In [6][7], byte-level analysis of HTTP traffic is presented as a practical solution to the problem of network intrusion detection and traffic analysis. Such an approach does not require any knowledge of the applications running on web servers or any pre-processing of incoming data. The work applied three N-gram-based techniques to the problem of HTTP attack detection. The goal of such techniques was to provide a first line of defense by filtering out the vast majority of benign HTTP traffic. The techniques were evaluated in terms of attack detection accuracy and performance, and were found to be more accurate and more efficient than a previously analyzed HMM-based technique.
Research conducted by [3] developed an intelligent system for detecting DDoS attack patterns using network packet analysis and machine learning techniques. In that study, Klyuev analyzed a large number of network packets provided by the Center for Applied Internet Data Analysis and implemented a detection system using an SVM with a radial basis function (Gaussian) kernel. Three types of datasets were prepared, using three and five features. The detection system was more than 85% accurate on all types of datasets and 98.7% accurate with five features. The results showed that an SVM trained on the proposed features detects DDoS attacks with high accuracy. In [8], a fast-entropy, flow-based approach showed a significant reduction in computational time compared with conventional entropy computation while maintaining good detection accuracy. Network traffic was analyzed and the fast entropy of requests per flow was calculated. A DDoS attack was detected when the difference between the entropy of flow counts and the mean entropy in that time interval exceeded a threshold, and the threshold value was updated adaptively based on traffic pattern conditions to improve detection accuracy. That research proposed three methods for detecting DDoS attacks: fast entropy, flow aggregation, and adaptive threshold.
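The entropy-based idea attributed to [8] can be illustrated with a minimal sketch. It uses plain Shannon entropy and a fixed threshold, not the fast entropy computation or adaptive threshold of [8]; the function names and the input format (one list of flow identifiers per time window) are assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def detect_by_entropy(windows, threshold):
    """Flag each window whose entropy differs from the running mean by more than threshold.

    windows: list of lists of flow identifiers (hypothetical input format).
    threshold: fixed here; [8] updates it adaptively.
    """
    flags, history = [], []
    for flows in windows:
        h = shannon_entropy(Counter(flows).values())
        mean = sum(history) / len(history) if history else h
        is_attack = abs(h - mean) > threshold
        flags.append(is_attack)
        if not is_attack:
            history.append(h)  # only benign-looking windows update the baseline
    return flags
```

A window dominated by a single flow has entropy near zero, so it stands out against a baseline of evenly spread traffic.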
In [9], a new dataset was collected that included modern types of attacks not used in previous research. The dataset contained 27 features and five classes. A network simulator (NS2) was used because it offers high reliability and reasonable results that reflect a real environment. According to [10][11][12], attacks on or intrusions into a system are almost certain to happen in today's world of information technology. To overcome this, several technologies can be used, such as firewalls or intrusion detection systems (IDS). Unlike firewalls, which only inspect incoming packets based on IP address and port, an IDS monitors the payloads of the packets that come into a computer and then decides whether each incoming packet is malicious. An example of an IDS is Snort, an open-source application that uses string matching to detect malicious activity. One weakness of a string-matching IDS is that a string in a packet must match exactly; even a slight difference can let an attack go undetected, which makes it difficult to detect attacks that have a similar flow but a different pattern. Therefore, that work proposed an intrusion detection method using n-grams and cosine similarity to measure the similarity between packet sequences, so that detection looks for similarity between a payload and existing signatures. In contrast to Snort, packets are not matched against attack patterns but against the pattern of legitimate access to a web page by legitimate users, so packets with high similarity are regarded as benign and those with low similarity are regarded as attacks. Tests with different threshold values showed that a threshold of 0.8 with n = 3 gave the best accuracy. This intrusion detection system can also detect various types of attacks without having to define them in advance, making it more resistant to zero-day attacks.
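The n-gram and cosine similarity approach described above can be sketched as follows. This is only an illustration of the general technique, not the implementation from [10][11][12]; the function names and the use of byte-level n-gram counts are assumptions, while the defaults n = 3 and threshold 0.8 are the values reported in the text.

```python
import math
from collections import Counter

def ngrams(payload: bytes, n: int = 3) -> Counter:
    """Count the byte n-grams occurring in a payload."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_benign(payload: bytes, legit_profiles, threshold: float = 0.8, n: int = 3) -> bool:
    """Treat a payload as benign if it is similar enough to any legitimate profile."""
    grams = ngrams(payload, n)
    return any(cosine_similarity(grams, p) >= threshold for p in legit_profiles)
```

Because the profiles describe legitimate traffic rather than known attacks, a payload that resembles nothing seen before falls below the threshold and is flagged.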
According to [5], [13][14], distributed denial-of-service (DDoS) is an attack type whose volume, intensity, and mitigation costs continue to rise as organizations grow in scale. That study aimed to develop a new approach to detecting DDoS attacks based on the characteristics of network activity, using a neural network with fixed moving average window (FMAW) functionality as the detection method. Training and testing data were taken from the CAIDA DDoS Attack 2007 dataset and a standalone simulation. Testing produced a detection rate of 90.52% across three network conditions (normal, slow DDoS, and DDoS). The result is a new approach to detecting DDoS attacks: a system that predicts their occurrence.
The study in [15] classifies network traffic containing botnets using the K-Nearest Neighbour algorithm. The algorithm calculates the distance over each feature in the dataset and then identifies the type of flow based on the majority class among the k nearest neighbors. The test result in that study was 92.57%, with k set to the system default of 5. The best k value could not be determined because tests with different values of k produced results that differed considerably.
From the problems described above, the problem addressed in this study is the number of features in the dataset: the aim is to determine which features are most important for detecting DDoS attacks. To measure the level of detection, this study uses classification machine learning algorithms, namely Naive Bayes, Neural Network, SVM, KNN, and Random Forest. The expected end result is a comparison of which of the five algorithms is most accurate in detecting DDoS attacks with the selected features.

RESEARCH METHOD
The dataset used in this study was obtained from [9]. The dataset in that study contained 734,627 records, while this study used 5,899 records for training and 1,770 for testing. The steps in this study were as follows, and Figure 1 shows the research process:
• Data are collected on a live network and captured using Wireshark.
• The data are then converted to CSV.
• Features are selected using a regression model.
• Attributes that are used are kept; attributes that are not used are removed.
• The data are then analyzed with a data mining tool using several machine learning algorithms.
Figure 1. Research process
From these steps, the classification of DDoS attacks (Smurf Attack, UDP Flood, SQL Injection, and HTTP Flood) and normal packets uses five classification models: Naïve Bayes, Neural Network, SVM, KNN, and Random Forest. Detection accuracy is assessed using the parameters TP, FP, Recall, Precision, and F-Measure.

Dataset
The features used in this study were 25 features obtained from a real-time attack simulation carried out over 3 days, with 4 hours per session. The attack simulation produced 27 features [16], which were extracted using the CICFlowMeter-V3 online application from the Canadian Institute for Cybersecurity. Because there were too many features, features without an appropriate value were removed. Features were selected using a regression model in SPSS; the features used for training and testing in this study are shown in Table 1.

Feature selection
To identify the most useful features for detecting DDoS attacks, the dataset was analyzed using linear regression with the forward method. In terms of mutual information, the purpose of feature selection is to find a feature set S with m features {x_i} which jointly have the largest dependency on the target class c. This scheme, called Max-Dependency, has the following formula [17]:
$$\max D(S, c), \quad D = I(\{x_i,\ i = 1, \ldots, m\};\ c) \qquad (1)$$
Obviously, when m equals 1, the solution is the feature that maximizes I(x_j; c) (1 ≤ j ≤ M). When m > 1, a simple incremental search scheme is to add one feature at a time: given the set with m-1 features, S_{m-1}, the m-th feature is the one that contributes the largest increase of I(S; c).
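The incremental Max-Dependency search can be sketched in a few lines. This is a minimal illustration for discrete features using a simple plug-in estimate of mutual information; it is not the SPSS forward regression procedure used in this study, and the function names and the dict-of-columns input format are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two equal-length sequences of hashable values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def forward_select(features, target, m):
    """Greedily pick m feature names that maximize the joint dependency I(S; c).

    features: dict mapping feature name -> list of discrete values (hypothetical format).
    target: list of class labels, same length as each feature column.
    """
    m = min(m, len(features))
    selected = []
    while len(selected) < m:
        def joint_mi(name):
            # Treat the tuple of selected feature values as one joint variable S.
            cols = [features[f] for f in selected + [name]]
            return mutual_information(list(zip(*cols)), target)
        best = max((f for f in features if f not in selected), key=joint_mi)
        selected.append(best)
    return selected
```

With many features or few samples, the joint estimate becomes unreliable because most value combinations are seen only once.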

Machine learning algorithms
Naïve Bayes
The Naive Bayes classifier is a statistical model that calculates, for each class, the probability of a given group of attribute values and selects the most probable class. In this method, all attributes contribute to the decision with equal weight, and each attribute is assumed to be independent of the others [18]. Bayes' theorem is:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
where
X: data whose class is unknown
H: the hypothesis that the data belong to a specific class
P(H|X): the probability of hypothesis H given condition X (posterior probability)
P(H): the probability of hypothesis H (prior probability)
P(X|H): the probability of X given hypothesis H
P(X): the probability of X
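As an illustration of how these quantities combine under the independence assumption, the following minimal sketch trains and applies a Naive Bayes classifier on categorical attributes. The function names and data layout are hypothetical, no smoothing is applied (so an unseen attribute value gives a class a score of zero), and P(X) is omitted because it is the same for every class.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> Counter of values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict(model, row):
    """Return the class H that maximizes P(H) * product of P(x_i | H)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for h, n_h in class_counts.items():
        score = n_h / total  # prior P(H)
        for i, value in enumerate(row):
            score *= value_counts[(h, i)][value] / n_h  # likelihood P(x_i | H)
        scores[h] = score
    return max(scores, key=scores.get)
```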

Random forest
Random forest is an ensemble learning method first proposed by [19]. It is a combination of classification trees in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Random forest has been widely used for both classification and regression because of its strong performance and simple structure. To handle unbalanced data, the RF algorithm is slightly modified in its selection of training data by balancing the number of records in the majority and minority classes. This technique is called Balanced Random Forest (BRF).
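One common way to realize the balancing step is to draw, for each tree, a bootstrap sample containing the same number of records from every class. The sketch below shows only that sampling step under this assumption; the function name is hypothetical and the tree training itself is omitted.

```python
import random
from collections import defaultdict

def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: equal-sized bootstrap draws from every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_minor = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        # Sample with replacement, as many records as the smallest class has.
        sample.extend(rng.choices(indices, k=n_minor))
    return sample
```

Each tree then sees a balanced view of the data, while different trees see different majority-class records.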

Neural network
A neural network has many advantages over other calculation methods, notably the ability to acquire knowledge even in the presence of disturbances and uncertainty. This is because a neural network can generalize, abstract, and extract statistical properties from data. In addition, a neural network can represent knowledge flexibly: it can create its own representation through self-organization. Figure 2 shows the neural network architecture.

Support vector machine
The concept of SVM can be explained simply as an attempt to find the best hyperplane that functions as a separator of two classes in the input space. Figure 3 shows several patterns that are members of two classes. In the Lagrangian L of the resulting optimization problem, αi are the Lagrange multipliers, which are zero or positive (αi ≥ 0). The optimal value of the equation can be calculated by minimizing L with respect to w and b and maximizing L with respect to αi.
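For reference, the standard hard-margin form of this Lagrangian is given below. It is the textbook formulation, shown here as an assumption about the equation the text refers to; the training patterns x_i, their class labels y_i ∈ {-1, +1}, and the sample count n are notation introduced for this illustration.

$$L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad \alpha_i \ge 0$$

Setting the derivatives of L with respect to w and b to zero gives $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$, so only the patterns with αi > 0 (the support vectors) determine the hyperplane.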

K nearest neighbour
The K-Nearest Neighbor (KNN) algorithm is a supervised method [20] that belongs to the group of instance-based learning algorithms. The algorithm is simple: it works on the similarity between the test sample and the training samples [21], finding the group of k objects in the training data that are closest (most similar) to a new or test object [22]. KNN is a simple classification technique, but it gives good results [23]. In general, the distance between two objects x and y with n attributes is defined using the Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
KNN has several advantages: it is robust to noisy training data and effective when the training data are large. Its weaknesses are that it requires a value for the parameter k (the number of nearest neighbors), that it is unclear which type of distance and which attributes should be used to get the best results, and that its computing cost is high because the distance from each query instance to every training sample must be calculated [15].
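The procedure can be illustrated with a minimal sketch that combines the Euclidean distance with a majority vote among the k nearest training samples. The function names are hypothetical, and the default k = 5 follows the value mentioned for [15].

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_x, train_y, query, k=5):
    """Classify query by majority vote among its k nearest training samples."""
    # Every training sample is compared with the query, which is the cost noted above.
    nearest = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```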

Evaluation metrics
Effective detection is the crux of our work; a wrong detection can prevent genuine packets from reaching their destinations. We want to calculate the accuracy of our detection mechanism for genuine and attack traffic and then compare it with other similar research that has reported accuracy. The performance of the classifiers is evaluated and a comparative analysis is carried out. Classification accuracy is used as the primary performance measure and is measured as the ratio of the number of correctly classified instances in the test dataset to the total number of test cases. The trained models are evaluated on precision, recall, F-measure, and accuracy using 10-fold cross-validation [24]. Accuracy is calculated as:
$$\text{Accuracy} = \frac{\text{number of correctly classified instances}}{\text{total number of test instances}}$$
Table 2 shows the analysis results. To detect DDoS attacks, the ideal features are packet delay, packet origin (from node), packet destination (to node), and source IP address. Of the 25 attributes contained in the dataset, only 4 can be used to detect DDoS attacks; the rest did not meet the criteria for use in classification with machine learning techniques. The R Square value of each attribute is given in Table 3.
Each attribute has a sig. value less than 0.05 (0.000 < 0.05), meaning that the PKT_DELAY, FROM_NODE, TO_NODE, and SRC_ADD attributes are highly significant in detecting types of DDoS attacks such as Smurf Attack, UDP Flood, SQL Injection, and HTTP Flood.
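The evaluation criteria named in this section (accuracy, precision, recall, and F-measure) can be computed from predicted and true labels as in the following minimal sketch. It treats one class as positive and uses the standard definitions; the function name is hypothetical and the cross-validation loop is omitted.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall, and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

For a multi-class problem such as the five classes used here, the function can be called once per class and the per-class values averaged.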

Machine learning algorithms
Using the selected attributes, training and testing were carried out on the dataset with the five algorithms specified in the method; the results are shown in Table 4 and Figure 4.
Figure 4. Detection accuracy graph
The highest accuracy in detecting DDoS attacks, 98.70%, was achieved by the Random Forest and Neural Network algorithms.

CONCLUSION
In this paper, we put together a new dataset that covers modern types of attacks not used in previous studies. The dataset contains 25 features and five classes. Attacks were carried out directly against the target server, and packet data were captured using Wireshark, a trusted application chosen for its ability to produce valid results that reflect the real environment. The collected data cover various types of attacks that target the network application layer. The dataset was used to train and test five classification techniques: Neural Network, Naïve Bayes, Random Forest, KNN, and Support Vector Machine (SVM). The processed datasets have different percentages, with the aim of facilitating classification. From this study it can be concluded that, of the five classification techniques used, Random Forest achieved the highest accuracy (98.70%), with a weighted average of 98.4%. This means that the technique can detect DDoS attacks accurately in the application that will be developed.