Low-rate distributed denial of service attacks detection in software defined network-enabled internet of things using machine learning combined with feature importance

ABSTRACT


INTRODUCTION
The internet of things (IoT) is a concept in which various smart devices are connected via the Internet to collect and transfer data or information [1]. The advancement of IoT is accompanied by efforts to modernize the global communication infrastructure, which revolutionizes many aspects of life by enabling system interconnection with intelligent communication [2]. Examples of IoT implementations include medical devices, medical care, driverless vehicles, industrial robots, and smart city infrastructure with remote interaction models [1]. The rapid development of IoT will increase the number of smart devices connected to public networks, raising problems of complexity and security [2]. As the number of IoT devices grows, IoT networks remain vulnerable to availability attacks, such as denial of service (DoS) and distributed denial of service (DDoS). Such attacks can quickly overwhelm devices connected to an IoT network.
Moreover, the use of botnets can increase the volume of DDoS attacks, which can disrupt IoT services. In addition, traditional security mechanisms tend to be unsuitable because IoT devices have less memory, processing capacity, and power. Due to its resource-limited characteristics, IoT tends to have more vulnerabilities that attackers can easily exploit [2]. This raises concerns about the security risks of IoT networks caused by the large-scale incorporation of smart devices. As IoT develops rapidly, attackers make more and more efforts to find loopholes through which to infiltrate the IoT network. Among many attacks, the low-rate distributed denial of service (LRDDoS) attack is a serious threat to IoT network infrastructure. LRDDoS attacks present an ongoing threat to almost every internet service, as they exhaust server resources and can potentially bring down the network. The main challenge in detecting LRDDoS attacks is the complexity of the attack pattern. Massive traffic analysis consumes significant computing resources and even increases the risk of memory overflow.
IoT devices integrated with a software defined network (SDN) [3], namely SDN-Enabled IoT, can significantly reduce the computing overhead and provide additional security [4]. IoT aims to distribute data, and SDN provides services for network management by separating the control and data planes. However, because of this separation, the controller becomes a vulnerable target for cyber security attacks. Among all of the possible attacks, an availability threat may be directed at the controller by overwhelming the node with flooding, namely DDoS. In response, the controller will process every unwanted packet from the attackers. If the controller crashes, the entire network will collapse [5].
DDoS attacks are categorized into Flood and Shrew attacks according to their characteristics and attack speed. Flood attacks are further divided, based on the packet transmission speed, into high-rate attacks (DDoS attacks with massive delivery rates) and low-rate attacks (Flood attacks whose transmission speed is less than 1,000 bps) [6]. In the SDN-Enabled IoT network, attacks occur at several levels, such as high-rate DDoS (HRDDoS) attacks in the control plane and LRDDoS attacks in the data plane. Attackers launch HRDDoS attacks at the SDN control layer by sending large amounts of useless data to exhaust controllers and network resources.
A controller running out of resources will cause the entire SDN network to crash. However, HRDDoS attacks on controllers have traffic characteristics that are easy to identify, namely a significant rise in the amount of traffic within a short period [7]. In contrast, LRDDoS attacks are hard to detect because they have the same characteristics as regular traffic. The general (statistical) DDoS attack detection mechanism is therefore ineffective in detecting LRDDoS, because deep packet inspection (DPI) must be performed in order to retrieve the detailed information in the packet's header [8]. Unlike HRDDoS attacks, LRDDoS generates very little attack traffic and is stealthy. Through a slow and inconspicuous process, LRDDoS causes the target system's performance to decrease gradually until it completely fails [9]. Low-rate DDoS attacks take the form of periodic pulses in which the attack traffic is concentrated. The average attack traffic is small, but the attack is carried out repeatedly so that it reduces the quality of service [6]. LRDDoS has the same characteristics as normal network traffic in a data center (low delay, diversity, and synchronization) [10], so it is not easily detected when its characteristics match normal traffic. Low-rate DDoS attacks target the data layer with small volumes of attack traffic. Attackers can take advantage of this to launch LRDDoS attacks that hide in normal data streams and are difficult to detect with traditional methods.
Several studies have been conducted to detect LRDDoS attacks on SDN and SDN-Enabled IoT networks. Altamemi et al. [11] proposed a method for classifying DDoS attacks, including both high-rate and low-rate attacks, based on real-time traffic datasets using machine learning methods (Gaussian Naïve Bayes (GNB), logistic regression (LR), and decision tree (DT)). The results showed that DT produced better accuracy than the other algorithms, reaching 99.9%. However, that paper did not use a dataset extracted using the OpenFlow protocol, which would have provided better data classification. Wani and Revathi [12] proposed a ransomware detection system in an IoT environment integrated with SDN, namely IoTSDN-RAN. The classification was performed by inspecting the constrained application protocol (CoAP) packets received by the controller using a combination of GNB and principal component analysis (PCA). The results indicated that the proposed method could predict ransomware traffic with an accuracy of 97.91%. John and Nagappasetty [13] investigated a scheme for detecting the Slowloris attack, a low-bandwidth attack that simultaneously opens many hypertext transfer protocol (HTTP) connections between the attacker and the targeted server. The authors used a statistical approach based on the flow statistics provided by OpenFlow. However, the results indicated that the statistical approach did not detect the attack as fast as the machine learning approach, with a detection time of 260 s. Azmi and Sumadi et al. [14] aimed to detect LRDDoS using the support vector machine (SVM) combined with feature importance based on logistic regression (LR) [15], [16]. Feature importance is useful for ranking the features contained in the OpenFlow protocol to ease the controller's classification process. The best accuracy was obtained by SVM with a linear kernel, reaching 100%. However, in terms of training time, the linear SVM took about 23.6 seconds, while SVM with a radial basis function (RBF) kernel was much faster, at only 1.5 seconds, but with lower accuracy, averaging only 74.3%.
Cheng et al. [7] studied machine learning for detecting LRDDoS attacks on SDN-Enabled IoT networks. In that study, the researchers addressed one type of LRDDoS, the shrew attack. The features used were extracted from the OpenFlow protocol and divided into two categories, namely stateless and stateful. The researchers used several algorithms: SVM, the multinomial Naive Bayes algorithm (NB), random forest (RF), and K-nearest neighbors (KNN). The dataset comprised 204,888 packets containing, besides normal traffic, transmission control protocol synchronize (TCP SYN) packets and repeated TCP transmissions. The number of normal data packets was 48,509, including hypertext transfer protocol secure (HTTPS), HTTP, internet control message protocol (ICMP), and message queuing telemetry transport (MQTT) packets. The RF algorithm obtained the highest accuracy, at 97%, and had the best effect on the switch.
Maslan et al. [17] conducted a similar study, combining linear regression models (ANOVA) in the feature reduction process to increase the effectiveness of the classification process using machine learning. However, the dataset used in that study was extracted in a test bed environment that did not use the SDN architecture. According to the results, RF was the best algorithm in the classification process, with an accuracy of 98.70%. Khempetch and Wuttidittachotti [18] employed deep learning methods for detecting DDoS, specifically the deep neural network (DNN) and long short-term memory (LSTM). The results indicated that the algorithms could successfully classify the attack, with an average accuracy of 99.97%. Huraj et al. [19] stated that IoT integrated with manufacturing processes is potentially exposed to DDoS attacks. The researchers described case studies of IoT device applications and showed the vulnerabilities of these devices. In addition, they proposed using sampled flow (sFlow) together with machine learning to detect and protect against DDoS attacks during production.
In a study by Pande et al. [20], DDoS detection was carried out using machine learning techniques; the algorithm used for model training was RF, resulting in an accuracy of 99.76%. Alashhab et al. [21] found that machine learning is the most effective of the proposed LRDDoS detection mechanisms. The researchers divided machine learning-based LRDDoS detection mechanisms into classification-based and deep learning-based categories. Wang et al. [22] explained that DDoS attacks are centered not only on the data plane but also on the control plane, causing fluctuations in the number of flows. In that study, the researchers built a DDoS attack against a separate SDN architecture and a new model to define the attack flexibly. The detection model used was supervised learning. At the testing stage, the models that produced the highest accuracy were decision tree (DT), KNN, and bagging tree (BT), with values above 90%. However, the sample used in that study was insufficient to obtain better accuracy because it used only one feature.
Based on previous research, it can be concluded that machine learning is an effective method of detecting LRDDoS attacks. However, no previous work has provided a thorough analysis of LRDDoS detection using minimal resources in an IoT environment while keeping the dataset conformant with the OpenFlow standard. In this study, the proposed solution for dealing with LRDDoS attacks is an integration of SDN-Enabled IoT with machine learning combined with Feature Importance. Machine learning is used to create the models that the controller applies in the classification process. The model generation process is combined with three feature importance methods, namely LR, random forest classifier (RFC), and random forest regression (RFR), to reduce the number of features, so that the load on the controller is reduced because only the relevant features are used. The models are trained using eight different algorithms, namely SVM with linear and RBF kernels, RF, DT, multilayer perceptron (MLP), GNB, AdaBoost (ADB), and KNN. Each model used in the classification process produces accuracy, precision, recall, F1-score, and classification-loss values. The contribution of this research is LRDDoS detection that uses several supervised algorithms combined with three different Feature Importance methods to reduce computation in the classification process, together with an LRDDoS dataset adjusted to the OpenFlow protocol based on port statistics. Adjusting the dataset also improves the accuracy of the detection mechanism, since the features are easily extracted on the controller. In addition, this study compares which algorithm is the most appropriate for detecting LRDDoS attacks under each Feature Importance method.

RESEARCH METHOD

Emulation topology and scenario
In this study, the test was run on an Ubuntu 20.04 LTS computer with an Intel® Core™ i5-10400 CPU @ 2.90 GHz, 8 GB of RAM, and a 240 GB SSD. The SDN-Enabled IoT network topology was emulated with the Mininet emulator [23]. As shown in Figure 1, the network architecture in this topology consisted of 7 Open vSwitch (OvS) switches [24], [25], 1 RYU controller [26], and 8 hosts. The applied topology was a tree with configuration variables of depth=3 and fanout=2. In this topology, h1 acted as the attacker, and h6 acted as the victim and as a CoAP server [27] with a logical address of 10.0.0.6:5683. As the attacker, h1 flooded the topology using the TCPReplay tool [28] three times with different packet transmission speeds of 20, 50, and 70 packets per second (PPS).
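The switch and host counts quoted above follow directly from the tree parameters. The short sketch below is not the authors' emulation script and does not call Mininet; it only assumes a full tree in which every level above the leaves holds switches and the hosts hang off the lowest switch level. The function name `tree_counts` is introduced here for illustration.

```python
def tree_counts(depth: int, fanout: int) -> tuple[int, int]:
    """Return (switches, hosts) for a full tree topology.

    Assumption: levels 0..depth-1 hold switches, and each switch on the
    lowest switch level connects `fanout` hosts.
    """
    switches = sum(fanout ** level for level in range(depth))
    hosts = fanout ** depth
    return switches, hosts


if __name__ == "__main__":
    switches, hosts = tree_counts(depth=3, fanout=2)
    print(f"switches={switches}, hosts={hosts}")  # switches=7, hosts=8
```

With depth=3 and fanout=2 this gives 1 + 2 + 4 = 7 switches and 2^3 = 8 hosts, which agrees with the 7 OvS switches and 8 hosts in the topology description.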
The attack carried out by h1 was sent from a *.pcap file containing dummy packets. The file held 39,994 packets using the CoAP (POST) protocol, each with randomly generated source IP and MAC addresses. The header information of packets arriving at the OvS was processed according to the rules defined by the controller. If there was no matching header, the packet was detected as a new packet.
Figure 1. Emulation topology

LRDDoS dataset
The dataset used in this study was built on the OpenFlow protocol to investigate the impact of LRDDoS attacks [29]. The data was generated by crafting CoAP packets using Scapy and transmitting both normal and LRDDoS packets using TCPReplay. The controller extracted the received packets using the OpenFlow features described in Table 1. All available hosts on the network generated normal traffic, communicating with the server (h6) using CoAP POST messages. In contrast, the LRDDoS traffic consisted of dummy packets with randomly generated source addresses. The data was divided into two parts: a training set of 160,006 packets and a test set of 39,994 packets, for a total of 200,000 packets. The dataset had a 1:1 ratio of LRDDoS to normal packets.
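To make the idea of "randomly generated source addresses" and the dataset split concrete, the following is a minimal standard-library sketch. It is not the Scapy script used to build the dataset; the helper `random_source` and the constant names are assumptions introduced here, and the generated addresses are purely illustrative (for example, no attempt is made to avoid reserved address ranges).

```python
import ipaddress
import random

# Packet counts taken from the dataset description.
TOTAL_PACKETS = 200_000
TEST_PACKETS = 39_994
TRAIN_PACKETS = TOTAL_PACKETS - TEST_PACKETS  # 160,006


def random_source(rng: random.Random) -> tuple[str, str]:
    """Return a randomly generated (MAC, IPv4) source address pair."""
    mac = ":".join(f"{rng.getrandbits(8):02x}" for _ in range(6))
    ip = str(ipaddress.IPv4Address(rng.getrandbits(32)))
    return mac, ip


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sketch is repeatable
    for _ in range(3):
        print(random_source(rng))
    print(f"train={TRAIN_PACKETS}, test={TEST_PACKETS}")
```

Because every dummy packet carries a different source, each one looks like a previously unseen sender when it reaches the switch.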
The packets were extracted using the OpenFlow features, including the standard IPv4 and UDP/TCP headers and the OvS port usage statistics. The total number of features contained in the OpenFlow protocol was 21. This number could still be excessive and burden the controller in the classification process. To reduce this burden, the number of features was reduced using feature selection based on the feature importance coefficient score. The Feature Importance methods used in this study were LR, RFC, and RFR. LR and RFC each selected eight features, while RFR selected only two, which could improve the training model's performance and reduce the workload on the controller. The results of the feature importance calculation can be seen in Table 1. Features marked "-" were not used in the model generation process because their score was equal to 0. Only features with a value greater than 0 or less than 0 were retained. Features with a coefficient value of 0 would not affect the evaluation variables produced by the classification process based on the generated model. The difference between the LR and RF techniques for computing the feature coefficient was that the LR method was calculated with all features as input to the model, while RFC and RFR calculated the coefficient separately for each feature [30].

Model generation and classification process
Figure 2 shows the system block diagram, which includes a feature reduction process that preserves the information considered essential or relevant, as judged by the coefficient score assigned to each input feature. This feature reduction process used three Feature Importance methods, LR, RFC, and RFR, followed by model generation, which employed eight different algorithms. Every feature in the training set received a coefficient value, and its relevance to the classification process was assessed. In Table 1, the relevant features are shown with a positive or negative value, while those with a value of 0 are removed because they have no significant impact on the classification process. The features selected by each Feature Importance method were then used in the training stage of the classification model with SVM Linear, SVM RBF, RF, DT, MLP, GNB, ADB, and KNN. This stage generated a classification model used by the SDN-Enabled IoT controller application to detect attacks from incoming packets. The classification process performed by the controller was faster because it used fewer resources by selecting the most relevant features based on the coefficient score. Therefore, the controller did not need to extract all 21 features. The use of Feature Importance also prevented a decrease in the quality of the model. After the model was completed, it was used as the classification model on the SDN-Enabled IoT controller. The model was added to the simple_switch_13 application that already existed on the RYU controller. The models formed from the 8 algorithms and 3 Feature Importance methods classified packets as either LRDDoS or normal packets. The classification results were stored in a file for measuring the level of effectiveness using accuracy, precision, recall, and F1-score. The result for each classified packet was compared with its original class in the testing set. In the classification process, some data was not successfully classified because the link on the OvS was overwhelmed. This condition could be measured by calculating the classification loss value from the total of all successfully categorized packets.
Figure 3. Classification process in SDN-enabled IoT
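The evaluation variables named above can be computed from the stored classification results as in the sketch below. The accuracy, precision, recall, and F1-score formulas are the standard ones; the classification loss formula is an assumption made here (the share of sent test packets that never produced a classification result), since the text does not give an explicit formula. The function name `evaluate` and its parameters are hypothetical.

```python
def evaluate(true_labels, predicted_labels, packets_sent, positive=1):
    """Compute the evaluation variables for one classification run.

    Assumption: classification loss = (sent - classified) / sent, where
    `classified` is the number of packets that received a prediction.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    classified = len(pairs)
    accuracy = (tp + tn) / classified if classified else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    loss = (packets_sent - classified) / packets_sent if packets_sent else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "classification_loss": loss}


if __name__ == "__main__":
    # Toy example: 8 packets sent, only 6 were classified (1 = LRDDoS).
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 1]
    print(evaluate(y_true, y_pred, packets_sent=8))
```

Note that accuracy, precision, recall, and F1-score are computed only over the packets that were actually classified, so they should be read together with the classification loss.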

Feature importance and reduction
A large number of features can burden the controller, because the more features it must process, the greater the load it receives. Therefore, a feature reduction process is needed to reduce the number of features used in the classification process by selecting the most relevant features and removing those that contribute little to the model being trained and may even reduce its quality. The feature selection process in this study applied three different Feature Importance methods, namely LR, RFC, and RFR. The dataset used in this study has a total of 21 features. With the Feature Importance methods, only certain features are used in the classification process, which reduces the load received by the controller.
The feature importance score can be calculated both for problems that predict numerical values (regression) and for problems that predict class labels (classification). The Feature Importance method selects features based on positive and negative coefficient scores. The coefficient score provides the basis for deciding which features are considered essential. Features with a coefficient score of 0 are removed to prevent poor model quality. The coefficient score is obtained by assigning a score to each input feature of a predictive model; it indicates the relative importance of each feature when making predictions. As shown in Table 2, after feature selection with the LR and RFC Feature Importance methods, only eight of the 21 features have a coefficient score other than 0, while for RFR there are only two relevant features. The selected features affect the prediction model because they have a coefficient score other than 0, which indicates that they play an essential role in the classification process.
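The selection rule described above (keep every feature whose coefficient score is not 0) can be sketched in a few lines. The feature names and scores below are hypothetical placeholders, not values from Table 2, and the helper names `select_features` and `project` are introduced here for illustration only.

```python
def select_features(scores: dict[str, float], tol: float = 0.0) -> list[str]:
    """Keep features whose coefficient score is nonzero (positive or negative)."""
    return [name for name, score in scores.items() if abs(score) > tol]


def project(sample: dict[str, float], selected: list[str]) -> list[float]:
    """Reduce one extracted packet record to the selected features only."""
    return [sample[name] for name in selected]


if __name__ == "__main__":
    # Hypothetical coefficient scores for four made-up feature names.
    scores = {"ip_proto": 0.0, "src_port": -0.42,
              "rx_packets": 1.3, "tx_bytes": 0.0}
    selected = select_features(scores)
    print(selected)

    sample = {"ip_proto": 17, "src_port": 40000,
              "rx_packets": 12, "tx_bytes": 0}
    print(project(sample, selected))
```

Applying the same projection to every incoming packet is what lets the controller skip the features that were scored 0 during model generation.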

Training result in SDN-enabled IoT
Table 3 reports the classification model training results with LR feature importance. Several models achieve perfect results, with average values close to 100% for accuracy, precision, recall, and F1-score. Among them, the GNB model was superior because it was also the fastest to train. The RF model produced the worst results, with 89% accuracy, 79% recall, and an 88% F1-score. GNB also trained faster than RF: it required about 0.031 seconds to train, while RF consumed 0.211 seconds. Table 4 shows that, with RFC feature importance, GNB again obtained perfect accuracy, precision, recall, and F1-score, with a training time of about 0.022 seconds, while the RF (RFC) model again produced the lowest results among the models.
In the RFR training test, the GNB model was again the best and fastest, as with the previous two methods, as illustrated in Table 5. The GNB model obtained an average of 100% for accuracy, precision, recall, and F1-score, with a training time of 0.013 seconds. In comparison, the RFC and DTC models produced 10% lower accuracy, 20% lower recall, and an 11% lower F1-score. Their training times also differed: DTC trained in 0.017 seconds, while RFC consumed 0.158 seconds. Based on these tests, it can be concluded that the GNB model is the most effective and fastest model for all three coefficient score calculation methods, because it works very well for large amounts of data and does not require a long training time.
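The short training time of GNB is easier to see from how the algorithm works: fitting only requires one pass over the data to compute a per-class prior and a per-feature mean and variance. The class below is a minimal standard-library sketch of Gaussian Naive Bayes, not the implementation used in the experiments; the class name `TinyGaussianNB`, the variance smoothing constant, and the two-feature toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes: features are treated as independent."""

    def __init__(self, eps: float = 1e-9):
        self.eps = eps      # small value added to variances to avoid zero
        self.priors = {}    # label -> class prior probability
        self.stats = {}     # label -> list of (mean, variance) per feature

    def fit(self, X, y):
        grouped = defaultdict(list)
        for row, label in zip(X, y):
            grouped[label].append(row)
        for label, rows in grouped.items():
            self.priors[label] = len(rows) / len(X)
            per_feature = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                per_feature.append((mean, var + self.eps))
            self.stats[label] = per_feature
        return self

    def predict_one(self, row):
        best_label, best_score = None, -math.inf
        for label, per_feature in self.stats.items():
            # Log prior plus the sum of per-feature Gaussian log-likelihoods.
            score = math.log(self.priors[label])
            for value, (mean, var) in zip(row, per_feature):
                score += (-0.5 * math.log(2 * math.pi * var)
                          - (value - mean) ** 2 / (2 * var))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy data with two made-up features; 0 = normal, 1 = LRDDoS.
    X = [[10, 100], [12, 110], [11, 95], [50, 900], [52, 880], [49, 910]]
    y = [0, 0, 0, 1, 1, 1]
    model = TinyGaussianNB().fit(X, y)
    print(model.predict_one([11, 105]), model.predict_one([51, 890]))
```

Training cost grows linearly with the number of samples and features, which is consistent with GNB being the fastest model to train in the results above.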

Classification result in SDN-enabled IoT network
Table 6 presents the results emulated in SDN-Enabled IoT using the LR feature importance, accumulated over three packet sending rates of 20, 50, and 70 pps. The highest accuracy, precision, recall, and F1-score were obtained by SVM Linear, DTC, MLP, GNB, and ADB, with an overall score of 100%. The other models, namely SVM RBF and KNN, obtained the lowest results. Table 7 shows the classification results using the RFC feature importance with the same attack data delivery speeds (20, 50, and 70 pps).
From the three Feature Importance methods, it could be concluded that SVM Linear, GNB, and ADB were the best algorithms because their accuracy, precision, and recall reached an average of 100%. The classification loss, however, differed for each delivery speed. The classification loss arose because the controller experienced overlapping data reception, so that part of the testing set was not delivered or the incoming data was received more than once. Receiving the same packet repeatedly would cause the classification value to increase because the number of distinct classified packets was less than the total test data sent. This pattern could occur because the Mininet-IoT emulator was unstable.

CONCLUSION
Feature Importance makes it possible to understand the relationship between the features and the target variable, and to identify which features are relevant to the model being built and which are not. In addition, during model training, the coefficient score becomes the basis for selecting features to reduce the model's dimensions and save resources. This improves the performance of the model and of the controller in carrying out the classification process. Based on the analysis of the test results, the GNB algorithm is the best model for classifying LRDDoS attacks because it achieves a short training time together with accuracy, recall, precision, and F1-score values in the range of 100% during model training. With the LR, RFC, and RFR Feature Importance methods, GNB has a training time of about 0.031 seconds, 0.022 seconds, and 0.013 seconds, respectively. Three models have dominant results in the classification test with SDN-Enabled IoT, namely ADB, SVM Linear, and GNB. Although the ADB and SVM Linear models also produce perfect results, when the tests without and with SDN-Enabled IoT are compared, the GNB model is superior in all aspects. This is possible because each selected feature is independent of the other features. In addition, the smaller amount of data remaining after feature selection reduces the complexity of the data used in the classification process. In future research, the authors plan to develop a dataset model that is more effective in handling availability cases while incorporating statistical techniques in the attack detection module.

Int
distributed denial of service attacks detection in software defined … (Muhammad Abizar) 1975


ISSN: 2252-8938 Int J Artif Intell, Vol. 12, No. 4, December 2023: 1974-1984

A packet detected as new would be processed by the controller directly for network learning purposes. Because the data sent by the attacker was composed of random source addresses, it could indirectly interfere with the controller's performance. If the controller could not withstand the load, this attack could collapse the SDN-Enabled IoT network.

Table 1. Feature list

Table 2. Selected features based on feature importance methods

Table 6. The results of LR in SDN-enabled IoT

Table 7 shows the impact of classification loss at the same delivery speeds (20, 50, and 70 pps). The accuracy increases when the loss value is high because fewer data are processed compared to the overall testing set. At the delivery rate of 70 pps, the classification loss exceeded 50%, which increased the accuracy value for all classification algorithms. The highest accuracy, precision, recall, and F1-score values were found in SVM Linear, DTC, GNB, and ADB, at 100% overall. In contrast, the lowest values were generated by SVM RBF, MLP, and KNN.

Table 7. The results of RFC in SDN-enabled IoT

The SDN-Enabled IoT results for RFR are shown in Table 8. Only RF and DT have different values, because RFR selected only two features. The accuracy, recall, and F1-score of the other models (SVM Linear, SVM RBF, MLP, GNB, ADB, and KNN) all had the same overall value of 100%.

Table 8. The results of RFR in SDN-enabled IoT