Toward a deep learning-based intrusion detection system for IoT against botnet attacks

Received Jun 9, 2020; Revised Dec 30, 2020; Accepted Feb 2, 2021

The massive network traffic exchanged between connected devices in the internet of things (IoT) poses a big challenge to many traditional intrusion detection systems (IDS) trying to find probable security breaches. Moreover, security attacks lean towards unpredictability. There are numerous difficulties in building an adaptable and powerful IDS for IoT that avoids false alerts and ensures high detection accuracy against attacks, especially with the rise of botnet attacks. These attacks can even turn harmless devices into zombies that send malicious traffic and disturb the network. In this paper, we propose a new IDS solution, named BotIDS, based on deep learning convolutional neural networks (CNN). The main interest of this work is to design, implement and test our IDS against some well-known botnet attacks using the Bot-IoT dataset. Compared to other deep learning techniques, such as simple RNN, LSTM and GRU, the results obtained by BotIDS are promising, with 99.94% validation accuracy, 0.58% validation loss, and a prediction execution time of less than 0.34 ms.


INTRODUCTION
Nowadays, an enormous number of objects are deployed around the world and are connected to each other and to the internet. They range from personal gadgets, wearables, sensors and actuators to home appliances and medical devices. Cisco estimated that around 75 billion devices would be connected to the Internet by 2025 [1]. The IoT is growing rapidly without adequate consideration of the significant security challenges it raises [2].
Most IoT security concerns resemble those of regular servers, workstations and smartphones; however, the IoT extends security considerably, to include mechanical security controls, hybrid frameworks, IoT-specific business processes, and edge devices [3]. Traditional security protection technology is limited in the face of zero-day attacks and vulnerabilities, and of future attacks whose nature changes continuously; setting up steady, reliable, and precise intrusion detection is therefore becoming mandatory for improving IoT security [4].
Botnets are networks of infected internet-connected devices (bots, or zombies). They are used to carry out distributed denial-of-service (DDoS) attacks, password cracking, key logging (data theft) and cryptocurrency mining, and they give the attacker the possibility of accessing the device using command and control (C&C) software [5]. In September 2016, the "Mirai" malware, an IoT botnet, took many sites offline, such as the cloud service provider "OVH" with nearly 1.1 Tbps of traffic, the website of computer security consultant Brian Krebs with 620 Gbps of traffic, and many other websites, like the dynamic DNS provider "Dyn" [6].
ISSN: 2252-8938 Toward a deep learning-based intrusion detection system for IoT against botnet attacks (Idriss Idrissi) 111
Detecting and preventing the different attacks is a big challenge. IDS using machine learning methods have gained a wide reputation [7]. An IDS is an essential component of the security mechanism; it is used to analyze and detect security breaches on a network [8]. IDS can be grouped into two categories by detection approach, "anomaly detection" and "misuse detection", or into three major families by deployment, "host-based", "network-based" and "hybrid". IDS inspect both network traffic and operating system activity to provide effective network protection. Numerous research works are trying to apply data mining and machine learning algorithms to cyber security, where pattern recognition and data mining algorithms are extensively applied to distinguish normal traffic from malicious traffic.

ARTIFICIAL NEURAL NETWORKS (ANN)
In recent years, one of the most productive and efficient subsets of artificial intelligence has been deep learning (DL), itself a subset of machine learning based on artificial neural networks (ANN) [9]: computing systems inspired by the biological brain, in which the machine learns from many training examples, allowing it to classify other examples [10]. DL is increasingly being used. It stacks many data processing layers into a hierarchical architecture to make a deep model. DL relies on its capacity to identify ideal features in raw data through successive nonlinear transformations, with every transformation achieving a higher level of complexity and abstraction [10]. It has been applied efficiently in many different research fields, from medical image processing, natural language processing, speech recognition, and signal recognition to many other domains of science, business and government. In all these fields, DL has shown tremendously promising results [11]. In the IoT security field, the machine trains on various collected and labeled attacks, as well as normal traffic, so that it can finally identify new, similar attacks.
Convolutional neural networks (CNN or ConvNets) are a class of deep learning models developed in 1998 by LeCun in the LeNet architecture [12]. Over the last two decades, CNNs have achieved great success. Like multilayer perceptron (MLP) networks [13], a CNN is composed of an input layer, many hidden layers in between, and an output layer, as shown in Figure 1. The best-known and most used layers are convolution, activation (ReLU), and pooling [14]. The convolutional layer is the most important one: it takes a convolution kernel, also called a mask or a filter, passes it over the data (usually images) and transforms the data based on the values of the filter, as shown in Figure 1. It then calculates the feature map values using formula (1), where the input data is represented by $f$ and the kernel by $h$, and $m$ and $n$ are respectively the row and column indexes of the resultant matrix [15].

$$G[m, n] = (f * h)[m, n] = \sum_{j} \sum_{k} h[j, k]\, f[m - j, n - k] \qquad (1)$$
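As an illustration of the feature-map computation, the sketch below evaluates a discrete 2D convolution in plain Python. It assumes that formula (1) has the usual form G[m, n] = Σ_j Σ_k h[j, k] · f[m − j, n − k], with f the input and h the kernel, and that values of f outside the input count as zero. The function name `conv2d_full` and the toy data are ours and are not part of BotIDS.

```python
def conv2d_full(f, h):
    """Full 2D convolution: G[m][n] = sum_j sum_k h[j][k] * f[m-j][n-k].

    Assumption: entries of f outside its bounds count as zero, so the
    result has (rows_f + rows_h - 1) x (cols_f + cols_h - 1) entries.
    """
    rows_f, cols_f = len(f), len(f[0])
    rows_h, cols_h = len(h), len(h[0])
    out = [[0.0] * (cols_f + cols_h - 1) for _ in range(rows_f + rows_h - 1)]
    for m in range(len(out)):
        for n in range(len(out[0])):
            total = 0.0
            for j in range(rows_h):
                for k in range(cols_h):
                    r, c = m - j, n - k
                    # Skip positions that fall outside the input.
                    if 0 <= r < rows_f and 0 <= c < cols_f:
                        total += h[j][k] * f[r][c]
            out[m][n] = total
    return out


# A 1 x 4 "traffic row" convolved with a 1 x 2 kernel (toy values).
feature_map = conv2d_full([[1.0, 2.0, 3.0, 4.0]], [[1.0, -1.0]])
print(feature_map)
```

Deep learning libraries typically compute the unflipped variant of this sum (cross-correlation); since the kernel values are learned, the difference does not matter in practice.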
The pooling layer performs progressive downsampling, reducing the size of the succeeding layers through max pooling or average pooling, which helps limit overfitting. Max pooling divides the input into non-overlapping clusters and selects the maximum value of each cluster in the previous layer [16]. In contrast to traditional feature selection algorithms, a CNN can learn better features automatically and categorize the traffic. In addition, it can achieve better classification and learn additional features with more traffic data because it shares the same convolution matrix (kernel), which significantly decreases the number of parameters and the amount of computation needed for training. This gives CNN a fast recognition of the nature of an attack, in contrast to other deep learning or machine learning algorithms, which can overfit on massive data. Moreover, the literature shows that using CNNs in the intrusion detection field gives better results than other algorithms [17][18].
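The max pooling step described above can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions: the helper name `max_pool_2d` is hypothetical, and trailing rows or columns that do not fill a whole cluster are simply dropped.

```python
def max_pool_2d(x, pool_rows, pool_cols):
    """Max pooling over non-overlapping pool_rows x pool_cols clusters.

    Trailing rows/columns that do not fill a whole cluster are dropped
    (an assumption; libraries expose this choice as a padding option).
    """
    out = []
    for r in range(0, len(x) - pool_rows + 1, pool_rows):
        row = []
        for c in range(0, len(x[0]) - pool_cols + 1, pool_cols):
            cluster = [x[r + i][c + j]
                       for i in range(pool_rows) for j in range(pool_cols)]
            row.append(max(cluster))
        out.append(row)
    return out


pooled = max_pool_2d([[1, 5, 2, 0],
                      [3, 4, 7, 1]], 2, 2)
print(pooled)
```

Each 2 x 2 cluster is reduced to a single value, so the 2 x 4 input shrinks to 1 x 2.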
The fully connected layer works on a flattened input where each of these inputs is connected to all neurons. The activation function of a node describes its output given one input or a set of inputs. The rectified linear unit (ReLU) activation, shown in Figure 3(a), is applied to all elements of the volume. It aims at introducing non-linearities into the network.
Recurrent neural networks (RNN) are a class of deep neural networks that contain feedback connections, as shown in Figure 2. They are derived from feedforward neural networks (FNN) but, unlike FNN, they have loops (bidirectional data flow) and memories to remember previous computations, as shown in Figure 4 [19], allowing preceding outputs to be used as inputs while having hidden states [20]. For each timestep $t$, the activation $a^{<t>}$ and the output $y^{<t>}$ are expressed as:

$$a^{<t>} = g_1(W_{aa}\, a^{<t-1>} + W_{ax}\, x^{<t>} + b_a), \qquad y^{<t>} = g_2(W_{ya}\, a^{<t>} + b_y) \qquad (2)$$

where $W_{ax}$, $W_{aa}$, $W_{ya}$, $b_a$, $b_y$ are coefficients that are shared temporally and $g_1$, $g_2$ are activation functions.
RNN can face the long-term dependency problem and the vanishing and exploding gradient problems, so we cite here the best-known RNN variants designed to solve these problems, the long short-term memory (LSTM) and gated recurrent unit (GRU) networks. The main difference from simple RNN is that the nonlinear units in the hidden layers are replaced by memory blocks.
LSTM is composed of memory blocks that are a set of recurrently connected subnetworks. These blocks are composed of one or more self-connected memory cells, as shown in Figure 4, which offer a memory to remember the previous data, and three units called gates: an input gate (3.a), a forget gate (3.b) and an output gate (3.c), which provide continuous equivalents of write, reset and read operations, respectively [21]. These gates rely on the sigmoid, shown in Figure 3(b), and tanh, shown in Figure 3(c), activation functions, meaning that their output is a value between 0 and 1 for sigmoid, and between -1 and 1 for tanh. The following formula (3) represents the gates in LSTM [22].

$$i_t = \sigma(w_i [h_{t-1}, x_t] + b_i) \qquad (3.a)$$
$$f_t = \sigma(w_f [h_{t-1}, x_t] + b_f) \qquad (3.b)$$
$$o_t = \sigma(w_o [h_{t-1}, x_t] + b_o) \qquad (3.c)$$

where $i_t$, $f_t$ and $o_t$ are the gates ("i" for the input gate, "f" for the forget gate, and "o" for the output gate); $\sigma$ is the sigmoid function; $w_x$ are the weights of gate (x); $b_x$ are the biases of gate (x); $h_{t-1}$ is the output of the preceding LSTM block; $x_t$ is the current input.
Gated recurrent unit (GRU) is a simplified version of LSTM, in which gating units modulate the flow of information inside the unit, as shown in Figure 4, without separate memory cells [23][24]. It merges the forget and input gates into an "update gate" and also merges the cell and hidden states, so GRU has fewer parameters than LSTM. It is defined by the following formulas.

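For reference, one common formulation of the GRU is given below. It is a sketch under the assumption that the gates follow the same conventions as the LSTM gates (sigmoid $\sigma$, previous output $h_{t-1}$, current input $x_t$); the symbols $z_t$ (update gate), $r_t$ (reset gate), $\tilde{h}_t$ (candidate state), $w$ and $b$ are our notation, and $\odot$ denotes the element-wise product.

```latex
\begin{align}
z_t &= \sigma(w_z [h_{t-1}, x_t] + b_z) \\
r_t &= \sigma(w_r [h_{t-1}, x_t] + b_r) \\
\tilde{h}_t &= \tanh(w_h [r_t \odot h_{t-1}, x_t] + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}
```

Some references swap the roles of $z_t$ and $1 - z_t$ in the last line; the two conventions are equivalent up to relabeling the update gate.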
RELATED WORKS
Koroniotis et al. [26] applied support vector machine (SVM), LSTM and RNN to evaluate the Bot-IoT dataset. The authors focused on binary classification: for every type of attack, the prediction was either "normal traffic" or "attack". This is impractical for a working IDS, which would need one model per attack type, whereas a multiclass output (numerous categories of attacks) requires one and only one model.
Ibitoye et al. [27] compared the performance of two deep learning models, self-normalizing networks (SNN) and feedforward neural networks (FNN), in the context of an IoT network. This comparison shows that FNN outperforms SNN, even if SNN remains better with regard to adversarial samples. The authors also examined the impact of feature normalization on adversarial robustness and demonstrated that it weakens resistance to adversarial attacks.
Ferrag et al. [28] conducted a comparative study on two datasets, Bot-IoT and CSE-CIC-IDS2018, using several deep learning approaches, such as RNN, CNN, Boltzmann machines, deep belief networks (DBN), deep Boltzmann machines (DBM), deep autoencoders and deep discriminative models, with 100 hidden layers, and obtained an accuracy of 98.394%.
Mengmeng [29] proposed an intelligent binary and multiclass classification using FNN, but with only a few classes, reaching 99% in all evaluation measures (accuracy, precision, recall and F1 score) for DDoS/DoS attacks, while the normal traffic classification reached an accuracy of 98%.
AlKadi [30] proposed a system named mixture localization-based outliers (MLO) on the Bot-IoT dataset. It uses Gaussian mixture models to fit the network data and a local outlier factor function to discover abnormal patterns in network traffic data, and it obtained an accuracy of 97.98%.
In fact, convolutional neural networks (CNN or ConvNets) are a class of deep neural networks used in many fields, but mostly in pattern recognition. A CNN uses convolution and pooling layers instead of fully connected hidden layers [31].
The related works listed above can provide a good prediction of botnet attacks that can affect an IoT system. However, these works could not recognize the type of attack because of the binary classification. Hence, our study constitutes an important experimental extension of the above-mentioned works: it benchmarks the Bot-IoT dataset using CNN, compared to different deep learning models, with a multiclass classification corresponding to the various categories of attacks in the IoT.

PROPOSED METHOD
BotIDS is our proposed network IDS, obtained by learning from the Bot-IoT dataset. The solution is intended to be placed in a fog node when implemented in a real IoT environment. Such a deployment gives it the power to analyze inbound and outbound traffic in real time by sniffing the network, as shown in Figure 5. This location enables BotIDS to monitor all traffic to and from the devices, both inside and outside the network, and particularly the traffic between inside devices in case a zombie device is lurking inside the network. BotIDS is based on a deep learning model built in three phases, as shown in Figure 6:
− 1st Phase: Dataset preprocessing. First of all, we need to alter the raw data and normalize its values, with the goal of getting the best performance from the deep learning model, and then convert it into an image shape.
− 2nd Phase: Building the model. The model is first fit on a training dataset (one part of the dataset); its parameters are adjusted during the training process to reach better performance. Then the test dataset (the remaining part of the dataset) is used to validate the accuracy of the model.
− 3rd Phase: Evaluating the model by prediction. After building and generating the model, we evaluate it with the test dataset by predicting attacks and calculating the time needed for this prediction.

Figure 6. Model building process (flowchart nodes: the raw dataset, training dataset, testing dataset, adjusting the parameters, testing the model; decision branches: yes/no)

Int J Artif Intell

Dataset preprocessing
To conduct the proposed work, we used the latest Bot-IoT dataset [32], which was created specifically for IoT systems in a realistic network environment at the Cyber Range Lab of UNSW Canberra Cyber. The environment incorporates a combination of normal and malicious traffic, with four categories of attacks and 10 subcategories, namely reconnaissance (service scanning and OS fingerprinting), DDoS (TCP, UDP and HTTP), DoS (TCP, UDP and HTTP), and theft (key logging and data exfiltration).
The dataset contains about 72 million records of traffic from the simulated IoT environment. The whole data was scaled down to 5%, giving a "Full Features" dataset of around 3.6 million records; another version, called "10 best features", is also provided, with a selection of the best features from the "Full Features" version. Both versions are used in our experiment. The training and test datasets have 11 output classes, which reflect the normal traffic and the 10 types of attacks carried out against the IoT network.
The Bot-IoT dataset contains network connection attributes of three kinds: nominal, numeric and IP addresses. We convert the nominal data to numeric data, and IPv4 and IPv6 addresses are also converted to numerical form. We merge the category and subcategory fields into one field with 11 values (the 10 types of attacks plus normal traffic), and then convert this new category attribute using one-hot encoding. We drop the binary "Attack" field because our focus is on a multiclass output and not a binary one.
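The encoding steps can be illustrated with the Python standard library alone. The following is a minimal sketch, not the preprocessing code used for BotIDS: the record layout, the sample addresses, the label format `category_subcategory` and the helper names `ip_to_int` and `one_hot` are all hypothetical.

```python
import ipaddress


def ip_to_int(address):
    """Map an IPv4 or IPv6 address string to its integer value."""
    return int(ipaddress.ip_address(address))


def one_hot(labels):
    """One-hot encode string labels; columns follow sorted label order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    vectors = [[1 if index[label] == i else 0 for i in range(len(classes))]
               for label in labels]
    return classes, vectors


# Hypothetical records: (source address, category, subcategory).
records = [
    ("192.168.100.3", "DDoS", "TCP"),
    ("192.168.100.7", "Normal", "Normal"),
    ("fe80::1", "Theft", "Keylogging"),
]
numeric_ips = [ip_to_int(addr) for addr, _, _ in records]
# Merge category and subcategory into a single class label.
merged = [cat if cat == sub else f"{cat}_{sub}" for _, cat, sub in records]
classes, encoded = one_hot(merged)
print(numeric_ips[0], classes, encoded)
```

IPv6 addresses map to integers of up to 128 bits, so they may need rescaling before being stored in fixed-width numeric columns.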
After encoding the data, we normalize it using Scikit-learn, that is, we scale each sample vector individually to unit norm, and we convert the normalized output into an image data shape.
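A plain-Python sketch of this step is shown below. It assumes the L2 (Euclidean) norm, which is the default of scikit-learn's `normalize` function, and an image shape of 1 x n_features x 1 per sample, which is our guess based on the (1, 10) kernel size used later; the helper names are hypothetical.

```python
import math


def l2_normalize(row):
    """Scale one feature vector to unit Euclidean norm (zero rows unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return list(row) if norm == 0 else [v / norm for v in row]


def to_image_shape(rows):
    """Reshape (samples, features) into (samples, 1, features, 1)."""
    return [[[[v] for v in row]] for row in rows]


normalized = [l2_normalize(r) for r in [[3.0, 4.0], [0.0, 0.0]]]
images = to_image_shape(normalized)
print(normalized, images[0])
```

Normalizing each row separately means that every sample ends up with the same vector length, regardless of the absolute traffic volumes it records.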
Then we split the data, first into data X (all the features except the "category" feature) and label Y (the "category" feature), and then into random training and testing subsets, with 75% of the data in the training set and 25% in the testing set. Figure 7 shows the number of data rows for each set and each attack type.
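The random 75%/25% split can be sketched as follows. This is an illustrative stand-in using only the standard library; the function name `train_test_split_75_25`, the seed and the toy data are assumptions, and library helpers for this task offer more options (for example stratified sampling).

```python
import random


def train_test_split_75_25(features, labels, seed=0):
    """Shuffle sample indices and split them 75% / 25%."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * 0.75)
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([features[i] for i in train_idx], [features[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


X = [[float(i)] for i in range(8)]
Y = [i % 2 for i in range(8)]
x_train, x_test, y_train, y_test = train_test_split_75_25(X, Y)
print(len(x_train), len(x_test))
```

Shuffling indices rather than the data itself keeps each feature row paired with its label.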

Building our models
The CNN models were defined to have an input layer with a number of neurons equal to the number of input features, four hidden layers (Convolution2D, MaxPooling2D, Flatten and Dense) and an output layer. For the best features dataset, the model was trained for 10 epochs (the whole dataset is passed through the neural network 10 times) with a batch size of 32 (32 training samples in a single batch) and a kernel size of (1, 10). The neural network comprises 16 input neurons (in the first layer, the same number as the features), 4 intermediate (hidden) layers with 32 (Convolution2D), 32 (MaxPooling2D), 480 (Flatten) and 22 (Dense) neurons, and 11 output neurons for the multiclass classification, as shown in Figure 8. For the full-feature dataset, the model was trained for 15 epochs (batch size of 32 and kernel size of (1, 10)) and had a 40-neuron input layer, the same number and arrangement of hidden layers as the best features model, and 11 output neurons for the multiclass classification. In both cases, the activation function used for the input and hidden layers was 'ReLU', while the output layer activation function was 'Softmax', as shown in Figure 8 and Table 1.
The other models, based on recurrent neural networks (RNN), were defined to have one input layer with a number of neurons equal to the number of input features, four hidden layers and an output layer. For the best features dataset, the model was trained for 40 epochs with a batch size of 32. The neural network comprised 16 input neurons (in the first layer, the same number as the features), 4 intermediate (hidden) SimpleRNN/LSTM/GRU layers with 32, 64, 128 and 22 neurons, and 11 output neurons for the multiclass classification, as shown in Figure 9. For the full-feature dataset, the model was trained for 40 epochs (batch size of 32) and had a 40-neuron input layer, the same number and arrangement of hidden layers as the best features model, and 11 output neurons for the multiclass classification. In both cases, the output layer activation function was 'Softmax', as shown in Figure 9 and Table 1.
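The layer sizes quoted for the CNN can be cross-checked with simple shape arithmetic. The sketch below is our own reconstruction, not the authors' code: it assumes a 1 x n_features x 1 input, a convolution that pads its input so the width is preserved, and a (1, 2) max pooling window moved one position at a time. Neither the padding nor the pooling window is stated in the text; they are simply one configuration that reproduces the 480 values of the Flatten layer for 16 features.

```python
def cnn_layer_sizes(n_features, filters=32, kernel_width=10,
                    dense_units=22, n_classes=11):
    """Output sizes and parameter counts for a 1 x n_features x 1 input.

    Assumptions (not stated in the text): the convolution pads its input
    so the width is preserved, and max pooling uses a (1, 2) window
    moved one position at a time.
    """
    conv_width = n_features                  # width kept by padding
    conv_params = kernel_width * filters + filters
    pool_width = conv_width - 2 + 1          # window 2, stride 1
    flat = pool_width * filters
    dense_params = flat * dense_units + dense_units
    out_params = dense_units * n_classes + n_classes
    return {
        "conv": (1, conv_width, filters),
        "pool": (1, pool_width, filters),
        "flatten": flat,
        "params": conv_params + dense_params + out_params,
    }


sizes = cnn_layer_sizes(16)
print(sizes)
```

Because the same 1 x 10 kernel weights are reused at every position, the convolution itself contributes only a few hundred parameters; most of the weights sit in the Dense layer that follows the Flatten layer.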

Evaluating the model
In Figures 10-11 (generated by TensorBoard) and Table 2, we present the training accuracy, training loss, validation accuracy and validation loss, along with the training time, of all models trained over several epochs. For each version of the dataset (full features and best features), we consider the number of epochs at which the model reaches its best results. We initially tested each model with a batch size of 128, which gave good timing for all models but poor performance; a smaller batch size of 32 records leads to improved results at the cost of a higher computation time.
As shown in Figure 10 and Table 2, the CNN models for the two dataset versions, "Full Features" and "Best Features", reach accuracies of 99.430% and 99.935% respectively, in just 1395s and 823s (15 and 10 epochs respectively). Compared to our CNN, GRU is a little more accurate, but it took more time to train (12412s and 5818s in 40 and 20 epochs). The GRU training time is almost twice that of simple RNN for the "Full Features" version (6089s in 40 epochs), and almost equal to it for the "Best Features" version (5708s in 40 epochs). On the other hand, LSTM models reach their best accuracy in a time between those of simple RNN and GRU (10627s and 8302s in 30 and 15 epochs). For the loss, as shown in Figure 11 and Table 2, CNN models always reach the best values in both dataset versions, "Full Features" and "Best Features", with 0.582% and 1.663% respectively, compared to the RNN models (between 1.7% and 2.9%). By examining these results, it is clear that the CNN model using "Full Features" is the best model, given its higher accuracy (0.9993) and its best time performance when considering the whole time needed to obtain these results. In addition, when comparing the resulting metrics of all models on the two dataset versions, we clearly see that the "Full Features" version performs better than the "Best Features" version. This means that deep learning algorithms can achieve better classification and learn additional features with additional traffic data, as already explained in the previous section, at the cost of more computation time and energy, because there is more data to train on.
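To make the two reported metrics concrete, the sketch below computes accuracy and loss from one-hot labels and predicted class probabilities. The text does not name the loss function; the sketch assumes categorical cross-entropy, the usual companion of a softmax output layer, and the sample values are invented for illustration.

```python
import math


def accuracy(y_true, y_prob):
    """Share of samples whose most probable class matches the one-hot label."""
    hits = sum(t.index(1) == p.index(max(p)) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(max(p[t.index(1)], eps))
              for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


# Toy one-hot labels and softmax-like outputs for three classes.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_prob = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1],
          [0.2, 0.2, 0.6], [0.6, 0.3, 0.1]]
print(accuracy(y_true, y_prob), categorical_cross_entropy(y_true, y_prob))
```

Accuracy only counts whether the top class is right, while the loss also penalizes low confidence in the correct class, which is why the two metrics can rank models differently.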
In Figure 12, we compare the time that a prediction takes for all the considered models (how much time the IDS would take to identify an attack), which could be another important metric for implementing the IDS in a real IoT environment. We conclude that CNN has not only the best accuracy and the lowest loss, but also the lowest time to predict an attack, a fraction of a millisecond (0.34ms on average). The other models take two to three times as long as CNN to detect an attack (0.78ms for simple RNN, 1.03ms for GRU, and 1.08ms for LSTM). The amount of data in the dataset is considerably unbalanced across the different types of attacks. For example, DoS (HTTP), DDoS (HTTP), key logging, and data exfiltration attacks have a very small amount of data in the training and test sets, as shown in Figure 7. Therefore, the model has a limited capability to learn these types of attacks accurately. The average detection rate for these attacks is one of the main factors restricting the overall detection accuracy.

Table 2 (training time columns). Training time of each model, given as total, per epoch and per step:
− CNN, "Full Features": Total: 1395s, Step: 47μs
− CNN, "Best Features": Total: 823s, Epoch: 82s, Step: 30μs
− Simple RNN, "Full Features": Total: 6089s, Epoch: 153s, Step: 52μs
− Simple RNN, "Best Features": Total: 5708s, Epoch: 142s, Step: 52μs
− LSTM, "Full Features": Total: 10627s, Epoch: 354s, Step: 121μs
− LSTM, "Best Features": Total: 8302s, Epoch: 332s, Step: 121μs
− GRU, "Full Features": Total: 12412s, Epoch: 310s, Step: 106μs
− GRU, "Best Features": Total: 5818s, Epoch: 291s, Step: 106μs

Figure 12. Prediction timing comparison
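A per-prediction timing of this kind can be measured as sketched below. The measurement procedure is an assumption on our part, since the text does not describe how the timings were taken; `dummy_predict` is a hypothetical stand-in for a trained model's prediction call, and the sample data are synthetic.

```python
import time


def mean_prediction_ms(predict, samples, repeats=5):
    """Average wall-clock time of predict(sample), in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for sample in samples:
            predict(sample)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(samples))


def dummy_predict(sample):
    """Hypothetical stand-in for a trained model: index of the largest value."""
    return max(range(len(sample)), key=sample.__getitem__)


# 100 synthetic samples with 16 features each.
samples = [[0.1 * ((i + j) % 7) for j in range(16)] for i in range(100)]
print(f"{mean_prediction_ms(dummy_predict, samples):.4f} ms per prediction")
```

Averaging over many samples and several repeats smooths out timer resolution and scheduling noise, which matters when a single prediction takes well under a millisecond.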

CONCLUSION
The number of IoT objects deployed around the world is growing steadily, which makes their security a big challenge. Our work focused on intrusion detection systems for IoT: we built four deep learning models and compared their ability to detect various types of IoT network attacks, usually carried out by botnets. After several experiments, we obtained a reasonable detection rate with all four models. By analyzing the results, we concluded that CNN is the best one for intrusion detection systems: it successfully identified different types of attacks and showed the highest accuracy (99.94%) in comparison with the other DL algorithms, simple RNN, LSTM and GRU, together with the lowest loss rate (0.58%) and a better performance in terms of prediction time. As future work, we will apply the CNN model to real network traffic data and reinforce it using deep transfer learning, by adding other characteristics from other datasets or from diverse firewalls, logs, and IDS servers. In addition, we will try to use self-supervised learning to generate a powerful and up-to-date model, and then implement an autonomous intrusion detection system.