Machine learning-based intrusion detection system for detecting web attacks

ABSTRACT


INTRODUCTION
Recent technologies such as artificial intelligence, big data, and the internet of things have made our lives heavily dependent on the internet. Alongside this, however, anomalous behaviour is becoming an increasingly important concern [1], [2]. Detecting anomalous network activity is a critical cybersecurity task that is receiving more and more attention, especially as we have come to rely more and more on computers and smartphones in recent years [3]-[5]. Due to the global pandemic, our daily lives are shifting to the internet, which makes security issues more complicated than before. To detect abnormal activities in a computer or network, there is a special security device called a network intrusion detection system (NIDS) [6], [7].
ISSN: 2252-8938. Int J Artif Intell, Vol. 13, No. 1, March 2024: 711-721
Intrusion detection systems (IDS) are security tools that aim to defend a system, perform countermeasures, or generate alerts so that a facility can take appropriate action when an attack occurs [8]. An IDS may be a software or hardware system designed to detect malicious actions on computer systems to enable the maintenance of system security [9]. The main purpose of an IDS is to detect various types of abnormal network traffic that cannot be detected by a simple traditional firewall [10]. This is critical to achieve solid protection against malicious acts that compromise the availability, integrity, or confidentiality of computer systems [11], [12]. These systems can also be targeted at different areas. Some IDS play an extended firewall role and detect attacks as they enter the network; others monitor the network internally to intercept intruders or even collect network-wide information for central analysis. Most of these systems have a similar structure and set of components. There are two main types of intrusion detection systems [13]: host intrusion detection systems (HIDS) and network intrusion detection systems (NIDS). A HIDS monitors the characteristics of a single host and the events occurring on that host to detect suspicious activity. A NIDS monitors and analyses network traffic for specific network segments or devices.
Four main approaches are used for intrusion detection [13]: signature-based IDS, anomaly-based IDS, hybrid-based IDS, and protocol-based IDS. Signature-based IDS detect malicious host and network activities based on known malicious patterns or sequences. Anomaly-based IDS detect abnormal system behaviour. They create a profile of normal activity; if observed activity deviates from this profile beyond a predefined threshold, it is considered an intrusion. Hybrid-based IDS combine the two approaches above (signature-based and anomaly-based IDS) to avoid their disadvantages and integrate their advantages. Protocol-based IDS monitor the protocols used by the system, analysing their state and dynamic behaviour and enforcing the legal use of the protocol.
Most of these types of IDS are still used in their traditional form, and the cost of generating an appropriate signature for each attack is a considerable motivation for the use of learning-based approaches such as machine learning (ML) algorithms. These algorithms are progressing rapidly [14], [15], and are being followed with great interest in the field of cyber security [16]. ML techniques can automatically learn to make decisions based on existing data, which is a very valuable advantage for monitoring computing environments.
In the field of intrusion detection, two types of ML algorithms are generally used: supervised classification for misuse detection and unsupervised outlier/novelty classification for anomaly detection [17]. ML algorithms are increasingly being used to improve IDS and make them more efficient. A large number of research papers on intrusion detection using ML techniques have been published in the literature. For example, the authors of [18], [19] used support vector machine (SVM) to find anomalies in the knowledge discovery in database (KDD) dataset. Stein et al. [20] constructed an IDS model with artificial neural networks (ANN) based on the same dataset. The authors of [21], [22] proposed the use of decision trees and random forest. In addition, a hybrid approach combining two or more ML algorithms was presented in [23]. For more information on intrusion detection systems using ML methods, the reader is strongly advised to read the reviews provided in [24]-[26].
However, some recent papers find that conventional ML algorithms still perform poorly compared to other alternatives. The reason may simply be that the model parameters are not set appropriately or not set at all. Most algorithms expose many parameters whose values can be adjusted to improve model performance and select the best model. This article examines the optimisation of conventional ML algorithms using hyperparameter tuning to achieve good results based on the values available in each algorithm.
RELATED WORK
This section analyses various research papers that use ML algorithms and exploit their performance for intrusion detection. The focus here is on studies that use the Canadian Institute for Cybersecurity intrusion detection systems dataset released in 2017 (CIC-IDS-2017). This dataset was proposed by Sharafaldin et al. [27] in 2017. They tested it with seven algorithms, namely random forest (RF), ID3, k-nearest neighbours (KNN), multilayer perceptron (MLP), AdaBoost, Naive Bayes (NB), and quadratic discriminant analysis (QDA). They report that the KNN, RF, and ID3 algorithms give the best results, with 98% accuracy, compared to the other algorithms.
Vijayanand et al. [28] proposed an IDS that uses a genetic algorithm (GA) for variable selection and a support vector machine (SVM) for classification. In this paper, classification is based on a combination of several SVMs, each designed to detect a specific type of attack. The same algorithm is used by Aksu et al. [29] alongside other algorithms such as k-nearest neighbours (KNN) and decision tree (DT). The authors of this paper use the Fisher score method for variable selection and achieved detection rates of 99.70%, 57.76%, and 99.00% for KNN, SVM, and DT classifications, respectively, using the denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks of CIC-IDS-2017 for testing.
Bansal [30] describes a method called data dimensionality reduction (DDR) and uses conditional tree (CTree) [8], extreme gradient boosting (XGBoost) [31], SVM, and artificial neural network (ANN) algorithms. Among these estimators, XGBoost was the most efficient, with an accuracy of 98.93%. The authors of this paper used the entire dataset except for the normal traffic provided on Monday.
Boukhamla and Gaviro [32] use principal component analysis (PCA) to reduce the overall size of the CIC-IDS-2017 dataset. This work was applied to Thursday and Friday data, targeting various attacks such as DDoS, web attacks, port scans, infiltration, and botnets. KNN, Naive Bayes (NB), and C4.5 are the algorithms used for classification. DDoS attacks are perfectly detected by Naive Bayes and KNN with a very high detection rate. However, Naive Bayes has a high false alarm rate, whereas KNN does not. In particular, the number of variables was reduced by about 75% of the total number of variables.
Machine learning-based intrusion detection system for detecting web attacks (Fatimetou Abdou Vadhil)
Ustebay et al. [33] propose a hybrid IDS that combines three algorithms, including reduced error pruning (REP) tree and random forest. They claim that the experimental results of this IDS demonstrate good performance in terms of detection rate, false alarm rate, accuracy, and duration compared to existing systems. The accuracy achieved in this proposal is 96.66%.
Hou et al. [34] presented an ML approach based on DDoS attack detection using netflow traffic analyzer (NTA). It mainly used four algorithms, namely AdaBoost, C4.5, support vector machine, and random forest, against the data collected by netflow. This approach is then evaluated against the object dataset. Based on the results of the experiment, 97.4% accuracy was achieved using this approach.
Bansal and Kaur [35] proposed an intrusion detection approach using the XGBoost algorithm. This approach uses DoS/DDoS data from CIC-IDS-2017. The work is completed with 99.54% accuracy.
Aksu et al. [29] proposed an IDS using the Fisher score algorithm to select variables for normal and DDoS traffic. The algorithms used for attack classification are SVM, KNN, and DT. In this work, the evaluation showed that KNN performed best with 30 variables selected, while SVM failed with 80 and 30 variables. After applying the Fisher score, the amount of data was reduced by 60%. As a result, the KNN and DT models each scored 99%, and SVM scored 57% with 30 variables selected.
Alrowaily et al. [36] investigated the efficiency of seven ML algorithms: RF, NB, DT, AdaBoost, MLP, quadratic discriminant analysis (QDA), and k-nearest neighbours (KNN). The results confirm the superiority of KNN over the other selected algorithms on various performance evaluation metrics, with 99% accuracy. However, KNN was the only selected algorithm that did not train within an acceptable time frame.
Thapa et al. [37] propose an ensemble model that combines ML and deep learning (DL) algorithms to achieve high performance metrics. They compared their models using the CIC-IDS-2017 dataset. In this article, the significance of variables was studied using classification and regression tree (CART), and classification was performed using a convolutional neural network (CNN). The accuracy obtained in this article is 99%.
Maseer et al. [38] review previous studies on ML and DL based IDS using a set of criteria with different datasets. In this article, 10 common supervised and unsupervised ML algorithms are evaluated. The supervised algorithms used are ANN, DT, KNN, NB, RF, SVM, and CNN, while the unsupervised algorithms include k-means, expectation-maximisation (EM), and self-organising maps (SOM). The best results were obtained with KNN (accuracy = 99.52%), DT (accuracy = 99.49%), and CNN (accuracy = 99.47%), as these models have higher recognition performance compared to other models.
The variables that can be used for designing and implementing an effective ML based IDS are analysed in [39]. The selected variables are applied to different ML methods to test their efficiency. This research is conducted on the CIC-IDS-2017 dataset using 30% of the data and 100% of the data from Wednesday. The best result is obtained with the random forest, which achieves an accuracy of 99.9% and a false positive rate (FPR) of 0.02%.
A systematic approach to decision support in the selection of algorithms for the design of an IDS is presented in [40]. The authors of this paper used the CIC-IDS-2017 dataset and selected 51 variables using the mean decrease in impurity (MDI) technique. They then evaluated the recognition performance of eight algorithms. The decision tree, random forest, and multi-layer perceptron algorithms achieved 99% accuracy.
Elmrabit et al. [41] evaluated twelve ML algorithms in terms of their ability to detect abnormal behaviour in network practice. The evaluation is performed on the CIC-IDS-2017, UNSW-NB15, and industrial control system (ICS) cyberattacks datasets. The results of the evaluation show that the random forest algorithm performs better in terms of accuracy, precision, recall, F1 score, and receiver operating characteristic (ROC) curve for the three datasets used.
In terms of web attacks, Goryunov et al. [42] presented a study that includes the analysis and evaluation of different classifiers such as decision tree, random forest, AdaBoost, and logistic regression. The evaluation was done using CIC-IDS-2017, more specifically its web attacks (brute force, cross-site scripting (XSS), and structured query language injection (SQLi)). Moreover, only the top 10 features were selected, which is why the training time was very low in this study.

DATASET
The type of data is crucial in terms of quality and quantity for any ML problem, as a large amount of relevant data is required to train ML algorithms well. The outcome of any classifier depends on the data it is trained on, so the quality of the data helps to achieve good results. Unfortunately, such quality datasets are also expensive and difficult to produce. In the field of intrusion detection, two free datasets are particularly popular despite some shortcomings: KDD Cup 99 and NSL-KDD. Three other recent datasets that address some of the shortcomings of earlier datasets are CIC-IDS-2017, CIC-IDS2018, and LITNET-2020. Based on some reviews and studies [27], [43]-[45], we decided to use CIC-IDS-2017. This dataset was created by the Canadian Institute for Cybersecurity at the University of New Brunswick and generated by Sharafaldin et al. [27] in 2017. It is fully labelled and contains 78 features and seven main attack classes, including web attacks, portscan, botnet, heartbleed, DoS/DDoS, and infiltration. In this paper, we use the data collected under web attacks, which consists of SQL injection, XSS, and brute force.
- SQL injection: This is a vulnerability in an application where the attacker interferes with an application's queries to the database to allow unauthorised users to access the data.
- XSS: This attack occurs when the attacker injects malicious code into the victim's web application.
- Brute force: This attack tries a number of possible passwords to crack the administrator's password.
We focus on web attacks because, despite their importance, they are rarely the subject of research compared to other types of attacks. For example, many researchers focus on DoS/DDoS attacks. This could also be the reason why most datasets do not include this type of attack. Therefore, it is a matter of the popularity of certain attacks over others.

METHODOLOGY
This section discusses the methodology used in this paper. After selecting the dataset, the next and most important step is to prepare it. This step is called pre-processing and includes analysis, processing, encoding, and normalisation. The CIC-IDS-2017 dataset is analysed and used to train five supervised ML algorithms: decision tree (DT) [46], random forest (RF) [47], logistic regression (LR) [48], Gaussian Naive Bayes (GNB) [49], and AdaBoost or adaptive boosting (AB) [50]. The models implemented with these algorithms are optimised by tuning their hyperparameters. This optimisation is performed using the GridSearchCV (grid search with cross-validation) technique.

Data acquisition
The first step of the proposed methodology is to import the dataset. The data used contains 170,231 records. Table 1 shows the number of records by attack and the distribution between training and test data.

Data cleaning and analysis
The selected data (the "web attacks" file recorded on Thursday) contains 170,231 samples. In this section, we analyse this data to identify potential problems and possible errors such as missing values and duplicate columns. While searching for these values, we found that this dataset contains 270 invalid values, which are either not a number (NaN) or infinity. In addition, the Fwd header length column appears twice. All values that are NaN or infinity are removed, and the duplicate column is also removed. It is also important to check the class balance in the response feature (or target), as this is very important for most ML algorithms. We therefore analysed the feature in question and found that the classes are highly unbalanced.
The majority class is the normal traffic class, which accounts for 168,051 of the records. The rest is distributed among the other classes: 1,507 for brute force, 652 for XSS, and 21 for SQL injection. This problem of imbalance can be solved by oversampling, followed by cleaning of possible overlap points.
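The cleaning steps described above can be sketched as follows. This is a minimal illustration with pandas, not the authors' actual code: the helper names `drop_duplicate_columns` and `clean_flows`, the column names, and the four-row toy frame are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first of any columns that hold identical values."""
    kept = []
    for name in df.columns:
        if not any(df[name].equals(df[other]) for other in kept):
            kept.append(name)
    return df[kept]


def clean_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicate columns, then rows with NaN or infinite numeric values."""
    df = drop_duplicate_columns(df)
    numeric = df.select_dtypes(include="number").to_numpy(dtype=float)
    # np.isfinite is False for NaN, +inf and -inf.
    finite_rows = np.isfinite(numeric).all(axis=1)
    return df.loc[finite_rows].reset_index(drop=True)


# Toy stand-in for the real CSV; every value here is made up.
toy = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.inf, 30.0, np.nan],
    "Fwd Header Length": [20, 40, 60, 80],
    "Fwd Header Length.1": [20, 40, 60, 80],  # duplicated column
    "Label": ["BENIGN", "BENIGN", "XSS", "BENIGN"],
})

cleaned = clean_flows(toy)
class_counts = cleaned["Label"].value_counts()  # quick class-balance check
```

Counting the labels after cleaning, as in the last line, is one simple way to expose the kind of class imbalance reported for this dataset.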

Encoding
To train an ML model, the raw data must be prepared in a form that can be understood by the model. For this reason, converting the data to numeric form is an essential phase. In the proposed method, the one-hot encoding scheme is used to convert non-numeric variables into vectors. One-hot encoding is the most widely used encoding method [51]. It converts categorical values into vectors with minimal processing. It is a defensive feature of certain techniques. It is also the most common method for handling such multiclass classification tasks. As shown in Table 2, the categorical classes are converted to vectors.
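As a minimal sketch of one-hot encoding, the pure-Python function below maps each class label to a binary vector. The function name `one_hot_encode`, the label spellings, and the alphabetical ordering of classes are assumptions for illustration; the paper's Table 2 may order the classes differently.

```python
def one_hot_encode(labels):
    """Return the sorted class list and one one-hot vector per input label."""
    classes = sorted(set(labels))
    position = {name: i for i, name in enumerate(classes)}
    vectors = []
    for label in labels:
        row = [0] * len(classes)
        row[position[label]] = 1  # exactly one active position per sample
        vectors.append(row)
    return classes, vectors


labels = ["Benign", "Brute Force", "XSS", "SQL Injection", "Benign"]
classes, encoded = one_hot_encode(labels)
# classes -> ['Benign', 'Brute Force', 'SQL Injection', 'XSS']
# encoded[0] -> [1, 0, 0, 0]
```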

Feature scaling
Feature scaling is a technique often used in data preparation to facilitate its use by ML algorithms. The purpose of this scaling is to bring the values of the numerical features in the dataset to a common scale, while preserving the relative differences between the values of each individual feature. Thus, by scaling the features, the range of the independent features is normalised. In data preprocessing, scaling is often done using one of two methods: normalisation or standardisation. We use a normalisation method known as min-max. It is the simplest method and rescales each feature to the range [0, 1]. The general normalisation is given in (1), where $x$ is an original feature value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that feature, and $x'$ is the normalised value.

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
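The min-max rescaling to [0, 1] can be illustrated with a short pure-Python sketch. The function name `min_max_scale` and the handling of a constant column (mapped to zeros to avoid division by zero) are choices made for this example, not details taken from the paper.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no range; map it to zeros by convention.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


scaled = min_max_scale([2.0, 4.0, 10.0])
# scaled -> [0.0, 0.25, 1.0]
```

In practice the minimum and maximum would be computed on the training split only and reused for the test split, so that no information from the test data leaks into preprocessing.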

Classification and hyper-parameters tuning
In this phase, the data is submitted to the decision tree (DT), random forest (RF), logistic regression (LR), Gaussian Naive Bayes (GNB), and adaptive boosting (AB) algorithms for training. To obtain the best model parameters, we used the GridSearchCV technique. This is a method for selecting the best model from a family of models parameterised by a grid of parameters, i.e. every combination of parameter values in the grid is evaluated. In addition, GridSearchCV performs cross-validation. Table 3 shows the details of the values used for each model. For each algorithm, the best values selected by GridSearchCV are used to build the model.

Algorithm 1: Hyperparameter tuning algorithm used in this research for decision tree, random forest, and AdaBoost.
a. x ← 0
b. hyper parameter ← 1
c. while x ≤ 314 do
d.     Perform a 5-fold cross-validation on CIC-IDS-2017 train set
e.     Record the average 5-fold cross-validation accuracy
f.     Increment hyper parameter by 1
g.     Increment x by 1
h. Choose the best hyper parameter which gives the highest average GridSearchCV accuracy (in step e)

Algorithm 1 shows the process of hyperparameter tuning for the models. This algorithm is used with GridSearchCV to select the best model. An overview of the proposed methodology is presented in Figure 1.
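A grid search with 5-fold cross-validation can be sketched with scikit-learn's `GridSearchCV` as shown below. This is an illustrative setup only: the synthetic data from `make_classification` stands in for the CIC-IDS-2017 training set, and the grid values are assumptions rather than the values listed in Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid: every combination of these values is evaluated.
param_grid = {
    "max_depth": [2, 4, 6],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",  # average accuracy over the folds decides the winner
    cv=5,                # 5-fold cross-validation
)
search.fit(X, y)

best_params = search.best_params_   # combination with the highest mean CV accuracy
best_score = search.best_score_     # that mean accuracy
best_model = search.best_estimator_  # refitted on all of X, y with best_params
```

Because every combination is cross-validated, the cost grows with the product of the grid sizes (here 3 x 2 = 6 candidates, each fitted 5 times), which is consistent with computation time depending on the number of hyperparameters and values in the grid.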

RESULTS AND DISCUSSION
The effectiveness of ML techniques has led researchers to use them in topical areas such as IoT [52], [53], online social networks [54], DoS and DDoS detection [55]-[60], and distributed reflection denial of service (DRDoS) [61]. However, obtaining good results depends on the techniques used and the way in which the models are created. Generally speaking, the results of ML models can be improved in several ways, for example by adjusting the model's hyperparameters as proposed in this article or by selecting the most relevant features [62]-[68]. Feature selection, however, applies within the framework of traditional models [69]-[74]; the concept does not carry over to DL models [75]-[80], where the selection of the most relevant features is done automatically in most cases [81], so other methods must be sought if such a model needs optimisation. In this section, we present our experimental results using different evaluation metrics for the various ML techniques based on the proposed methodology. Afterward, we provide a comparison with some related works and discuss the performance of the proposed model. The models are implemented on a laptop with the following characteristics: an i7-9750H CPU, 8 GB of memory, a 1 TB hard disk, and 64-bit Windows 10.

Evaluation metrics
After setting up the models, it is time to measure the performance by going through an evaluation stage. Each model needs to be tested to confirm its reliability based on four possible outputs: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The models used in this work were evaluated based on accuracy, precision, recall, F1 score, FPR, and training and prediction time, where:
- TP: True positives are events that are correctly identified as abnormal.
- FP: False positives are legal events that are incorrectly identified as abnormal.
- TN: True negatives are incidents that are correctly identified as legal activities.
- FN: False negatives can be defined as possible intrusive activity that the IDS passes through as normal activity.
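The metrics named above can be computed from the four counts using their standard textbook definitions. The sketch below assumes those standard definitions (the paper's own formulas are not reproduced in this section), and the function name `ids_metrics` and the example counts are hypothetical.

```python
def ids_metrics(tp, fp, tn, fn):
    """Compute common IDS evaluation metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # FPR: share of legal events that were wrongly flagged as abnormal.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fpr,
    }


# Made-up counts for illustration only.
metrics = ids_metrics(tp=90, fp=5, tn=895, fn=10)
```

For multiclass problems such as the four-class web attack task, these quantities are usually computed per class (treating that class as positive) and then averaged.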

Experimental results
The best results are shared by the ensemble method (accuracy and precision), AdaBoost (recall and FPR), and DT (F1 score). On the other hand, GNB was trained in a very reasonable time; the same is true for the prediction time of LR. The computation time depends on the number of hyperparameters and the set of values defined for each hyperparameter in the grid search. Table 4 presents the evaluation results to show how the models performed.

Table 4. Performance evaluation of models on CIC-IDS-2017 (web attacks)

In order not to be limited to the previously used evaluation metrics, we also evaluated the models with other metrics such as ROC and calibration curves, as shown in Figure 2. Other algorithms such as gradient boosting, support vector machine, and k-nearest neighbours were tested, but since we use the GridSearchCV technique, which

Comparison with similar research
The models implemented in this work were compared with some recent works. The works were selected because they use the CIC-IDS-2017 dataset and one or more of the five algorithms DT, RF, LR, GNB, and AB. Table 5 shows our results compared to those of other works. The results seem to be in the middle range of previous research. At this stage, there are some important factors to consider: (1) some of this research uses feature selection, which tends to improve the results; and (2) some of it uses the whole dataset, which also helps to improve the model, especially in the case of binary classification.

Discussion
The results presented in this paper were obtained without a selection of variables. Thus, our models may be able to provide much better results by selecting the most relevant variables. On the other hand, we have to mention the imbalance of the CIC-IDS-2017 dataset. This phenomenon sometimes affects the results of the models in such a way that most of the learning is done on the majority class of the dataset, which means that the detection of minority samples is weak. Some researchers believe that binary classification can partially solve the imbalance problem. However, in this case, it is no longer possible to know in detail the positive detection for each attack. Finally, data balancing can solve this problem. There are a number of methods, such as random oversampling, synthetic minority over-sampling technique (SMOTE), SMOTE-ENN, and SMOTE-Tomek.
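Of the balancing methods listed, random oversampling is the simplest to illustrate. The pure-Python sketch below duplicates randomly chosen minority samples until every class matches the majority count; the function name `random_oversample`, the seed, and the toy data are assumptions, and this is not a result or procedure reported in the paper. SMOTE and its variants instead synthesise new points and are available in third-party libraries.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: four benign flows and one XSS flow (values are made up).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["Benign", "Benign", "Benign", "Benign", "XSS"]
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

Oversampling should be applied to the training split only, so that duplicated samples do not leak into the test data.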

CONCLUSION
An efficient intrusion detection system must be able to detect any kind of abnormal behaviour with high accuracy and a low false alarm rate. In this work, we have used five ML algorithms and their ensemble approach to detect web attacks (SQL injection, brute force, and XSS) in the CIC-IDS-2017 dataset by applying k-fold cross-validation with k = 5. We used GridSearchCV for hyperparameter tuning to achieve better performance of the models. The resulting models provide high accuracy, precision, recall, and F1 score while maintaining a low FPR. According to the results obtained, conventional classifiers can give competitive performance without any additions or adjustments. Thus, the research questions stated in section 2 can be answered. However, the results could be even better if the most important variables in the dataset had been selected. In the future, we will focus on models that are optimised using feature selection. We will also explore other optimisation methods.

Table 1. Data distribution

Table 2. One-hot encoding

Table 5. Comparison of the results obtained with similar research results