Hyperparameters optimization XGBoost for network intrusion detection using CSE-CIC-IDS 2018 dataset

ABSTRACT


INTRODUCTION
As the internet becomes more widely used, an increasing number of computers are being networked. However, with the rapid advancement of digital technology, network data traffic has become vulnerable to numerous security risks and potential breaches. A vast amount of information is transmitted across intricate network connections worldwide. Consequently, the establishment of secure information systems has garnered significant attention from both private and governmental institutions aiming to thwart potential attackers [1]. Network attacks have emerged as one of the major threats facing contemporary society [2]. Intrusions refer to actions aimed at bypassing the security measures of computer systems [3]. The complexity and challenge of intrusion detection in heterogeneous networks are heavily influenced by the variety of devices, protocols, and services employed. As a result, the network's intricacy increases, making it arduous to identify and detect intrusions [4]. To address this pressing need for stronger protection, intrusion detection has become crucial in monitoring computer systems and networks and analyzing events for signs of potential intrusions [5]. The intrusion detection system (IDS) is a dynamic monitoring system that assesses system activity in a given environment to determine whether it constitutes an attack [6], [7]. IDSs can be categorized into two types based on their detection methods: signature-based and anomaly-based. A signature-based IDS functions similarly to antivirus software by identifying known attack patterns or signatures. Although it has high accuracy and a low rate of false positives for known attacks, it lacks the ability to detect new attacks. An anomaly-based IDS, in contrast, generally has a higher false-positive rate, and building its model requires a reliable training dataset.
An anomaly-based IDS is a cybersecurity solution that monitors network traffic, system activity, or user behavior to detect deviations from established baselines. It detects odd patterns and suspicious actions using machine learning and statistical methodologies, potentially identifying unforeseen risks. However, it may generate false positives and requires ongoing calibration to reduce such warnings [8]-[12]. A network intrusion detection system (NIDS) is a critical component in identifying ongoing attacks by differentiating normal network traffic from malicious traffic [13]. NIDS is essential in resolving security issues by monitoring network traffic for potential signs of suspicious activity and detecting security vulnerabilities, such as infiltration, abuse, and anomalies, in the data extracted from network traffic [14].
Traditional intrusion detection systems often use machine learning-based methods, and many different machine learning algorithms have been developed and are available for use [15]-[19]. Widely accepted and used machine learning algorithms include: 1) logistic regression (LR), 2) decision tree, 3) random forest, 4) extreme gradient boosting (XGBoost), and 5) k-nearest neighbor. These algorithms are classification models that predict the target class of data samples in classification problems. The main problem is the necessity of identifying fresh, complex attack patterns that traditional intrusion detection systems frequently overlook and which pose a serious risk to network security. To avoid overtaxing security staff and ineffective incident response, it is critical to address the issue of a high false-positive rate concurrently. The main research issue is therefore to develop an intrusion detection system that can accurately identify new attack patterns while also lowering false alarms. To achieve this goal, the IDS's accuracy and responsiveness in fending off developing cyber threats must be improved. To do this, enhanced anomaly detection techniques, machine learning algorithms, and the integration of real-time threat intelligence must be investigated.
The study's goal is to create an effective intrusion detection system capable of distinguishing between normal network traffic and malicious activity and of quickly detecting potential security weaknesses such as infiltration, misuse, and anomalies. Hyperparameter optimization is the process of fine-tuning the input parameters, or hyperparameters, that affect a machine learning algorithm's training phase and model. Because of the scale of the challenge, hyperparameter adjustment is frequently required in machine learning jobs [16]. As a result, tuning hyperparameters is critical for achieving peak performance in machine learning tasks. This research focuses on intrusion detection, which is critical in monitoring computer systems and networks for signals of potential security breaches. We are specifically concerned with the issues posed by intrusion detection systems in heterogeneous networks, where the broad assortment of devices, protocols, and services makes reliably identifying and detecting intrusions extremely challenging.

BACKGROUND AND RELATED WORKS
Because of the complexity and diversity of today's networks, intrusion detection is a difficult task. Traditional intrusion detection systems frequently employ machine learning-based methods, but the number of available algorithms, as well as the requirement for effective hyperparameter tuning, pose considerable obstacles. The key research question addressed in this paper is: how can we create an effective IDS that can consistently discriminate between normal network traffic and hostile activities while rapidly detecting potential security flaws? Our primary goal is to develop and deploy an IDS that uses machine learning techniques to accurately classify network traffic. In addition, we intend to undertake a thorough performance evaluation of five different machine learning algorithms (LR, decision tree, random forest, XGBoost, and k-nearest neighbor) in the context of intrusion detection. Each machine learning algorithm has its own set of benefits and drawbacks, which are discussed below. LR is a simple and uncomplicated technique that works effectively when the connection between features and the target variable is linear. It delivers probabilities for binary classification tasks, has a low computing overhead, and can quickly train on large datasets. LR, on the other hand, has difficulty capturing complex correlations between features and the target variable, and it tends to underperform when the data is not linearly separable. It is also susceptible to outliers and multicollinearity. LR must be modified using one-vs-all or one-vs-one techniques to be applicable in settings involving several classes [20].
The simplicity of the decision tree, which provides unambiguous decision rules that are easy to comprehend and follow, is one of its many advantages. It is capable of handling both numerical and categorical data without the need for considerable data preprocessing. It is resilient to outliers and does not presume a certain data distribution because it is non-parametric. Furthermore, decision trees can recognize non-linear correlations in data. However, there are some disadvantages to consider. Decision trees have a tendency to overfit the training data, especially as the tree grows in depth. They are extremely sensitive to slight changes in the data, resulting in unstable trees. Biased trees may be formed if some classes dominate the data, and they may be limited in their capacity to generalize well to unknown data, particularly in complicated datasets [21].
Random forest has various advantages over individual decision trees, including enhanced accuracy due to reduced overfitting through ensemble approaches. Because it aggregates several trees, it is resistant to outliers and noisy data, and it can handle high-dimensional data with numerous characteristics. Random forest also allows for the ranking of feature relevance, which aids feature selection. There are, however, certain disadvantages to consider. Compared to individual decision trees, it has a higher level of complexity and processing expense. Random forest is opaque, making it difficult to understand the basis behind its forecasts. Furthermore, its prediction time is longer than that of single decision trees, and settings such as the number of trees must be tuned for improved results [20], [22].
K-nearest neighbor (KNN) has a number of advantages, including its simplicity and intuitive nature, which make it easy to apply. It is well suited for capturing non-linear relationships in data because it is a non-parametric technique. KNN has no explicit training phase; it simply memorizes the training data points. It is effective with small to medium-sized datasets. However, there are some disadvantages to consider. Because distances to all data points must be computed, the testing process can be computationally expensive. The distance metric chosen can have a considerable impact on its performance. KNN is extremely sensitive to the presence of irrelevant features, which can result in bad outcomes. To avoid biased results, proper data standardization is essential [23].
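For concreteness, the following is a minimal sketch of how these four baseline classifiers can be instantiated with scikit-learn; the settings shown are library defaults (plus a raised iteration cap for LR), not the tuned values used in the experiments.

```python
# Minimal instantiation of the four baseline classifiers discussed above.
# Settings are scikit-learn defaults (with a raised LR iteration cap),
# not the tuned values used in the experiments.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

baselines = {
    "LR": LogisticRegression(max_iter=1000),             # linear decision boundary
    "Decision tree": DecisionTreeClassifier(),           # interpretable, can overfit
    "Random forest": RandomForestClassifier(n_estimators=100),  # bagged tree ensemble
    "KNN": KNeighborsClassifier(n_neighbors=5),          # distance-based, no training phase
}
```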
Extreme gradient boosting, or XGBoost, is a supervised learning technique that belongs to the family of gradient-boosted decision tree (GBDT) machine learning algorithms. It was created by Chen and Guestrin [24] for classification and regression problems. XGBoost is trained by iteratively adding base learners in the form of decision trees to an ensemble while minimizing a regularized objective function. In XGBoost, the objective is to minimize the difference between the predicted values $\hat{y}_i^{(t-1)}$ and the actual values $y_i$ using a loss function. To do this, the algorithm adds decision trees $f_t$ to the ensemble iteratively, with each tree decreasing the difference in predictions from the preceding iteration. To prevent overfitting, a regularization term penalizes the complexity of the added trees. XGBoost is a powerful algorithm that is gaining increasing attention due to its speed and accuracy [25]. It is excellent at handling missing data, provides insight into the significance of features, and integrates regularization to avoid overfitting. It also scales effectively through parallel processing, making it appropriate for big datasets. XGBoost has a number of advantages, including superior prediction performance due to its boosting method, which concentrates on misclassified data points. It uses regularization techniques to avoid overfitting and is capable of processing a wide range of data types, including numerical and categorical data. Because of its parallelizable and efficient implementation, XGBoost is appropriate for huge datasets. There are, however, certain disadvantages to consider. If the hyperparameters are not appropriately set, it may be prone to overfitting. Furthermore, XGBoost can be difficult to interpret, especially when dealing with a large number of trees. Compared to other algorithms, training time can be longer, and attaining optimal results requires careful parameter optimization [2], [22].
Hyperparameters are parameters set before learning that are not part of the model itself. Proper tuning of hyperparameters is critical for enhancing model performance and minimizing loss. The values of the hyperparameters dictate the model parameters, and the purpose of hyperparameter tuning is to discover the values that lead to optimal model performance and superior outputs. The regularization and construction of XGBoost are highly influenced by hyperparameters such as the learning rate, the ensemble size, and the maximum depth of the base learners. The goal of hyperparameter optimization in HO-XGB is to minimize the objective function in (1). Optimization of hyperparameters is an important task in automated machine learning since it improves model performance. However, careful tuning of XGBoost's many hyperparameters is required, which calls for a thorough knowledge of the method. Tuning can be computationally and memory-intensive, especially when working with large datasets or deep trees. As a black-box model, XGBoost is also somewhat less interpretable than simpler models. Managing unbalanced datasets effectively may also call for extra methods or settings.
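Equation (1) did not survive into this text; assuming the paper follows the standard formulation of Chen and Guestrin [24], the regularized objective minimized at boosting iteration $t$ is:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2} \tag{1}$$

where $l$ is the loss function, $\hat{y}_i^{(t-1)}$ is the prediction after iteration $t-1$, $f_t$ is the tree added at step $t$, $T$ is its number of leaves, $w$ its leaf weights, and $\gamma$, $\lambda$ the regularization coefficients.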
Several hyperparameters are adjusted in this study, including learning_rate, subsample, max_leaves, max_depth, gamma, colsample_bytree, min_child_weight, n_estimators, and reg_alpha. These variables have a significant impact on the model architecture used by XGBoost [24], [26]. Although XGBoost has many hyperparameters, this study focuses only on those listed in Table 1, which have been shown to significantly impact model performance in previous studies. The hyperparameters subsample, learning_rate, max_leaves, gamma, max_depth, colsample_bytree, and min_child_weight are tuned in this study, while the remaining hyperparameters are set to their default values in Python [27]. The gamma parameter is preferred over the min_child_weight parameter because it regulates the complexity resulting from the loss itself, rather than the loss derivative from the Hessian weight. The objective is to fit the parameters to the data without overfitting, that is, without tuning the algorithm to the extent that it identifies characteristics that are only relevant to the present data. Each parameter has its own potential for causing problems. Prior to running XGBoost, three types of parameters must be established: general parameters, booster parameters, and learning task parameters. General parameters specify the type of booster being used for boosting, such as a tree or linear model. Booster parameters are specific to the chosen booster and determine its behavior during training. Finally, learning task parameters determine the type of learning scenario, such as regression or ranking, and several sets of parameters may be necessary. We rigorously fine-tune the hyperparameters of these machine learning algorithms to obtain optimal intrusion detection capabilities and further improve the IDS's effectiveness. Finally, our research aims to contribute significantly to improving the overall security of computer systems and networks by effectively identifying and mitigating potential security threats such as infiltration, misuse, and anomalies.
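As a sketch, the tuned subset can be passed to XGBClassifier as follows; the numeric values below are illustrative placeholders, not the settings reported in Table 3, and all omitted hyperparameters keep their defaults as stated above.

```python
# Sketch: configuring XGBoost with the hyperparameters tuned in this study.
# Numeric values are illustrative placeholders, not the Table 3 settings;
# every omitted hyperparameter keeps its Python default, as in the paper.
from xgboost import XGBClassifier

model = XGBClassifier(
    objective="binary:logistic",  # benign vs. attack
    learning_rate=0.1,       # step-size shrinkage per boosting round
    subsample=0.8,           # fraction of instances sampled per tree
    max_leaves=0,            # 0 means no explicit leaf limit
    max_depth=6,             # maximum tree depth
    gamma=0.1,               # minimum loss reduction to split further
    colsample_bytree=0.8,    # fraction of features sampled per tree
    min_child_weight=1,      # minimum instance weight in a leaf
    n_estimators=100,        # number of boosting rounds
    reg_alpha=0.0,           # L1 regularization on leaf weights
)
```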

METHODS
The Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick has developed a dataset called CSE-CIC-IDS-2018 [28], [29]. The dataset contains 16,233,002 records of network traffic collected from a significant network comprising attack and victim workstations, encompassing both normal and malicious activities [28], [29]. It covers a range of attack categories such as denial of service (DoS), distributed denial of service (DDoS), reconnaissance, penetration, and botnet activities, and it is publicly available in formats such as CSV files, serving as a valuable resource for researchers and practitioners to create and test intrusion detection algorithms in a controlled lab setting. DoS attacks, brute force attacks, botnet attacks, DDoS attacks, web attacks, and infiltrations are among the 14 different attack types and six different infiltration scenarios included in the dataset. The attacks include brute force-web, botnet, Secure Shell (SSH) brute force, DDoS with High Orbit Ion Cannon (HOIC), DDoS with Low Orbit Ion Cannon (LOIC) over user datagram protocol (UDP) and HTTP, structured query language (SQL) injection, brute force cross-site scripting (XSS), DoS GoldenEye, DoS Hulk, DoS slow HTTP test, infiltration, and DoS Slowloris. The dataset has been widely used to develop IDS using machine learning techniques and is now a standard benchmark for anomaly-based IDS implementations. The study's focus is on the analysis and pre-processing of the CSE-CIC-IDS-2018 dataset, which consists of ten raw data files with 16 million unique network flows representing various types of attacks. These files were combined to form a single dataset during the integration stage.
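The integration stage can be sketched as follows with pandas; the directory and glob pattern are hypothetical placeholders for wherever the ten raw CSV files reside.

```python
# Sketch of the integration stage: merging the ten raw CSV files into a
# single dataset. Directory and glob pattern are hypothetical placeholders.
import glob
import pandas as pd

files = sorted(glob.glob("cse-cic-ids2018/*.csv"))
df = pd.concat((pd.read_csv(f, low_memory=False) for f in files),
               ignore_index=True)
print(df.shape)  # expected on the order of 16 million rows x 80 columns
```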
An examination of the abstract and an initial literature review revealed a shortcoming in existing approaches, and this shortcoming inspired the design of the model represented in Figure 1. It illustrates the implemented HO-XGB algorithm, starting with data pre-processing, exploratory data analysis, and data preparation in step 1. For detailed information, please refer to the experiment design sections (A) and (B). After constructing the dataset, machine learning techniques, including XGBoost and traditional methods, are applied following a train-test split procedure to assess their predictive performance. The XGBoost model is trained using the input data, and hyperparameters are fine-tuned to optimize the algorithm configuration for the dataset. Finally, a performance evaluation is conducted to assess the effectiveness of the model. The proposed methodology in the development phase is validated, and system functionality is evaluated by comparing the classification results with those obtained on the same dataset. In this paper, the labeled network flows are divided into two categories for analysis: attacks and benign. The benign category includes all traffic classified as normal, while the attacks category includes all traffic classified as anomalous. The dataset is split into 70.1% benign traffic and 29.9% anomaly traffic.
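A minimal sketch of the labeling and train-test split steps follows; the "Benign" label value matches the raw dataset, while the 80/20 split ratio is an assumption, since the paper does not state it explicitly.

```python
# Sketch of binarizing the labels and splitting the data. The 80/20 ratio
# is an assumption; stratification preserves the 70.1% benign / 29.9%
# anomaly class balance in both subsets.
from sklearn.model_selection import train_test_split

df["Label"] = (df["Label"] != "Benign").astype(int)  # 0 = benign, 1 = attack

X = df.drop(columns=["Label"])
y = df["Label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=25
)
```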

Figure 1. The flow chart of the proposed model

EXPERIMENTAL SETUP
The system utilized in this study was a 64-bit macOS Ventura machine with the following specifications: an eight-core Intel Xeon W processor running at 3.2 GHz and 32 GB of 2666 MHz DDR4 memory. A Python 3.11 environment was used, and the implementation and evaluation of the proposed model were carried out using the NumPy [30], pandas [19], and scikit-learn [31] packages. Data handling, pre-processing, and analysis were performed using the pandas and NumPy libraries, while scikit-learn was used for model training, evaluation, and evaluation metrics. Data visualization was carried out using the Seaborn and Matplotlib libraries. The following subsections provide more detailed information.

Data pre-processing
"Data pre-processing" in machine learning refers to the process of preparing the original data for use with machine learning (ML) algorithms.This involves tasks such as data cleaning, feature scaling (to standardize the range of data, particularly when there is a large variation between values among different features to avoid bias from outliers), and feature engineering.Categorical variables were encoded using onehot encoding to convert them into binary representations.
After removing variables consisting entirely of missing values from the original dataset, we imputed the remaining variables' missing values. We describe the experiment settings, including how the dataset was split, how class imbalance issues were addressed, and how the machine learning classifiers were implemented. All dataset processing steps are fully documented here.
To make the training less sensitive to feature scaling and to avoid similar issues when applying the model to the test dataset, the MinMaxScaler from the sklearn preprocessing package was used to scale the data to a range of 0 to 1, so that each numerical feature lies in the range 0.0 to 1.0. The following steps were applied:
- Eight fields containing only "NaN" values, namely "Bwd PSH Flags", "Bwd URG Flags", "Fwd URG Flags", "Fwd Byts/b Avg", "CWE Flag Count", "Fwd Pkts/b Avg", "Bwd Pkts/b Avg", and "Bwd Byts/b Avg", were removed from each instance.
- Negative values in fields such as "Init Win bytes forward" and "Init Win bytes backward" were disregarded.
- The Protocol field was eliminated as largely redundant, given that the Dst Port (destination port) field primarily implies the same protocol value for each destination port value.
- Two columns ('Flow Byts/s' and 'Flow Pkts/s') contained infinity values, which were set to the maximum value in the column.
- The 'Timestamp' column was removed to prevent the learners from differentiating attack predictions based on time, particularly when dealing with covert attacks.
- Attacks were assigned numerical values, now represented in the 'Label' column.
The dataset's feature count drops from 80 to 69 after pre-processing. Both training and testing subsets are then built from this refined data. With fewer features, the dataset is more streamlined, with unnecessary data removed and efficiency improved. The separate training and testing subsets enable robust model training and evaluation, guaranteeing that the model's predictive skills are properly assessed. This procedure, which concentrates on the most important attributes and eliminates unnecessary ones, optimizes the accuracy and efficiency of the model, making it suitable for practical implementation in real-world circumstances.
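A sketch of these pre-processing steps follows, under the assumption that the column names match the raw CSV headers of CSE-CIC-IDS-2018.

```python
# Sketch of the pre-processing steps described above; column names are
# assumed to match the raw CSV headers of CSE-CIC-IDS-2018.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Drop the all-NaN flag/bulk columns and the Timestamp column.
drop_cols = ["Bwd PSH Flags", "Bwd URG Flags", "Fwd URG Flags",
             "Fwd Byts/b Avg", "CWE Flag Count", "Fwd Pkts/b Avg",
             "Bwd Pkts/b Avg", "Bwd Byts/b Avg", "Timestamp"]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

# Replace infinity values with the maximum finite value in the column.
for col in ("Flow Byts/s", "Flow Pkts/s"):
    finite_max = df.loc[np.isfinite(df[col]), col].max()
    df[col] = df[col].replace([np.inf, -np.inf], finite_max)

# Scale every numerical feature to the range 0.0-1.0.
num_cols = df.select_dtypes(include="number").columns.drop("Label")
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```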

Performance evaluation criteria
In this section, the evaluation criteria for assessing the performance of the proposed IDS are outlined. Various metrics such as accuracy (ACC), false alarm rate (FAR), and detection rate (DR) [7] were used to evaluate the IDS. The confusion matrix (CM) was used to determine the number of connections correctly identified by the classifier as anomalies, the true positives (TP), and as normal, the true negatives (TN). In addition to TP and TN, the CM also includes false negatives (FN) and false positives (FP) to characterize the classification result. FN is the number of anomalous connections that the classifier improperly labeled as normal, whereas FP is the number of normal connections that the classifier incorrectly tagged as anomalies. The true positive rate (TPR), also known as recall or sensitivity, is calculated as in (2).
The sensitivity metric may be misleading because FP and TN are excluded from its calculation, especially when dealing with unbalanced class distributions. Similarly, the true negative rate (TNR), also known as specificity, may offer false insights on unbalanced data classes by ignoring FN and TP. The F1-score, which takes into account both recall and precision, proves to be a better evaluation tool for a more thorough analysis. It accounts for both false positives and false negatives, providing a comprehensive evaluation of a model's performance that is less prone to distortion when the class distribution is asymmetric.
Various performance indicators, including accuracy, precision, recall, and F1-score, are used to assess the success of machine learning classification models. The F1-score combines recall and precision into a single score; like accuracy, it summarizes a model's overall performance, but by balancing the two quantities it is a particularly valuable performance measure for machine learning models.
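The formulas themselves were lost in extraction; the standard definitions these metrics refer to are:

$$\mathrm{TPR} = \frac{TP}{TP+FN}, \qquad \mathrm{TNR} = \frac{TN}{TN+FP}, \qquad \mathrm{FAR} = \frac{FP}{FP+TN} \tag{2}$$

$$\mathrm{ACC} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad \mathrm{Precision} = \frac{TP}{TP+FP}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$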

RESULTS AND DISCUSSIONS
The Pearson correlation coefficient was used to find and exclude strongly linked variables. Our study found 13 features with a correlation over 0.7, including 'Dst Port', 'Flow Byts/s', 'Fwd URG Flags', 'FIN Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', 'URG Flag Cnt', 'CWE Flag Count', 'Init Fwd Win Byts', 'Init Bwd Win Byts', 'Fwd Seg Size Min', 'Active Std', and 'Idle Std'. To compare our model with other successful machine learning methods previously trained on this dataset, we employed the logistic regression, decision tree, random forest, k-nearest neighbors, and XGBoost classifiers. We also used hyperparameter-optimized XGBoost to outperform traditional machine learning techniques. We evaluated the algorithms' performance using accuracy, precision, recall, and F1-score and generated performance indicators using the classification report function of the scikit-learn module in Python. Table 2 presents the performance obtained from the traditional machine learning classifiers; according to the table, the XGBoost classifier had the strongest impact on the dataset, achieving a 99.84% success rate on the CSE-CIC-IDS2018 dataset. In this experiment, we compared the performance of the XGBClassifier with traditional machine learning techniques on the dataset. The results are presented in the form of the receiver operating characteristic (ROC) curve, with the classification performance curve depicted in Figure 2. XGBoost was identified as the most effective approach based on these results, and subsequent experiments involved fine-tuning its parameters to obtain peak performance. The proposed XGBClassifier outperformed traditional methods, achieving a ROC score of approximately 0.999926. The performance is considered optimal when the ROC curve approaches the upper left corner.
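A sketch of the Pearson-correlation filter follows, assuming the 0.7 threshold applies to absolute pairwise correlations and that X_train and X_test come from the split described earlier.

```python
# Sketch of the Pearson correlation filter used to exclude strongly linked
# features; assumes the 0.7 threshold is on absolute pairwise correlation.
import numpy as np

corr = X_train.corr(method="pearson").abs()

# Keep only the upper triangle so each feature pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]

X_train = X_train.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)
```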
Table 3 shows the hyperparameter setups for the XGBoost-based model. The hyperparameters it includes are learning_rate, n_estimators, max_depth, min_child_weight, gamma, subsample, colsample_bytree, and reg_alpha. Each row in the table represents a separate parameter, while the columns reflect the different runs or configurations (HO-XGB1 through HO-XGB7). These hyperparameters regulate key aspects of the model's behavior, such as the learning-rate step size, the number of boosting rounds, the depth of the trees, instance weight thresholds, regularization, and the ratios of feature and sample subsampling. To optimize the model's performance and successfully adapt it to diverse machine learning tasks, hyperparameter tuning is necessary; this is done by experimenting with various combinations of these variables. This research highlights the significance of picking optimal combinations of hyperparameters rather than painstakingly assessing each parameter individually. Several hyperparameter tuning strategies were studied in order to determine the most successful ones. The fixed settings random_state=25, nthread=4, scale_pos_weight=1, seed=27, and objective='binary:logistic' were also used. The results showed that the XGBoost model's complexity was heavily constrained: with a high gamma, it was difficult to grow trees without pruning because the loss-reduction threshold was not met. The closeness of the train and test datasets contributed to this problem. Notably, changing the parameters, particularly the max_depth and min_child_weight values, resulted in significant improvements in effectiveness. Based on these findings, the proposed model, HO-XGB1, performed admirably in solving the network intrusion detection problem, as demonstrated by a ROC score of 0.999991 and an F1 score of 1.0, as shown in Table 4. The experimental results show that HO-XGB1 outperforms the other parameter settings, effectively optimizing XGBoost's hyperparameters for intrusion detection on the CSE-CIC-IDS-2018 dataset. Our ultimate goal is to create and deploy an intrusion detection system (IDS) that accurately classifies network traffic using machine learning methods. Furthermore, in the context of intrusion detection, we undertook a full performance evaluation of five machine learning methods (logistic regression, decision tree, random forest, XGBoost, and k-nearest neighbor). We also concentrated on tuning the hyperparameters of these machine learning algorithms to achieve peak performance in the intrusion detection task. Finally, our research intends to greatly improve the overall security of computer systems and networks by identifying and mitigating potential security threats such as infiltration, misuse, and anomalies.
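One HO-XGB configuration might be trained and scored as sketched below; the tunable values are placeholders standing in for a Table 3 column, while the fixed settings follow those stated above (with n_jobs used for the paper's nthread=4).

```python
# Sketch: training and scoring one HO-XGB configuration. Tunable values are
# placeholders for a Table 3 column; fixed settings follow the text
# (random_state=25, nthread=4, scale_pos_weight=1, objective='binary:logistic').
from sklearn.metrics import f1_score, roc_auc_score
from xgboost import XGBClassifier

ho_xgb = XGBClassifier(
    objective="binary:logistic", n_jobs=4, scale_pos_weight=1, random_state=25,
    learning_rate=0.1, n_estimators=200, max_depth=8, min_child_weight=1,
    gamma=0.0, subsample=0.9, colsample_bytree=0.9, reg_alpha=0.01,
)
ho_xgb.fit(X_train, y_train)

proba = ho_xgb.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
print("F1:", f1_score(y_test, ho_xgb.predict(X_test)))
```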

CONCLUSION
The goal of this research is to create a strong intrusion detection system that uses machine learning techniques and tailored hyperparameters to improve the security of our increasingly connected digital world. We intend to make a substantial contribution to the ongoing effort to secure sensitive data and networks from harmful attacks by fulfilling these goals. The proposed approach was tested on the real-world CSE-CIC-IDS2018 dataset, and the performance of XGBoost was compared to that of traditional classification models using metrics such as accuracy, area under the ROC curve (AUC), recall, and F1 score obtained from 10-fold cross-validation. According to the findings, XGBoost surpasses the other detection algorithms. To fully exploit the benefits of XGBoost, we created the HO-XGB model, which entails fine-tuning multiple hyperparameters. We investigated the effect of learning_rate, subsample, max_leaves, max_depth, gamma, colsample_bytree, min_child_weight, n_estimators, and reg_alpha on algorithm performance. HO-XGB1's remarkable performance may be attributed to the careful selection and tuning of hyperparameters. The model successfully lowered complexity by tuning max_depth and min_child_weight, producing excellent results with a ROC score of 0.999991 and an F1 score close to 1.0. This outperformance of the other parameter settings on the CSE-CIC-IDS-2018 dataset demonstrates that HO-XGB1 is the better solution for network intrusion detection. HO-XGB1 was able to successfully address the intrusion detection challenge because of the attention paid to hyperparameter tuning and the comprehensive assessment of model complexity, making it a highly effective solution for real-world cybersecurity applications. Despite the many machine learning techniques proposed for intrusion detection systems (IDS), the desired level of performance has not yet been achieved. This is because the types of network attacks have evolved, highlighting the need to update the datasets used for evaluating IDS.
Moving forward, we plan to explore various clustering strategies to enhance accuracy across a range of domains.

Table 1.
XGBoost parameters tuned for the model
learning_rate: the step-size shrinkage applied at each boosting step.
subsample: the ratio of training instances that are randomly selected for fitting each individual tree.
max_leaves: the maximum number of nodes that can be added to a tree.
max_depth: the maximum depth allowed for a tree.
gamma: the minimum amount of loss reduction that is required to partition further.
colsample_bytree: the ratio of features/columns that are randomly selected for fitting each individual tree.
min_child_weight: the minimum weight required for instances to be included in a leaf.
reg_alpha: the L1 regularization term on weights.

Table 2.
Comparison of XGBoost with traditional machine learning algorithms

Table 4.
Comparison of the results obtained with the various hyperparameter tuning configurations