IAES International Journal of Artificial Intelligence (IJ-AI)

Received Jun 22, 2022; Revised Oct 3, 2022; Accepted Nov 2, 2022

ABSTRACT
Precise traffic classification is essential to numerous network functions such as routing, network management, and resource allocation. Traditional classification techniques have become insufficient due to the massive growth of network traffic, which demands high computational cost. The emerging paradigm of software defined networking (SDN) has adjusted the network architecture to provide a centralized controller that preserves a global view over the entire network. This paper proposes a model for SDN traffic classification based on machine learning (ML) using the Spark framework. The proposed model consists of two phases: learning and deployment. In the learning phase, a ML pipeline is constructed, consisting of a set of stages combined as a single entity. Three ML models are built and evaluated: decision tree, random forest, and logistic regression, for classifying 75 well-known applications, including Google and YouTube, accurately and on a short time scale. A dataset consisting of 3,577,296 flows with 87 features is used for training and testing the models. The decision tree model is selected for deployment according to the performance results, which indicate that it has the best accuracy with 0.98. The performance of the proposed model is compared with state-of-the-art works, and a better accuracy result is reported.


INTRODUCTION
Accurate traffic classification is of essential importance to different network activities like monitoring, resource utilization, and enabling quality of service features, e.g., traffic shaping and policing [1]. Traditional network traffic classification approaches, such as the port-based approach, identify an application by examining the packet header. However, this approach is unreliable, since many current applications use unusual port numbers or select ports dynamically, raising the false-negative rate of the classifier. In some situations, illegitimate applications hide behind standard ports to avoid being filtered, increasing the false-positive results of classifiers due to undetectable applications. In addition, it is infeasible to recognize the actual port numbers when handling encrypted data [2].
Deep packet inspection (DPI) was developed to overcome the insufficiency of the port-based approach. It inspects the contents of the packet rather than its header [3]. Although this approach is considered reliable, it has some weaknesses. First, it is computationally expensive, since it needs several accesses to the packet content. Second, it is impossible to examine an encrypted packet with this method. Finally, privacy challenges arise when inspecting packet contents. Consequently, this method struggles to cope with an enormous number of flows and a rapid rate of network traffic. Researchers are still seeking an efficient way to classify applications with high performance and speed. The previous works present valuable insights into more intelligent networks. However, to the best of our knowledge, none of the current approaches uses the Spark framework as a ML tool for SDN traffic classification. Spark is a leading big data framework that can build ML models and train them on massive data faster than competing tools. Therefore, the main objective of this work is to design a Spark-based traffic classification model as a step toward including intelligence in SDN.

THEORETICAL CONCEPTS
Before diving deeply into the proposed SDN traffic classification model, it is essential to describe the theoretical fundamentals related to the proposed work. This section briefly introduces the concepts and techniques applied in this work. It covers Apache Spark and its characteristics, the Spark application architecture, the Spark ML library (MLlib), the ML pipeline, the dataset used, the feature groups of the selected dataset, and the variance thresholding technique used for dimensionality reduction.

Spark
Spark is a big data framework that manages and coordinates the execution of tasks on data across a cluster of computers [13]. It can run workloads up to 100 times faster than Hadoop MapReduce when processing in memory. Spark provides a combination of fault tolerance, in-memory processing, scalability, and speed [14], [15]. The cluster is managed by a cluster manager such as yet another resource negotiator (YARN), Mesos, or Spark's standalone cluster manager. In this work, the cluster is managed by Spark's standalone cluster manager.
As shown in Figure 1, a Spark application consists of a driver process and a set of executor processes [16]. Part of the driver's work is to analyze, distribute, and schedule work across the executors. In turn, the executors are responsible for carrying out the work assigned to them by the driver [17]. Apache Spark is primarily developed in Scala [18]. To use Spark with Python, PySpark was released as an interface for Spark in Python. It combines Python's simplicity and Spark's power to handle big data projects efficiently. Figure 1 clarifies the relationship between the Spark session and Spark's language application programming interfaces (APIs).

Machine learning with Spark
Recently, huge volumes of structured and unstructured data have been generated by both humans and machines. As a result, solving ML problems with traditional techniques faces a big challenge, hence the need for a distributed ML framework to handle these problems efficiently. Developed on top of Spark, MLlib is a library that provides preprocessing, model training, and prediction at scale [19]. Various ML tasks can be performed using MLlib, such as classification, regression, clustering, deep learning, and dimensionality reduction. MLlib integrates seamlessly with other Spark components like Spark Streaming, Spark SQL, and DataFrames [20]. In Spark, a DataFrame is a collection of data arranged into named columns and distributed across multiple nodes.

Machine learning pipeline
The concept of Pipelines is to ease the creation, tuning, and examination of ML workflows. It consists of stages chained together to automate a ML workflow [21]. Each stage is either an estimator or a transformer. An estimator is an abstraction of an algorithm fitted on a dataframe to create a transformer; e.g., a learning algorithm is an estimator which trains on a dataframe and develops a fitted model. A transformer is an algorithm that transforms one dataframe into another by deleting, adding, or updating existing features in the dataframe. For example, a ML model is a transformer that transforms a dataframe with features into a dataframe with predictions appended as columns. Pipeline stages are run consecutively, and the input dataframe is converted as it goes through each stage. The pipeline design is elaborated in section 4.
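To make the estimator/transformer contract concrete, the following toy Python sketch (deliberately not Spark's actual API) mimics it: calling fit() on an estimator returns a transformer, and a pipeline fits and applies its stages in order.

```python
# Toy illustration of the estimator/transformer contract: an estimator's
# fit() returns a transformer; a pipeline chains stages, fitting
# estimators in order and passing the transformed data onward.

class MeanImputer:                       # estimator: fit() -> transformer
    def fit(self, rows):
        observed = [r for r in rows if r is not None]
        mean = sum(observed) / len(observed)
        return MeanImputerModel(mean)

class MeanImputerModel:                  # transformer: transform() -> new data
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [self.mean if r is None else r for r in rows]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):    # estimator: fit it first
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows) # feed output to the next stage
        return fitted

model = Pipeline([MeanImputer()]).fit([1.0, None, 3.0])
print(model[0].transform([None, 5.0]))   # [2.0, 5.0]: gap filled by the mean
```

Spark's real Pipeline follows the same shape, but operates on DataFrame columns rather than Python lists.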

Dataset description
This paper uses the "internet protocol (IP) network traffic flows labelled with 75 Apps" dataset available in the Kaggle repository [22]. It is a suitable choice for this work since it is real-world and diverse. It was collected in the network section of the University of Cauca, Colombia. It consists of 3,577,296 records stored as a comma-separated values (CSV) file [23]. The dataset includes 87 features, and each record carries IP flow information such as source and destination IP addresses, port numbers, and interarrival time. Numeric features are the majority in this dataset; in addition, there are nominal features and a date type (timestamp) feature. Table 1 presents these features by category. Nearly all network traffic classification datasets are built to recognize the class of application an IP flow carries, such as world wide web (WWW), file transfer protocol (FTP), or domain name system (DNS). This dataset goes even further by enabling ML models that detect 75 applications, such as Dropbox, YouTube, and Google.

Feature selection
It is common to have hundreds or even thousands of features in today's datasets. More features might give more information about each record. However, these additional features might introduce complexity without offering valuable information [24]. In ML, the biggest challenge is to build robust predictive models using a minimum number of features. The idea of feature selection is to reduce the number of input features when building a predictive model to enhance the overall performance. It aims to mitigate problems such as the curse of dimensionality and computational cost.
Nevertheless, given the sizes of massive datasets, it is not easy to figure out which features are important and which are not. This work uses the variance thresholding technique, a robust, fast, and lightweight technique that removes features with very low variance, i.e., features carrying little information. Variance measures the spread of a distribution as the average squared distance from the mean. Features with variance equal to zero add complexity to the model without any benefit to its predictive power. The sample variance is calculated as [25]:

σ² = (1/n) Σᵢ (xᵢ − x̄)²

in which σ² is the sample variance, xᵢ is the i-th feature value, x̄ is the feature mean, and n is the number of feature records. The application of this method is explained in section 4.
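The variance computation and the thresholding filter can be sketched in plain Python as below; the feature names and values are invented for illustration (in Spark MLlib this is provided by the VarianceThresholdSelector transformer).

```python
# Plain-Python sketch of variance thresholding: features whose sample
# variance does not exceed the threshold (default 0, i.e. constant
# columns) are dropped.

def variance(values):
    # sigma^2 = (1/n) * sum((x_i - mean)^2)
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def select_by_variance(columns, threshold=0.0):
    return {name: vals for name, vals in columns.items()
            if variance(vals) > threshold}

features = {                               # made-up flow features
    "flow_duration": [10.0, 250.0, 30.0],
    "constant_flag": [1.0, 1.0, 1.0],      # zero variance: no information
}
kept = select_by_variance(features)
print(sorted(kept))                        # ['flow_duration']
```

The zero-variance column is removed because it cannot help discriminate between classes.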

THE PROPOSED MODEL FOR SDN TRAFFIC CLASSIFICATION USING SPARK
This section presents the proposed model for SDN traffic classification using Spark. The main contribution of this paper is located in the SDN application plane. In this model, shown in Figure 2, two phases are introduced: learning and deployment. The phases are explained in subsections 4.1 and 4.2, respectively.

Learning phase
This phase (illustrated in Figure 3) highlights the essential contribution of this work. In this phase, a ML pipeline is designed as a powerful method to automate complicated ML workflows. Before designing the pipeline, the nominal features and the timestamp feature are dropped from the dataset, since the vector assembler, the second stage in the pipeline, accepts only numeric, Boolean, and vector types [25]. The dropped features are ProtocolName, FlowID, SourceIP, DestinationIP, and label (string type features), and timestamp (date type feature). In addition, duplicated records are removed, since they might cause non-random sampling and could bias the fitted model [7]. The number of removed records equals 10,949; as a result, the number of records becomes 3,566,347.
Afterwards, the resulting DataFrame is split into 70% for training and 30% for testing, constructing two subsets. The first subset (the training DataFrame) is used to train the model. The second subset (the testing DataFrame) is used for model evaluation, to assess how the model performs on unseen data. The training DataFrame consists of 2,493,090 records, and the testing DataFrame consists of 1,073,257 records. The pipeline stages, illustrated in Figure 3, are:
− Imputer: handling missing values is an essential step because many ML algorithms do not accept such values [26]. The imputer is an estimator that completes the missing values with the mean, median, or mode of the numerical columns. In this case, the mean is used, calculated from the remaining values in the related column.
− Vector assembler: Spark ML works differently from other systems; it operates on a single vector column rather than an array of different columns. The raw features are combined into a single vector so the data can be scaled in a later stage. The vector assembler is a transformer that combines multiple columns into a single vector column. Figure 4 shows the result of this stage for the first record. The length 81 refers to the number of remaining features after removing the nominal features, while the indices refer to the feature index. For example, in this vector, the fifth feature equals zero; therefore, it is not included in the resulting vector.
− Feature selection: ten features are removed using the variance threshold method. Figure 5 shows the result of this stage for the first record; the vector length changes from 81 to 71.
− Standard scaler: while this is an optional stage, it helps in reducing the convergence time. Figure 6 shows the result of this stage for the first record.
− String indexer: in this stage, the L7 protocol feature is mapped from string labels to a column of label indices. The order depends on label frequency; i.e., the most frequent label acquires index 0, and so on. For example, Google has a label of 0, and hypertext transfer protocol (HTTP) has a label of 1.
− ML model building: this is the final stage of the pipeline, in which the ML models are built using the outcomes of the previous stages. Table 2 shows the input and output of each stage in the designed ML pipeline.
Three ML algorithms available in the MLlib library are utilized: decision tree, random forest, and logistic regression. Although the gradient-boosted tree is considered one of the most prominent and influential ML algorithms [27], its multiclass version is currently not implemented in Spark MLlib. Furthermore, the naive Bayes algorithm could not be used because it requires non-negative feature values, while the deployed dataset includes some negative values. In addition, hyperparameters are configurations that specify the main structure of the model and influence the training process, namely the model architecture and regularization. The hyperparameters of all the models are set according to Table 3. The deployed ML algorithms with their tuned parameters are:
a) Decision tree: the decision tree (DT) is a supervised learning algorithm that handles continuous and discrete data. Data in a DT is split continuously according to a specific parameter. It is used to represent decisions and decision-making explicitly [28]. As the name suggests, DT is a tree-based model characterized by its simplicity in understanding decisions and its ability to select the most preferential features [29]. In addition, it can classify data without vast calculations [30].
b) Random forest: random forest (RF) is a supervised learning algorithm used in classification problems. It depends on ensemble learning, which unites multiple classifiers to solve complicated problems and enhance model performance.
One of RF's strengths is its efficiency in handling massive training datasets [31].
c) Logistic regression: logistic regression (LR) is a predictive-analysis, supervised learning algorithm for classifying categorical variables. It is built on the concept of probability. In LR, the output is transformed using the logistic sigmoid function to return a probability value.
After all the stages are prepared, they are placed in the pipeline. Using the training DataFrame, the pipeline is fitted to produce the ML models, which are evaluated and then used in the deployment phase to make predictions. This is illustrated in part (b) of Figure 3.

Deployment phase
The deployment phase consists of the messaging system, Spark streaming, and the ML model. The messaging system is responsible for transmitting traffic data from the SDN controller to Spark streaming to perform the required analysis. The chosen messaging system should be scalable, fault-tolerant, elastic, and able to transfer high volumes of data in real time with low latency. Apache Kafka has all these capabilities [32]. In addition, it can be integrated conveniently with open network operating system (ONOS) and OpenDaylight (ODL), since these controllers have northbound plugins that allow real-time event streaming into Kafka [33], [34]. The SDN controller publishes traffic flow data as messages on Kafka using a common topic. Then, Spark streaming subscribes to that topic and acquires the message streams from Kafka. Spark streaming represents the analytics point that performs data cleaning and preprocessing to generate the required information for the ML model [35]. The ML models built in the learning phase are evaluated in terms of accuracy and speed.
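A configuration sketch of this ingestion path is given below. It is not runnable as-is: it assumes a reachable Kafka broker, the spark-sql-kafka connector on the classpath, an existing SparkSession named `spark`, and a hypothetical topic name `sdn-flows`.

```python
# Configuration sketch only; broker address and topic name are hypothetical.
flows = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sdn-flows")   # topic the controller publishes to
         .load())

# Kafka exposes key/value as binary; the flow record arrives in `value`
records = flows.selectExpr("CAST(value AS STRING) AS flow_record")
```

From here, `records` would be parsed and preprocessed before being passed to the fitted ML model.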
Based on the evaluation results, the ML model with the best performance is used in the deployment phase to predict the application class. These predictions are utilized in making management decisions such as resource allocation, routing, and load balancing, to improve the network performance and lower the time required to detect security threats. The deployment phase implementation is currently beyond the scope of this work.

ML MODELS EVALUATION AND TESTING
This section evaluates and tests the ML models built in the learning phase. The pipeline implementation, evaluation, and testing are done on a laptop with an Intel Core(TM) i7 CPU and 16 GB of installed memory (RAM). The software tools are Apache Spark 3.2.1 and Python 3.10. By default, the Spark driver memory is configured to 1 GB, which is insufficient for this work since the training dataset size equals 1.64 GB. Therefore, the amount of memory for the driver process is configured to 15 GB to prevent out-of-memory errors.
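The driver-memory setting described above can be applied when the session is constructed; a sketch is given below (the 15 GB value follows the text, and in client mode the property must be set before the driver JVM starts, i.e., before any session exists).

```python
# Configuration sketch: raise driver memory from the 1 GB default so the
# 1.64 GB training DataFrame fits in the driver process.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sdn-traffic-classification")   # hypothetical app name
         .config("spark.driver.memory", "15g")    # value from the text
         .getOrCreate())
```

The same property can equivalently be passed on the command line via spark-submit.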

ML models evaluation
In this subsection, the accuracy of each ML model is computed. The model with the highest performance will be applied in the deployment phase. Accuracy is calculated as [36]:

Accuracy = (True Positives + True Negatives) / All Samples

Table 4 clarifies the per-class evaluation metrics of the DT model together with the support, the total accuracy, the macro average, and the weighted average. Support indicates the number of records per class in the testing DataFrame. The macro average is the arithmetic mean of the (precision, recall, and F1) scores; it treats all classes equally regardless of their support values. The weighted average, in contrast, accounts for each class's support, so a class with fewer records has less impact on the weighted (precision, recall, and F1) scores. Figure 7 shows the overall statistics for the ML models built in the learning phase. The DT model has the best accuracy, precision, recall, and F1 score of the three candidates.

Figure 7. Evaluation metrics for the DT, RF, and LR models
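The accuracy formula and the two averaging schemes can be illustrated with a short plain-Python sketch; the class names, per-class scores, and support counts below are made-up values, not the paper's results.

```python
# Plain-Python sketch of the metrics discussed above: overall accuracy,
# and macro vs support-weighted averaging of per-class scores.

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_avg(per_class_scores):
    # arithmetic mean: every class counts equally
    return sum(per_class_scores.values()) / len(per_class_scores)

def weighted_avg(per_class_scores, support):
    # classes weighted by their record counts (support)
    total = sum(support.values())
    return sum(score * support[c] / total
               for c, score in per_class_scores.items())

print(accuracy(["Google", "HTTP", "Google", "Google"],
               ["Google", "HTTP", "HTTP", "Google"]))   # 0.75
scores  = {"Google": 0.99, "HTTP": 0.90}   # made-up per-class F1 scores
support = {"Google": 900,  "HTTP": 100}    # made-up record counts
print(macro_avg(scores))                   # ~0.945: classes weighted equally
print(weighted_avg(scores, support))       # ~0.981: larger class dominates
```

The gap between the two averages widens as class support becomes more imbalanced, which is why both are reported in Table 4.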

Comparative analysis
A part of the evaluation process is to compare the proposed work with the state-of-the-art. Kuranage et al. [7] is considered for comparison since it used the same dataset [22] for training. To the best of our knowledge, [7] is the only work that used this dataset for SDN traffic classification and prediction. The comparison is fourfold. First, Kuranage et al. [7] did not apply the ML pipeline concept used in this work to automate the ML workflow. Second, in Kuranage et al. [7], feature selection was applied manually, so that only eight features were used in model building, while in this work feature selection is applied via the variance thresholding technique, which removes ten features (detailed in subsection 4.1). Third, Spark is employed as a big data framework in this work for ML model building using the MLlib library, and Spark streaming is used for prototype testing, whereas [7] did not use any big data tool. Finally, the accuracy of the DT model in this work is 0.98, as indicated in the previous section, compared with 0.95 in Kuranage et al. [7].

The prototype testing
To test the ML models produced in the learning phase, there is a need to replicate online data streaming. Therefore, a prototype is implemented in which the testing DataFrame is repartitioned into 1,000 different files; each file has approximately 1,072 records. Generally speaking, Spark streaming accepts data from various sources (e.g., a file source, Flume, or Kafka) and processes it in real time. Figure 8 illustrates the streaming process from a file directory as a data source for Spark streaming. In the implemented prototype, Spark streaming listens to the file directory where the testing files are stored. Since the DT model has the best accuracy among the models, it is used to predict traffic applications. Figure 9(a) shows live data streaming, where the number of streamed and processed records fluctuates around 300 records per second; in this part, the maximum number of files per trigger is set to 1. In Figure 9(b), the number of streamed and processed records is nearly doubled, since the maximum number of files per trigger is set to 2; the processing becomes faster because twice as many files are consumed per trigger. Figure 10 shows a sample snapshot of the streaming, including the actual label, the probability, and the model prediction. As seen in the first row, the vector in the probability column is [1, 0, 0, ..., 0]. The first value in the vector is the probability of class 0 (Google), the second value is the probability of class 1 (HTTP), and so on. The model designates the streamed data to the class with the largest probability. In the first row, the largest probability is 1, so the model assigns the streamed data to class 0 (Google), which is correct compared with the actual label. In the second row, the largest probability is 0.9923805704546593, so the model assigns the streamed data to class 8 (Gmail), which is also correct compared with the actual label.

Figure 9. Input vs processing rate: (a) max files per trigger = 1, (b) max files per trigger = 2. Note: max files per trigger is the maximum number of new files to be processed in every trigger.

Figure 10. Sample output of the label, probability, and prediction on unseen data

CONCLUSION
This paper presented the architecture of a ML pipeline for online SDN network traffic classification using Spark. The proposed model consists of two phases: learning and deployment. A pipeline is designed to streamline and automate the ML processes in the learning phase. Three ML models are built using the Spark ML library: decision tree, random forest, and logistic regression. These models are evaluated with four evaluation metrics: accuracy, precision, recall, and F1-score. Results show that the decision tree model has the best accuracy with 0.98. Spark streaming is incorporated to stream the data to the elected ML model to replicate the online flow of network traffic data. In future work, the second phase of the proposed model, i.e., deployment, is intended to be implemented to utilize the valuable information acquired from the learning phase in diverse network management aspects, including routing, load balancing, and resource allocation.