Improving malware classifier performance using cost-sensitive learning for imbalanced datasets

ABSTRACT


INTRODUCTION
Malware is malicious code designed to install covertly on a target system. The malicious intention could be destroying data, installing additional malicious programs, exfiltrating data, or encrypting data to demand a ransom [1]. Malware compromises the confidentiality, integrity, and availability of the user's data. The landscape of malware is constantly evolving. In the past, malware was typically created to be fast and easily detectable, often carrying out destructive actions shortly after infecting a system [2], [3]. Older types of malware had specific procedures for dealing with different types of infections. However, today's malware is designed to be stealthy and difficult to detect. It spreads slowly over time, gathering information over a long period before exfiltrating it. Modern-day malware tends to utilize a single set of procedures, as most attacks are blended and incorporate multiple methods [4], [5].
In cybersecurity, the use of artificial intelligence (AI) is becoming necessary [6]-[8]. Many works in the literature focus on solving the imbalanced data issue [9], [10], since most algorithms are designed to work well with balanced databases. Recently, many researchers have worked with the malware visualization technique. This method deals indirectly with the malicious code: the main idea is to visualize a malicious binary executable as a grayscale or colored image. These images are represented as arrays of values in the range (0, 255). This technique was initiated by Nataraj in 2011 [11], who delivered the Malimg database containing 9,369 malware images in 25 classes. Researchers use many methodologies [12], [13], whose common point is the very first step of malware visualization (dealing with images). A malware detection method was proposed by Di Wu [14], which utilized cascading extreme gradient boosting (XGBoost) and cost-sensitive techniques to handle unbalanced data. The method used application programming interface (API) calls extracted from portable executable (PE) files as features, and adopted a three-tier cascading XGBoost approach for data balancing and model training. Di Wu used a database that contained two classes, malicious and benign API calls, achieving a high accuracy of 99% with this method. In a separate study, Roland Burks [15] incorporated generative models to generate synthetic training data for malware detection. Two models were utilized, the generative adversarial network (GAN) and the variational autoencoder (VAE), with the goal of improving the performance of a residual network (ResNet-18) classifier. The addition of synthetic malware samples to the training data resulted in a 2% accuracy improvement for ResNet-18 using the VAE, and a 6% accuracy improvement using the GAN.
In this paper, we perform malware classification into 25 malware families. To deal with imbalanced data, we propose a new approach to calculating weights as part of a cost-sensitive learning application. We then evaluate the proposed approach using two different convolutional neural network (CNN) models that we developed from scratch using the functional and subclassing Keras APIs. We compare the proposed weights approach with classical approaches such as weights calculated using sklearn and random weight values. The overall goal of this work is to increase the performance of the classifier while working with imbalanced data. Finally, we reach our goal: our proposed weights approach performs better than the other techniques, and better than not using any cost-sensitive learning approach. This manuscript is structured as follows: first, an introduction to malware classification challenges and how researchers deal with imbalanced data; second, the proposal description; third, the methods and materials used in the whole approach, followed by the experiments and obtained results. Finally, we discuss these results, compare them with others in the literature, and conclude with future perspectives.

PROPOSAL
This article's contribution is to propose a weights approach for cost-sensitive learning to deal with imbalanced data in general, and malware image data in particular, as shown in Figure 1. We demonstrate that the classically used weights are not effective in the case of many classes, as we have 25 classes. We then evaluate our approach using two CNN models, built with the functional and subclassing APIs. All the experiments share a common goal: to detect and classify malware variants effectively into their corresponding families. We can then clearly see the improvement between the classical weights and the proposed weights for 25 classes as a use case, in terms of classification metrics. So, our main contribution includes:
− Proposing customized weights for cost-sensitive learning to deal with the Malimg imbalanced database.
− Evaluating the cost-sensitive approach using two CNN models, and comparing it with the classical approach.

Image representation of a malware
Malware visualization is an area focused on detecting, classifying, and presenting malware features in the form of visual cues that can be used to convey more data about a specific malware type. Visualization techniques can be used to display static data, monitor network traffic, or manage networks. In [16], the visualization technique is used to discover and visualize malware behavior. Recently, researchers have focused on the development of orthogonal methods motivated by signal and image processing to deal with malware variants. They take advantage of the fact that most malware variants have a similar structure, since new malware is in most cases simply a variant of existing malware. So, a malware binary is treated as a digital signal, and signal and image processing techniques are applied. These techniques have proved effective in malware classification and detection in many studies [17]. The traditional way to view and edit malware binaries is with hex editors, which show the bytes of the binary file in hexadecimal format. In [11], the authors proposed a new method to view binary files as grayscale images or signals. A malware binary is read as a vector of 8-bit unsigned integers, as shown in Figure 2. These integers are then organized into a 2D array, so that the file can be viewed as a grayscale image in the range [0-255]. After converting a malware binary to a grayscale image, the image itself keeps a significant structure, as described in an earlier work [18]. The binary fragments of a malware sample show distinctive image textures, which has allowed us to classify malware images effectively for years.
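The conversion just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the helper name `binary_to_grayscale` and the fixed image width are our assumptions (in [11] the width is chosen as a function of the file size).

```python
import numpy as np

def binary_to_grayscale(path, width=256):
    # Read the executable as a flat vector of 8-bit unsigned integers...
    data = np.fromfile(path, dtype=np.uint8)
    # ...then reshape it into a 2D array, dropping any trailing bytes that
    # do not fill a complete row. Each value in [0, 255] is one pixel.
    height = len(data) // width
    return data[: height * width].reshape(height, width)
```

The resulting array can be saved or displayed directly as a grayscale image, e.g. with `matplotlib.pyplot.imshow(img, cmap="gray")`.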

Database: Malimg
Malimg stands for malware images. It is a widely used database [19] in malware classification contexts. Most works cover malware image classification using machine learning and deep learning models. In our case, we already have a large dataset (a total of 9,369 images in 25 families); however, the problem is that these data are not balanced, as shown in Figure 3. Some classes have many samples, more than 1,200, while others have fewer than 100. This difference creates an imbalanced dataset. One of the rules in machine learning and deep learning is to balance the dataset, or at least get it close to balanced. The main reason, in layman's terms, is to give equal priority to each class.
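To quantify what "imbalanced" means here, a simple helper (hypothetical, not part of the paper) can report the ratio between the largest and smallest class; for Malimg, with over 1,200 samples in the largest family and fewer than 100 in the smallest, this ratio is well above 10.

```python
from collections import Counter

def imbalance_ratio(labels):
    # Count samples per class and return max/min class size.
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```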

Cost-sensitive learning approach
Methods for addressing class imbalance can be divided into three main categories. Data-level preprocessing: methods that operate on the training dataset and change its class distribution using resampling techniques; these methods aim to alter datasets in order to make standard machine learning algorithms work. Cost-sensitive learning: here we keep the training dataset unchanged and assign different penalties to the misclassification of samples, which causes the machine learning algorithm to pay more attention to samples from the minority class [20]-[23]. Ensemble learning: combines multiple techniques from one or both of the previous categories (data-level preprocessing and/or cost-sensitive learning); this category is broadly referred to as ensemble learning and can be viewed as a wrapper around other methods.
In general, the goal of a machine learning algorithm is to minimize a cost (loss) function (1). In cost-sensitive learning, we modify this cost function to take into account that the cost of a false positive and the cost of a false negative may not be the same. Below is the standard cost function for the logistic regression classifier, also known as binary cross-entropy loss. In logistic regression, we call the positive class 1 and the negative class 0; these values are just for convenience, and it does not really matter which numerical values we give to each class, as long as the two classes get distinct labels.

J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]   (1)

where m is the number of training samples, y_i is the actual label, and \hat{y}_i is the predicted probability; the term -y_i \log(\hat{y}_i) represents the cost for y_i = 1 (minority) and -(1 - y_i) \log(1 - \hat{y}_i) represents the cost for y_i = 0 (majority). For the modified cost function (2), we define two class weights, w_1 and w_0, to incorporate the significance of each class in the cost function:

J_w = -\frac{1}{m} \sum_{i=1}^{m} \left[ w_1 y_i \log(\hat{y}_i) + w_0 (1 - y_i) \log(1 - \hat{y}_i) \right]   (2)

In general, w_j is defined as the total number of samples over the number of classes times the number of samples in class j:

w_j = \frac{n}{K \cdot n_j}
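The modified cost function can be written directly in NumPy. This is a minimal sketch (the helper name `weighted_bce` is ours, not the paper's); setting w1 = w0 = 1 recovers the standard binary cross-entropy.

```python
import numpy as np

def weighted_bce(y_true, y_pred, w1=1.0, w0=1.0, eps=1e-12):
    # Cost-sensitive binary cross-entropy: w1 scales the cost of mistakes
    # on the positive (minority) class, w0 on the negative (majority) class.
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(w1 * y_true * np.log(y_pred)
                    + w0 * (1 - y_true) * np.log(1 - y_pred))
```

With w1 > w0, errors on positive samples dominate the loss, so gradient descent is pushed to fit the minority class better.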
Here, n is the total number of samples, K is the number of classes in the dataset, and n_j is the number of samples in class j. Take binary classification as an example. To see the difference between the original cost and the modified cost, first consider a balanced dataset, where n_1 = n_0 = m/2. Plugging these two values into the previous equation gives w_1 = w_0 = 1: the weights are identical, which means we put the same weight on mistakes in terms of false positives and false negatives. However, if the two classes are imbalanced, say the minority class holds 10% of the total number of samples and the majority class 90%, then plugging these values into the equation gives w_1 = m / (2 × 0.1m) = 5 and w_0 = m / (2 × 0.9m) ≈ 0.56. The weight of the minority class is greater than the weight of the majority class; given the cost function above, this means we pay more attention to the minority class. Moving to the weights calculation: first, we use the sklearn function to compute weights. This function is an implementation of the previous formulas, so there is no need to redo it. As presented in Table 1, these weights are tiny, on the order of 10⁻⁶. After that, we use random values of 1 and 2 for all 25 classes. Neither of these two weighting methods is effective in our case, which led us to think of a new way to compute weights.
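The w_j formula is easy to implement by hand; the sketch below (hypothetical helper name) reproduces the 10%/90% worked example, and is the same "balanced" heuristic that sklearn's `compute_class_weight` implements.

```python
import numpy as np

def class_weights(counts):
    # w_j = n / (K * n_j): total samples over
    # (number of classes * samples in class j).
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), len(counts)
    return n / (k * counts)
```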
In general, we have the simplified formula for the weight of class i:

w_i = \frac{1}{K \cdot p_i}   (5)

where K is the number of classes and p_i is the percentage of class i over the database. As planned, our weights respect the ordering of the classes. Based on the new weights given in Table 2, the model will give more attention to classes with a high weight (class E), and less attention to classes with a low weight (class A).
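The grouping idea behind the proposed approach, as described in the conclusion (order the classes by sample count, merge them into subclasses of five, and give every class in a subclass the same weight, inversely proportional to the subclass share), can be sketched as follows. All names are ours, and applying w_i = 1/(K · p_i) per subclass is our reading of the paper; the exact values of Table 2 are not reproduced.

```python
import numpy as np

def grouped_weights(counts, group_size=5):
    # Order the classes from largest to smallest.
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(counts)[::-1]
    n, k = counts.sum(), len(counts)
    weights = np.empty(k)
    # Walk over subclasses of `group_size` classes each.
    for start in range(0, k, group_size):
        idx = order[start:start + group_size]
        share = counts[idx].sum() / n      # p_i: the subclass share
        weights[idx] = 1.0 / (k * share)   # w_i = 1 / (K * p_i)
    return weights
```

Classes in the smallest subclass receive the largest weight, which is exactly the "more attention to classes having few samples" behavior the approach targets.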

Convolutional neural network
TensorFlow provides three methods for building deep learning models: the sequential API, the functional API, and model subclassing. Model subclassing is a high-level API style using pure object-oriented programming concepts, and it is rarely used. This method gives its user the chance to customize everything. In contrast to the sequential and functional APIs, model subclassing provides full control over every nuance of the network and the training process. The first CNN model is composed of different layer types including Conv2D, MaxPooling2D, ZeroPadding2D, dropout, flatten, and dense. The construction is given in Figure 4.
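As an illustration of the functional API with the layer types listed above, a model might be assembled as follows. The filter counts, input size (64×64 grayscale), and layer ordering here are our assumptions; the paper's exact architecture is the one shown in Figure 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_functional_cnn(input_shape=(64, 64, 1), num_classes=25):
    # Functional API: each layer is called on the previous tensor.
    inputs = layers.Input(shape=input_shape)
    x = layers.ZeroPadding2D(padding=1)(inputs)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    # 25 outputs, one per Malimg family.
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)
```

At training time, cost-sensitive weights are passed to Keras as a dictionary, e.g. `model.fit(x_train, y_train, class_weight={0: w0, 1: w1, ...})`.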

Tools
Powerful hardware is essential for image processing. In our laboratory, we use an NVIDIA Quadro T1000 with Max-Q graphics processing unit (GPU) workstation, whose high compute capability of 7.5 allows us to process images quickly compared to other devices. For deep learning with TensorFlow, we found that installing the compute unified device architecture (CUDA) toolkit and the CUDA deep neural network library (cuDNN) in the GPU environment was necessary, along with additional required Python packages. While attempting to use a simple Conda command for the installation, we encountered numerous errors, leading us to recommend a manual installation and configuration to save time and ensure a successful installation. A manual installation guide for TensorFlow can be found in reference [25].

RESULTS AND DISCUSSION
As a result of using cost-sensitive learning, we demonstrate that the cost-sensitive technique effectively allows us to improve the performance of a malware classifier. We compare four evaluation metrics using two deep learning models: a functional CNN model and a subclassing CNN model. For both models we used different cost-sensitive methods: our proposed weights, random weights, sklearn function-based weights, and no cost-sensitive learning. Each time, we computed the classical evaluation metrics: loss, accuracy, precision, and recall. The best performance goes to the subclassing CNN model with cost-sensitive learning using our proposed weights approach: the loss is 1%, the accuracy is 98.46%, the precision is 98.5%, and the recall is 98.42%. These values remained the best over several experimental tests. In addition, cost-sensitive learning using our proposed weights approach also gives the best results compared to the other methods for the functional CNN model. So, we proved the efficacy of this approach in the case of many classes, 25 in our case. In contrast, the classical weights calculated with the sklearn function are the worst with the first model, and average with the second model. Results are presented in Figure 6 and Table 3. So, customized weights are effective with the Malimg imbalanced database. In general, we can apply the same approach to deal with any imbalanced data. The point is not to use the default weights; especially when working with a multiclass database, we can regroup classes to calculate weights. The proposed weights approach gives a detailed calculation for 25 classes; in other contexts the same idea can be applied. In the literature, most works using cost-sensitive learning in different domains and applications show an improvement when using this technique to deal with imbalanced data [26], [27]. For instance, in [14] cost-sensitive learning was implemented for binary classification, and the obtained result reached 99%. Then, for the Malimg imbalanced database, researchers in [15] used a GAN, which also gives acceptable results (90%). In our paper, we proposed the cost-sensitive weights approach to deal with the Malimg imbalanced data, and the obtained result is 98%. The most important point is that all classes receive appropriate attention and weights: even if a class has few samples, we give it a good weight, and the classifier is able to recognize this class more effectively than before. The final performance of an overall model without cost-sensitive learning, or any other technique for dealing with imbalanced data, can be very high; but when we look at the details, we find that the model fails to recognize some classes (mainly those with less data) effectively or with high accuracy [28], [29].

CONCLUSION
Summing up, in this work we investigated cost-sensitive learning for advancing the classification of imbalanced data. A new cost-sensitive weights computation was proposed and evaluated using two CNN models along with evaluation metrics. The main goal is to improve the performance of malware classification into the corresponding families. So, we proposed a new cost-sensitive approach using customized weights to deal with an unbalanced database. We order the classes by the number of samples, then we form subclasses where each new class groups five of the malware classes. We then compute the weights, and each new weight is given to all malware families belonging to the new subclass. The idea is to give more attention to classes with few samples. After that, we compared the proposed weights to the classically computed weights. When applying the proposed weights, the model performance improved clearly for both CNN models. In conclusion, we recommend using customized weights in the case of many classes, e.g. 25 classes, in order to improve the overall performance, and especially the performance within minority classes. The best result in this paper comes from the customized cost-sensitive approach with the CNN subclassing model, where we improved the accuracy by +0.1% (and likewise the other metrics). As future work, we aim to develop a framework based on our methods to defend against malware using malware images and deep learning. We also look forward to using a GAN as a data augmentation technique and comparing it with the current findings. Moreover, we found that the most flexible way to build CNN models is with the subclassing API: it gives the developer many possibilities to customize literally everything. We look forward to diving deeper into this area and proposing customized layers and functions.

Figure 1. Workflow of the proposed solution

Figure 4.
We train and evaluate the model without cost-sensitive learning. Here we use the Keras TensorFlow functional API. The model architecture is simple, with well-known layers, as shown in Figure 4. The second model architecture is given in Figure 5. First, we build the CNNBlock. Second, we create the ResBlock based on the previous block. Third, we build the global malware detection model, which contains the previous blocks in addition to other widely known layers such as MaxPooling, flatten, and dense. We train and evaluate this model without cost-sensitive learning, then using the default weights, and finally the proposed weights. Here, we have more flexibility and more options to customize in terms of coding, as shown in Figure 5.
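The subclassing construction (CNNBlock, then ResBlock, then the global model) might be sketched as below. Only the block names come from the paper; the filter sizes, the batch-norm inside CNNBlock, and the 1×1 skip projection are our assumptions, since the exact definitions are those shown in Figure 5.

```python
import tensorflow as tf
from tensorflow.keras import layers

class CNNBlock(layers.Layer):
    # Convolution + batch norm + ReLU as one reusable unit.
    def __init__(self, filters, kernel_size=3):
        super().__init__()
        self.conv = layers.Conv2D(filters, kernel_size, padding="same")
        self.bn = layers.BatchNormalization()

    def call(self, x, training=False):
        return tf.nn.relu(self.bn(self.conv(x), training=training))

class ResBlock(layers.Layer):
    # Two CNNBlocks with a skip connection; a 1x1 conv maps the input
    # to the right channel count before the addition.
    def __init__(self, filters):
        super().__init__()
        self.block1 = CNNBlock(filters)
        self.block2 = CNNBlock(filters)
        self.identity = layers.Conv2D(filters, 1, padding="same")

    def call(self, x, training=False):
        out = self.block2(self.block1(x, training=training), training=training)
        return out + self.identity(x)

class MalwareModel(tf.keras.Model):
    # Global model: the blocks above plus MaxPooling, flatten, and dense.
    def __init__(self, num_classes=25):
        super().__init__()
        self.res = ResBlock(32)
        self.pool = layers.MaxPooling2D()
        self.flatten = layers.Flatten()
        self.out = layers.Dense(num_classes, activation="softmax")

    def call(self, x, training=False):
        x = self.pool(self.res(x, training=training))
        return self.out(self.flatten(x))
```

Because every block is a plain Python class, each forward pass and each layer can be customized freely, which is the flexibility the subclassing API is chosen for here.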

Figure 6. Loss, accuracy, precision, and recall curves using various techniques


Table 2. The proposed weights approach

Table 3. Results of cost-sensitive implementation