IAES International Journal of Artificial Intelligence (IJ-AI)

Received Feb 4, 2022 Revised Oct 6, 2022 Accepted Nov 5, 2022 Facial expression recognition (FER) addresses one of the most prevalent forms of interpersonal communication, which carries rich emotional information. The task became even more challenging during the COVID-19 pandemic, when face masks became a mandatory protective measure, occluding the lower face. In this study, a deep convolutional neural network (DCNN) forms the core of both our full-face FER system and our masked-face FER model. The focus was on incorporating knowledge distillation into transfer learning between a teacher model, the full-face FER DCNN, and a student model, the masked-face FER DCNN, by combining the loss between the teacher's soft labels and the student's soft labels with the loss between the dataset's hard labels and the student's hard labels. The teacher-student architecture was trained on FER2013 and a masked customized version of FER2013, reaching accuracies of 69% and 61% respectively. The study therefore shows that knowledge distillation can serve as a means of transfer learning and accuracy enhancement: a regular DCNN model (student only) reaches 46% accuracy, compared to 61% with our approach.


INTRODUCTION
In the field of computer vision, emotion recognition is a prominent area. As machine learning and deep learning techniques became widely available, developing intelligent systems capable of properly recognizing emotions became a reality. Facial expression recognition (FER) technology is important in artificial intelligence (AI) since it assists in understanding the internal states of people [1]-[6]. With the worldwide spread of coronavirus disease 2019 (COVID-19), wearing face masks while interacting in public has become common behavior to protect against infection. Thus, how to improve the effectiveness of existing FER technology on masked faces has become an urgent issue.
Previous research has demonstrated that the mouth may provide a wealth of information about our emotions, such as the difference between fear and surprise or between sadness and disgust [7]. Therefore, current FER models are less effective when more than half of a person's face, including the nose and mouth, is covered by a mask. As an example, happy facial expressions play a key role in reducing anxiety in a physician-patient relationship. Wearing face masks therefore has a negative impact on physician-patient interaction: the physician's ability to assess the patient's sentiments and emotions is compromised if the patient's face is covered, and the patient may also miss the physician's expressions of empathy. This paper studies facial expression recognition for masked faces in order to further assist humans in interpersonal communication during COVID times. The goal is to construct a deep learning model based on a teacher-student architecture that uses knowledge distillation in transfer learning to distinguish facial expressions under mask occlusion. The model classifies masked human faces into the seven main facial expressions: sadness, happiness, disgust, fear, surprise, anger, and neutral.
The paper is structured as follows: section 2 examines existing approaches used in FER in general and under partial occlusion. Section 3 explains the methodology used in the study. Section 4 details our experiments and results. Finally, section 5 draws some conclusions.

RELATED WORK
In recent years, researchers have proposed several models to solve the issue of FER. The key difference between models lies in how the architecture is designed and how spatial-temporal approaches are used to process image sequences. In this section, we present literature that is closely related to our study.

Facial expression recognition
Machine learning based approaches
Machine learning methods are combinations of feature extraction techniques and classifiers [8]. Features can be extracted using geometric or appearance-based techniques. Appearance-based techniques characterize the texture of the face by looking at the entire face as a whole, or at individual parts like the eyes, nose, and mouth (e.g., [9]-[12]). Geometry-based techniques determine the contour of the face's appearance, along with its landmarks (such as the eyes and nose), by tracking facial points (e.g., [13], [14]). To classify mood from the extracted feature values, a variety of approaches employed support vector machines (SVM), K-nearest neighbors (KNN), and random forests [15]-[17].

Deep learning approaches
The deep learning approach is considered a novelty in FER, and several studies have been based on it. Before tackling the details of some of these approaches, below is an overview of previous work on deep learning in FER. Deep learning algorithms use feature extraction to discover and extract different characteristics: a multi-layered data representation architecture in which the network's lower layers perform low-level feature extraction and its final layers perform high-level feature extraction [18]-[20]. For processing videos, recurrent convolutional networks (RCNs) have been proposed [21], in which convolutional neural networks (CNNs) are applied to video frames before being input into a recurrent neural network (RNN) that processes the temporal information. With little training data, these models function well when the target concepts are complicated, but they have drawbacks when using deep networks. DeXpression [22] attempted to solve this difficulty with robust facial expression recognition. Each of its blocks has layers like convolutional, pooling, and rectified linear units (ReLU), and it achieves greater performance by combining numerous features instead of using single features.
A multitask global-local network (MGLN) for facial expression identification was suggested by Yu et al. [23]. It includes two modules: a part-based module (PBM) that learns temporal data from the areas around the mouth, nose, and eyes, and a global face module (GFM) that extracts spatial characteristics from the frame with the peak expression. Using a CNN and a long short-term memory (LSTM) network, GFM and PBM features are combined to capture substantial facial expression variance [23], [24]. More work was also done on deep neural networks, e.g. [25] worked with an 18-layered CNN paired with four pooling layers. Moreover, [26] and [27] further developed deep face recognition systems. Ding et al. [26] introduced their architecture FaceNet2ExpNet, on which [22] built by adding transfer learning. For instance, [28] applied transfer learning using AlexNet, a strong pre-trained model with eight layers in total (five convolutional and three fully connected), combined with SVM classifiers. In the same context, knowledge distillation was originally designed to compress a collection of deep neural networks [29]. The goal is to build smaller models (student models) that fulfill the same function as larger models (teacher models), with the requirement that the student model outperform the baseline model [30].

Facial expression recognition under occlusion
Numerous psychosocial studies were conducted after it was discovered that facial occlusion significantly affects FER. These studies aimed to identify the facial features that are crucial for human perception and recognition of facial expressions from partially obscured faces. Recent research has concentrated on how to use deep neural networks to execute FER directly on occluded face images, without intermediate procedures such as facial occlusion detection, hand-crafted feature extraction, or classifier development [31]. Recent work has incorporated more forms of occlusion and more datasets; however, the majority of existing research is largely founded on a restricted number of artificially created types of occlusion, and progress is still modest. Many more papers were produced in the same context with more techniques; nonetheless, we will focus on CNNs and DCNNs.

METHODOLOGY
The goal of this work is to recognize facial expressions on masked faces. In this part, a solution to the problem of performing facial expression recognition is put forth, as shown in Figure 1. Our approach is divided into three steps. First, we apply a face detection method responsible for detecting faces in the input frames. Second, we use face recognition to identify or verify the identity of a person. Finally, we use emotion detection to classify the expression as one of the seven fundamental human expressions (anger, surprise, fear, disgust, happiness, neutral, sadness).

Face detection
Unprocessed face images contain enormous amounts of data, and feature extraction is necessary to condense this data into smaller sets known as features. Different techniques allow for face detection [32]: we may use a histogram of oriented gradients (HOG) with a linear SVM object detector, or the built-in Haar cascades. Alternatively, we might rely on deep learning-based algorithms to locate faces. Regardless of the method used to find the face in the image, what matters most is acquiring the face bounding box. In fact, face detection using traditional feature descriptors and linear classifiers has proven very effective. In this paper we use dlib's pre-trained face detection model, since our focus is mainly on the facial expression detection model.

Face recognition
Face recognition is the second step after face detection. Once we have retrieved the face boxes, we can match them against our database. To perform this, we use dlib's face recognition module, a pretrained model that matches people's faces based on image embeddings. The identity of a person in a previously unseen image may be determined by calculating the distance between its face embedding and that of a known person: if the embedding is close enough to the embedding of person X, we can claim that the image contains the face of person X.
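The matching rule can be sketched in plain NumPy (the 0.6 threshold follows dlib's commonly used default; the tiny vectors below are illustrative stand-ins for real 128-dimensional face descriptors):

```python
import numpy as np

def match_face(known_embeddings, known_names, query, threshold=0.6):
    """Return the name whose embedding is nearest to `query`,
    or None if no embedding is within `threshold` (Euclidean distance)."""
    dists = np.linalg.norm(known_embeddings - query, axis=1)
    best = int(np.argmin(dists))
    return known_names[best] if dists[best] <= threshold else None

# toy 3-d "embeddings" standing in for dlib's 128-d vectors
known = np.array([[0.0, 0.0, 0.0],   # person X
                  [1.0, 1.0, 1.0]])  # person Y
names = ["X", "Y"]

print(match_face(known, names, np.array([0.1, 0.0, 0.0])))  # close to X
print(match_face(known, names, np.array([5.0, 5.0, 5.0])))  # matches nobody
```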
For dlib's face recognition model to be trained, a large number of pictures of people were used. A random vector is generated for each image at the beginning of training, so when the photos are plotted, they are dispersed freely. The model then uses the triplet loss concept, in which it repeatedly picks a reference image (anchor), a positive element (matching image), and a negative element (non-matching image). The ResNet network used in dlib's face recognition has 29 convolutional layers; it is based on the ResNet-34 network, but some layers have been eliminated and the number of filters per layer has been reduced by half.
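The triplet loss objective described above can be sketched as follows (a minimal NumPy version; the margin value is an illustrative assumption, not a figure from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the positive embedding toward the
    anchor and push the negative at least `margin` further away."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # same identity, nearby
negative = np.array([2.0, 2.0])   # different identity, far away

# well-separated triplet: loss collapses to zero
print(triplet_loss(anchor, positive, negative))  # 0.0
```

Training repeatedly adjusts the embedding network so that such triplets incur zero loss, which is what makes the distance-based matching above work.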

Facial expression recognition
Deep convolutional neural network
The strength of DCNNs lies in their layering. A DCNN processes the image's red, green, and blue channels all at once using a three-dimensional neural network. Compared to standard feed-forward neural networks, this drastically lowers the number of artificial neurons needed to process an image. The architecture of a convolutional network typically consists of four types of layers: convolutional layers, pooling layers, nonlinear layers, and fully connected layers. The convolution layer is used to extract low-level characteristics such as edges. It is composed of several convolutional filters (kernels); the output feature map is produced by convolving these filters with the input image, which is expressed as an N-dimensional matrix. Second, the pooling layer decreases the feature resolution, making features more resistant to noise and distortion. Pooling can be done in two ways: maximum pooling and average pooling. Third, a non-linear "trigger" function indicates different identification of probable features on each hidden layer of the network. This non-linear triggering may be implemented using a range of specialized functions, such as ReLUs and continuous (non-linear) trigger functions. The last, fully connected layers compute the class scores over the entire original image.

Because there are so many variables to tune in a DCNN, training a large DCNN model can be challenging. A vast network frequently needs a huge amount of training data, as overfitting can occur when training with small or insufficient quantities of data. It can also be challenging to get enough data for the DCNN to be properly trained in some scenarios; in certain situations, a large volume of data is simply not available. Needless to add, network structures have nowadays grown from a dozen layers to hundreds of layers while trying to achieve high performance and compensate for the lack of data.
For instance, DCNNs have moved from AlexNet with 8 layers to ResNet with 152 layers. Thus, training a deep network with numerous parameters demands massive computational resources and would be impossible on smaller devices with limited processors and memory. Nonetheless, transfer learning and knowledge distillation have proved able to address both challenges.

Transfer learning
Transfer learning is a machine learning approach in which a model that has been trained for a specific task is reused for a new one. In classic deep learning scenarios, a reused model can perform poorly in a new context because it is biased toward the original data. Transfer learning for deep learning is a procedure in which a network is first trained on a benchmark dataset, and the best-learned network features (the network's weights and structures) are then transferred to a second network that is trained on a target dataset.

Knowledge distillation
The main goal of knowledge distillation (KD) is to produce a smaller model (student) capable of solving the same issue as a larger model (teacher), with the caveat that the student model must outperform the regular traditional model [33]. KD uses a teacher-student architecture. The output layer of a neural network is set up so that a "softmax" activation function yields the probabilities of the classes being classified. According to [34], the general equation for such an output layer is

y_i = exp(x_i / T) / Σ_j exp(x_j / T)

where j runs over the classes, x_i is the logit, and y_i is the class probability. T is the temperature value, typically 1 [34]; a higher temperature produces a softer probability distribution over the classes. The basic goal of distillation is to move knowledge from the large model to the smaller model by transferring these probabilities. In [34], the student model was trained to produce correct labels in addition to the teacher's soft labels by using a weighted average of two objective functions: the first is the cross entropy with the correct (hard) labels, and the second is the cross entropy with the soft labels. The distillation part is carried by the custom loss function in (1),

L_KD = T² · KL(y_s, y_t)    (1)
Here, T is the temperature value, y_s and y_t are the softened targets of the student and teacher models, and KL is the built-in Kullback-Leibler (KL) divergence loss; the normal cross-entropy loss on the hard labels forms the other term. A hyperparameter weights the average of the two loss functions. Different methods can be used to apply the knowledge distillation framework to networks; in this paper, we employ offline distillation. Figure 2 shows our proposed technique for FER on masked faces. Our contribution is to incorporate the knowledge distillation method into transfer learning between a teacher model, the full-face FER DCNN, and a student model, the masked-face FER DCNN, by combining the loss between the teacher's soft labels and the student's soft labels with the loss between the dataset's hard labels and the student's hard labels.
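The softened softmax and the combined loss can be sketched in NumPy as follows (alpha = 0.1 and T = 10 match the values used later in the experiments; the logit vectors are purely illustrative):

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax with temperature: higher T gives a softer distribution."""
    z = logits / T
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, hard_label, alpha=0.1, T=10.0):
    """Weighted sum of hard-label cross entropy and T^2-scaled KL
    divergence between softened teacher and student outputs, as in (1)."""
    s_soft = softmax_t(student_logits, T)
    t_soft = softmax_t(teacher_logits, T)
    s_hard = softmax_t(student_logits, 1.0)
    ce = -np.log(s_hard[hard_label])                 # student (hard-label) loss
    kl = np.sum(t_soft * np.log(t_soft / s_soft))    # distillation loss
    return alpha * ce + (1.0 - alpha) * T**2 * kl

student = np.array([1.0, 2.0, 0.5])
teacher = np.array([1.2, 2.5, 0.1])
loss = kd_loss(student, teacher, hard_label=1)
print(round(loss, 4))
```

Note how raising T flattens the distribution, which is exactly what lets the teacher's secondary class probabilities carry information to the student.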

A. Full face detection, recognition, and expression recognition
Once the dataset has been obtained, hybrid sampling is employed to ensure equality across all classes in the dataset. Face detection is performed on the dataset in order to keep the face region alone and remove any extraneous data; unmasked faces make up this new dataset. Deep neural network algorithms are then utilized to extract the pertinent information prior to recognition. To perform FER on the full face, we use a DCNN model. The suggested model is as follows: a feature extraction part comprised of three sets of convolutional layers, each followed by a pooling layer and a dropout layer. Each convolutional set has two cascading convolutional layers with their corresponding activation layers. For bigger and deeper networks, this is typically a good approach, because several stacked convolutional layers can create more complex features from the input volume before the pooling process. The network then ends with two fully connected layers, for a total of six convolutional layers and two fully connected layers. This approach greatly reduces the number of trainable parameters. A result of the successive convolutions in each layer is that the output becomes increasingly sensitive to small changes in the input, which ultimately helps classification.
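A minimal Keras sketch of this layout (the paper specifies the layer arrangement, the 'he_normal' initializer, and 48×48 grayscale inputs; the filter counts, dense width, and dropout rate here are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fer_dcnn(num_classes=7):
    model = models.Sequential([layers.Input(shape=(48, 48, 1))])
    # three convolutional sets: two stacked conv layers, then pooling + dropout
    for filters in (32, 64, 128):          # widths are an assumption
        model.add(layers.Conv2D(filters, 3, padding="same",
                                activation="relu",
                                kernel_initializer="he_normal"))
        model.add(layers.Conv2D(filters, 3, padding="same",
                                activation="relu",
                                kernel_initializer="he_normal"))
        model.add(layers.MaxPooling2D(2))
        model.add(layers.Dropout(0.25))    # frequent dropout against overfitting
    model.add(layers.Flatten())
    # two fully connected layers close the network
    model.add(layers.Dense(128, activation="relu",
                           kernel_initializer="he_normal"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_fer_dcnn()
n_conv = sum(isinstance(l, layers.Conv2D) for l in model.layers)
print(n_conv)  # six convolutional layers, as in the described architecture
```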

B. Masked face detection, recognition, and expression recognition
Taking into consideration that masks are a type of facial occlusion, face recognition under occlusion can be handled in two ways: techniques that characterize the face despite the occlusions, and techniques that reconstruct the occluded portions of the face in order to recover the ideal analysis conditions. To perform FER, we use an offline knowledge distillation method with a teacher-student architecture in which the student has the same size as the teacher. The teacher is a 'homemade' pre-trained model that learned FER on full faces. However, knowledge distillation is not used here to reduce the size of the teacher model, but rather to transfer the knowledge that the teacher acquired from training on full faces in order to guide the student in learning on masked faces. Hence, knowledge distillation is used for transfer learning from teacher to student. The transfer is not performed by assigning the teacher network's weights to the student, but through the soft labels' loss. This choice allows the student to initialize and update its weights based on both the hard labels from the dataset and the soft labels from the teacher model. We do not initialize the model with the teacher's weights, since the new task differs from the task solved by the teacher network, and we only want the teacher to guide the student through training. This is similar to how humans now attempt to recognize facial expressions with masks. Finally, the model classifies masked human faces according to the seven basic facial expressions (disgust, fear, happiness, sadness, anger, surprise, and neutral).

Dataset
In this study we used the facial expression recognition 2013 (FER2013) [35] database, which consists of 35,887 grayscale images of 48×48 resolution, split into 28,709 training images, with the validation and test sets including 3,589 images each. Each of the 35,887 data points has been assigned one of seven emotion labels. To compensate for data challenges, we performed data augmentation to train the model to be invariant to small transformations: rotation up to 45 degrees, width and height shifts up to 0.1, horizontal flips, and zooming up to 0.2. We also normalized the images (pixel values), as unnormalized values can impact the neural network. Figure 3 shows a part of the dataset. Since no dataset of masked faces for FER was available, we created our own based on the same FER2013 dataset and the MaskTheFace script created by Anwar and Raychowdhury [36], a tool that adds occlusion to faces in images, mainly in the form of masks. To apply the mask, it uses a dlib-based face landmark detector to find facial features. Figure 4 shows an example of the results obtained.
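A small sketch of the augmentation and normalization step (pure NumPy; the full pipeline additionally applies the rotations, shifts, and zoom listed above, e.g. via a Keras image generator):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Randomly flip a 48x48 grayscale image horizontally and
    normalize pixel values from [0, 255] to [0, 1]."""
    if rng.random() < 0.5:
        image = image[:, ::-1]            # horizontal flip
    return image.astype(np.float32) / 255.0

img = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)
out = augment(img)
print(out.shape, float(out.min()) >= 0.0, float(out.max()) <= 1.0)
```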

Figure 3. Sample FER2013 images for the seven classes: anger, disgust, fear, happiness, neutral, sadness, surprise

Experiments
Experiment 1: full face model
In this stage we start by detecting all faces in the images, using dlib's default face detection model, which is based on HOG and linear SVMs. Second, we use dlib's pre-trained model, incorporated in the face_recognition library, to generate the 128-dimensional embeddings for our dataset, i.e., the faces that can be recognized by the program. Next, the library's recognition function uses the KNN algorithm to match and classify faces. Once the face detection and recognition phases are prepared, the next step is FER. We built a DCNN model, using the 'he_normal' kernel initializer for the network weights and the NAdam optimizer for optimization, with a learning rate of 0.001. Moreover, to face the main deep learning challenges presented above, we added frequent dropouts between the convolutional sets to avoid overfitting and improve generalization. We also used early stopping to avoid the overfitting or underfitting caused by choosing the wrong number of epochs: too many epochs can overfit the training dataset, while too few can underfit. Early stopping solves this problem because we can specify a large number of training epochs and the model will stop once performance ceases to improve. We conditioned early stopping on the validation accuracy metric.
Another key element is using the best weights from training. By default, the model weights obtained at the final training step are adopted. However, early stopping provides the option of restoring the model weights from the epoch with the best value of the monitored quantity, so we set restore_best_weights to true.
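The early-stopping logic with best-weight restoration can be sketched in plain Python (the patience value and the simulated accuracy curve are illustrative; in Keras this is the EarlyStopping callback with monitor='val_accuracy' and restore_best_weights=True):

```python
def train_with_early_stopping(val_accuracies, patience=3):
    """Walk through per-epoch validation accuracies; stop after `patience`
    epochs without improvement and report the best epoch (whose weights
    would be restored)."""
    best_acc, best_epoch, wait = -1.0, -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:          # performance ceased to improve
                break
    return best_epoch, best_acc

# simulated run: accuracy peaks at epoch 3, then plateaus
history = [0.40, 0.55, 0.62, 0.69, 0.68, 0.67, 0.66]
print(train_with_early_stopping(history))  # (3, 0.69)
```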

Experiment 2: masked face FER
For the masked faces, the teacher and student architectures are the same as the full-face model. Masked face FER without KD: in this experiment, we ran a model without knowledge distillation directly on the masked faces dataset, using the same steps as before. Masked face FER with KD: here the teacher model, which had already learned FER on full faces, does not need to be trained again. A new Distiller(student, teacher) class is created, which overrides the model methods compile(), train_step, and test_step. The distiller takes the trained teacher model, the student model to be trained, the student loss function, the distillation loss function, the selected temperature, and an alpha factor to weight the student and distillation losses. In the train_step method, we perform a forward pass of both the teacher and the student, compute the loss by weighting the student_loss and distillation_loss by alpha and 1 − alpha respectively, and perform the backward pass. Note that since only the student weights are updated, we compute gradients only for the student weights. The test_step method evaluates the student model on the provided dataset. Finally, we instantiate the distiller and compile it with the required losses and hyperparameters, setting alpha to 0.1 and the temperature to 10.
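The heart of such a train_step can be sketched without Keras, for a toy student whose logits are its free parameters (alpha = 0.1 and T = 10 as above; the learning rate, logits, and step count are illustrative; the gradient of the T²-scaled KL term with respect to the student logits is T·(p_s − p_t), and that of the hard-label cross entropy is p_s − y):

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distill_step(student_logits, teacher_logits, hard_label,
                 alpha=0.1, T=10.0, lr=0.01):
    """One gradient step on alpha*CE(hard) + (1-alpha)*T^2*KL(teacher, student).
    Only the student's parameters (here: its logits) are updated."""
    y = np.zeros_like(student_logits); y[hard_label] = 1.0
    p_hard = softmax(student_logits)          # student at T=1
    p_s = softmax(student_logits, T)          # softened student
    p_t = softmax(teacher_logits, T)          # softened teacher (fixed)
    # gradients of each loss term with respect to the student logits
    g_ce = p_hard - y
    g_kd = T * (p_s - p_t)                    # d/dz of T^2 * KL(p_t || p_s)
    grad = alpha * g_ce + (1.0 - alpha) * g_kd
    return student_logits - lr * grad

def total_loss(student_logits, teacher_logits, hard_label, alpha=0.1, T=10.0):
    p_s, p_t = softmax(student_logits, T), softmax(teacher_logits, T)
    ce = -np.log(softmax(student_logits)[hard_label])
    kl = np.sum(p_t * np.log(p_t / p_s))
    return alpha * ce + (1.0 - alpha) * T**2 * kl

student = np.array([0.0, 0.0, 0.0])
teacher = np.array([1.0, 3.0, 0.5])
before = total_loss(student, teacher, hard_label=1)
for _ in range(50):
    student = distill_step(student, teacher, hard_label=1)
after = total_loss(student, teacher, hard_label=1)
print(after < before)  # gradient steps reduce the combined loss
```

The same weighting of student_loss by alpha and distillation_loss by 1 − alpha is what the Distiller's train_step applies to the real DCNN's weights.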

Results
In this section we present the results obtained after testing our models. The FER models are evaluated with different evaluation metrics. Both the teacher and the student models were trained for up to 100 epochs, and it took around 6-8 h to train them on the FER2013 dataset and the masked FER2013, respectively. Full face FER, teacher results: after evaluating our experiment, we obtained the results in Figure 5, which shows the behavior of the model's accuracy and loss during training and testing over 100 epochs. The epoch history shows that accuracy gradually increased and reached 69% on both the training and validation sets, with the model stopping before epoch 100 to avoid overfitting. The confusion matrix shown in Figure 6 indicates that our model performs best on the happiness class, followed by surprise, neutral, sadness, anger, fear, and finally disgust. One reason disgust comes last is that this class has less data. For fear, as seen in the confusion matrix, a major part of the test set is misclassified as surprise, sadness, or even anger. This makes sense, since fear can exhibit a mixture of all these emotions at once, with each human reacting differently.
The classification report shown in Table 1 confirms what the confusion matrix states: happiness has the highest precision, recall, and F1-score. The average accuracy of the model is 69%. To test the impact of our approach, we also ran a model without knowledge distillation. The student had the same structure as the teacher model and was trained directly on the masked faces dataset; it reached an accuracy of 46%. Results are shown in Table 2.
Masked face FER with KD, student results: after evaluating our experiment, we obtained the results in Figure 7, which shows the behavior of the model's accuracy and loss during training and testing over 160 epochs. The epoch history shows that accuracy gradually increased and reached 61% on both the training and validation sets, with the model stopping early to avoid overfitting. The best-recognized class was again happiness. The model was trained using all 28,709 images in the training set, then tested against 3,500 validation images, and accuracy was reported on 3,589 test images. On the full face dataset and the masked face dataset, we attained accuracy rates of about 69% and 61%, respectively. Table 3 compares the output of our model with a few earlier studies on the full-face FER2013.

CONCLUSION
In this article, we proposed a method to perform FER on masked faces based on a teacher-student architecture using knowledge distillation in transfer learning. The study showed that using both the distillation loss and the student loss as a form of transfer learning can be beneficial, as it upgraded the model accuracy from 46% for a regular DCNN to 61% for the student model. Hence, based on a teacher with 69% accuracy that learned facial expressions on full faces, the student managed to score 61% on masked faces (lower face occluded). Moreover, comparing the teacher model with other major models run on the same dataset, our model showed significant accuracy, knowing that the FER2013 dataset poses a greater challenge than other FER datasets, as it encompasses difficult naturalistic conditions. In fact, human performance on this dataset is estimated at 65.5%. For future work, we plan to explore context to improve our performance in occluded FER.