Multiple face mask wearer detection based on YOLOv3 approach

The coronavirus disease 2019 (COVID-19) is a highly infectious disease caused by the SARS-CoV-2 coronavirus. To break the transmission chain of SARS-CoV-2, governments have made it compulsory for people to wear a mask in public places to prevent COVID-19 transmission. Hence, automated face mask detection is crucial to facilitate the monitoring process and ensure that people wear face masks in public. This project aims to develop automated face and face mask detection for multiple people by applying the deep learning-based object detection algorithm you only look once version 3 (YOLOv3). The YOLOv3 object detection algorithm was concatenated with different backbones, including ResNet-50 and Darknet-53, to develop the face and face mask detection models. Datasets were collected from online resources, including Kaggle and GitHub, and the images were filtered and labelled accordingly. The models were trained on 4,393 images and evaluated based on precision, recall, mean average precision, and detection time. In conclusion, DarkNet53_YOLOv3 was chosen over the ResNet50_YOLOv3 model for its good accuracy, with a mAP of 95.94%, and its fast detection speed, with a detection time of 50 seconds on 776 images.


INTRODUCTION
A newly discovered disease, called the coronavirus disease 2019 (COVID-19), was first identified in Wuhan City, Hubei Province, China. It is a highly infectious disease caused by a novel coronavirus (SARS-CoV-2) and has caused approximately 173,780,789 confirmed cases and 3,737,818 deaths worldwide (as of 5 June 2021) [1]. According to the WHO, transmission of the virus can occur through droplet transmission, close contact, airborne transmission, and fomite transmission [2]. During droplet transmission, respiratory droplets or secretions are exhaled by an infected patient when he or she coughs or sneezes. When another person comes into close contact (within 1 meter) with the infected patient, the droplets can enter the eyes, nose, or mouth of that person, leading to infection [3]. Airborne transmission normally occurs during aerosol-generating medical procedures [2]. People can also become infected with the coronavirus by touching the eyes, nose, or mouth after touching a contaminated surface. To minimize the transmission of SARS-CoV-2, various preventive measures have been implemented, including surface disinfection, improved hand hygiene, and social distancing. As COVID-19 spreads through droplet transmission and close contact, wearing a mask, especially in public, is crucial to break the chain of transmission of the virus. In general, the types of mask used in fighting COVID-19 are surgical masks, filtering facepiece respirators (FFP), and non-medical masks. To protect people from infection, governments in many countries have made wearing a mask in public compulsory. Surveillance of the public to ensure that people are wearing face masks in accordance with the law is essential to reduce the risk of transmission efficiently.
However, human surveillance is burdensome when monitoring large groups of people for face mask wearing in public, and a contactless manner between the surveillant and the public must be maintained. Therefore, it is necessary to develop automated face mask detection for public places, especially crowded ones, to facilitate the monitoring process and ensure that people are wearing face masks. Deep learning is a subfield of machine learning and has achieved significant development with its deep artificial neural networks in recent years [4]. The vast improvement of deep learning has brought notable performance to visual recognition systems involving image classification, image localization, and object detection [5]. However, plain deep neural networks are unable to handle multidimensional input such as images efficiently, as the number of parameters during training becomes impractically large [6]. As an alternative, the convolutional neural network (CNN) was introduced to interpret image data and perform image classification tasks in several areas such as agriculture [7], [8], medical imaging [9]-[12], and security and surveillance [13]-[16]. It has been used extensively in visual recognition for its great capability in feature extraction and classification of single-object images [17]. Beyond image classification, several deep learning-based object detection approaches use a CNN as the core network for identifying the class of a detected object, for example, region-based CNN (R-CNN) [18], Fast R-CNN [19], Faster R-CNN [20], you only look once (YOLO) [21], and the single shot detector (SSD) [22]. R-CNN requires heavy computation to perform object detection and is time-consuming. Both Fast R-CNN and Faster R-CNN perform well in accuracy; however, their speed is compromised, which makes them unsuitable for real-time detection [23].
In YOLO, object detection is performed with a single neural network in a single evaluation, and it is fast and accurate [17]. YOLOv2 outperformed other detection frameworks, including Fast R-CNN, Faster R-CNN, and SSD, with a mean average precision (mAP) of 78.6 and a speed of 40 frames per second (FPS), providing an excellent trade-off between speed and accuracy [23]. Compared with YOLOv2, YOLOv3 improved its architecture, providing better accuracy and better performance on small-object detection [24], [25].
In general, deep learning-based approaches for face mask detection can be generalized into: i) image classification; and ii) object detection. In image classification, a CNN is established to distinguish between face mask and non-face mask images. For instance, Militante and Dionisio [26] used the CNN model InceptionV3 through a transfer learning approach for classifying face mask images together with social distancing. A dataset of 20,000 images of face mask wearing and non-face mask wearing was collected from Bing Search application programming interfaces (APIs) and taken at the University of Antique. Training achieved an accuracy of 97%, and the system was able to sound an alarm upon detecting a person who did not wear a mask or observe social distancing. The study in [27] also proposed InceptionV3 for face mask classification. The simulated masked face dataset (SMFD), with a total of 1,570 images, was used, consisting of 785 simulated masked facial images and 785 unmasked images. In that study, the last layer of the InceptionV3 model was removed and replaced by 5 additional layers, with a softmax activation function as the last layer for classifying mask-wearing or non-mask-wearing persons. A comparison was made between the proposed model and other machine learning and deep learning models, including VGG-16, VGG-19, Xception, MobileNet, and MobileNetV2. InceptionV3 outperformed the other models, achieving an accuracy and specificity of 100% during testing. Loey et al. [28] proposed a hybrid of deep transfer learning and machine learning. ResNet-50 was used as the first component for feature extraction in the proposed model, and classical machine learning methods, including decision trees, support vector machines (SVM), and ensemble methods, were used for classification.
Three datasets were used: the real-world masked face dataset (RMFD), the simulated masked face dataset (SMFD), and labeled faces in the wild (LFW), consisting of 10,000 images (with and without masks), 1,570 images (with and without masks), and 13,000 images (with masks only), respectively. The SVM classifier outperformed the other classifiers, achieving 99.64% testing accuracy on RMFD, 99.49% on SMFD, and 100% on LFW, with the least time consumption.
In object detection, the face mask location is detected in the image using an established deep learning-based object detector. For example, Roy et al. [29] used different object detection algorithms, including YOLOv3, YOLOv3Tiny, SSD, and Faster R-CNN, for the detection of faces and face masks. The models were trained using a dataset of 3,000 images, where 678 images are from Kaggle medical face mask datasets, 757 images consist of close-up faces, and the remaining 1,565 images are from Google. As a result, YOLOv3Tiny was the best fit for the application, with a mAP of 56.27% and 138 FPS, where accuracy and speed are well balanced. Loey et al. [30] used a ResNet-50 deep transfer learning model for feature extraction and YOLOv2 for medical face mask detection. A total of 1,415 images were used for training, with 853 images from a face mask dataset and 682 images from a medical mask dataset. The average precision reached 81%, and the Adam optimizer was used in that study. In addition, the works in [31], [32] proposed face mask detection based on YOLOv3. Bhuiyan et al. [31] used a total of 600 images, with 300 masked and 300 non-masked images, and achieved an accuracy of 96% with a mean average precision of 0.96. Ren and Liu [32] proposed a YOLOv3-based model called Face_mask Net and compared four loss functions: intersection over union (IoU), generalized IoU (GIoU), distance IoU (DIoU), and complete IoU (CIoU). The dataset used has a total of 9,056 masked and non-masked images collected from WIDERFace, masked faces (MAFA), RMFD, web crawlers, and video screenshots. Face_mask Net performed better than the other networks in multi-target detection and was able to maintain accuracy at 99%.
Based on previous studies [29], [31]-[33], improvements can still be made in accuracy as well as in training a more robust model using a bigger dataset. In addition, no studies have compared YOLOv3 models with different backbones, such as DarkNet-53 and ResNet-50, on face mask detection. Thus, this project focuses on formulating YOLOv3 to develop a more robust detection model using a bigger and more reliable dataset to achieve higher accuracy and speed, as well as comparing the performance of models with different backbones. Compared to previous studies, this study uses image samples obtained from six different datasets or studies: face mask detection [34], medical mask [35], face mask dataset [36], face mask detection mask dataset [37], COVID face mask detection dataset [38], and correctly masked face dataset [39].

METHOD
The aim is to develop an automated face and face mask detection model based on the YOLOv3 object detection model using different deep pretrained CNN feature extraction models, namely Darknet-53 and ResNet-50. The performance of each model in face and face mask detection is evaluated based on precision, recall, mean average precision (mAP), and detection time. The general method is illustrated in the block diagram in Figure 1.

Data acquisition
The datasets with masked and non-masked classes were obtained from online resources, including Kaggle and GitHub, as shown in Table 1. In total, there were around 17,418 images from the 6 chosen datasets. As some of the images were collected from the same source, there were some duplicates; therefore, the images were filtered manually beforehand. The selection of images was based on the type of mask worn: only persons wearing surgical masks, fabric masks, or filtering facepiece respirators such as N95 were selected. Masks that covered the full face, such as powered air-purifying respirators, were excluded, and masks with complicated patterns were also filtered out. Irrelevant images, such as close-up facial images and images with low resolution, were removed as well. Figure 2 shows sample images selected for the final dataset.

CNN models development and configuration
In this project, the pretrained ResNet-50 model shown in Figure 3(a) and the Darknet-53 model shown in Figure 3(b) were used, where both models were trained on the ImageNet dataset with 1,000 classes. To concatenate with YOLOv3, as shown in Figure 3(c), the feature extraction layers of ResNet-50 and Darknet-53, excluding the average pooling, fully connected, and softmax layers, were extracted and saved as separate weight files to serve as feature extractors. These feature extractors were concatenated with YOLOv3, resulting in the ResNet50_YOLOv3 and DarkNet53_YOLOv3 object detection models. Before training, the models had to be configured accordingly. In YOLOv3, the input image size must be a multiple of 32 so that it can be downsampled in the detection layers; thus, the input images were set to 416×416. The number of iterations ('max_batches') was set to 4,000, as each of the two classes, 'mask' and 'nomask', has to be trained for a minimum of 2,000 iterations. As it is impractical to pass all images from the dataset through training in one iteration, the images were divided into batches. The batch size was set to 64, so 64 images were fed into training at each iteration. To avoid GPU memory errors, the subdivisions parameter was set to 16, so the GPU processed 4 images at a time as a minibatch. For training, the learning rate was set at 0.001 at the beginning, decreased to 0.0001 at the 3,200th iteration, and to 0.00001 at the 3,600th iteration. This allows the model to learn faster at the beginning of gradient descent and to fine-tune towards the end of the learning process [40]. The number of classes was set to 2, and the number of filters before each detection layer was set according to [(classes + 5) × 3], which is 21 in this project.
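The configuration rules above can be sketched as a short check. This is an illustrative helper only, with function names of our own choosing, not part of the Darknet framework:

```python
def yolo_filters(num_classes: int) -> int:
    # Filters before each YOLO detection layer: (classes + 5) * 3,
    # i.e. 3 anchors, each predicting 4 box coordinates + 1 objectness
    # score + one confidence per class.
    return (num_classes + 5) * 3

def valid_input_size(size: int) -> bool:
    # YOLOv3 input width/height must be a multiple of 32 so the
    # feature maps can be downsampled cleanly in the detection layers.
    return size % 32 == 0

def learning_rate(iteration: int) -> float:
    # Step schedule used in this project: 0.001 initially, then
    # 0.0001 from iteration 3200 and 0.00001 from iteration 3600.
    if iteration >= 3600:
        return 0.00001
    if iteration >= 3200:
        return 0.0001
    return 0.001

print(yolo_filters(2))        # 21 for the two classes 'mask' and 'nomask'
print(valid_input_size(416))  # True
```

With two classes, the helper reproduces the value 21 used in this project, and 416×416 passes the divisibility check (416 = 13 × 32).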
The models were trained with the Darknet framework, an open-source neural network framework written in C and compute unified device architecture (CUDA) that supports both CPU and GPU computation [41]. Overall, the development of the models was done on Google Colab Pro using a Tesla P100-PCIE-16GB GPU.

Model performance evaluation
In object detection, intersection over union (IoU) is used to measure training performance. It is a metric that evaluates how close a predicted bounding box is to the ground truth (the manually labelled bounding box). IoU is defined as the intersection area of the predicted and ground-truth regions divided by the area of their union, which measures the similarity between the two regions [42]. During training, the model learns and evaluates itself based on the IoU to make its predictions as close as possible to the ground truth. In this project, the IoU threshold was set at 0.5.
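As a sketch, the IoU computation described above can be written for axis-aligned boxes given in (x1, y1, x2, y2) corner format. This is an illustrative implementation, not code from the Darknet framework:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two partially overlapping 100x100 boxes:
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143, below the 0.5 threshold
```

A prediction with IoU at or above the 0.5 threshold against a ground-truth box would count as a correct localization in this project.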
To evaluate the performance of the models, precision, recall, and mean average precision (mAP) were calculated. The IoU threshold was set at 0.5 in this study to determine true positives (TP), false positives (FP), and false negatives (FN). If a prediction has an IoU larger than the threshold, it is categorized as a TP, meaning a correct prediction. A prediction is categorized as an FP when its IoU score is less than the threshold. A case is categorized as an FN either when there is no detection at all (no bounding box) or when the object is classified wrongly. Using the numbers of TP, FP, and FN, precision and recall can be calculated. Precision is the percentage of correct positives (TP) among all predictions made and can be used to determine the reliability of the model in classifying samples as positive [43]. Recall is the percentage of correct positive predictions among all the positives in reality (the ground truth) [44]. A high recall indicates that the model is reliable in detecting positive samples. With precision and recall, a precision-recall curve can be plotted, and the area under the curve gives the average precision of the class. The mAP is the mean of the average precisions of the two classes, which are masked and non-masked people in this project [45].
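These definitions follow directly from the TP/FP/FN counts and per-class average precisions. The sketch below uses illustrative helper functions and made-up counts for the usage lines, not results from this study:

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of all predictions made that are correct.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Fraction of ground-truth positives that were detected.
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_per_class):
    # mAP: mean of the per-class average precisions
    # (here the two classes would be 'mask' and 'nomask').
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts for illustration only:
print(precision(tp=90, fp=10))             # 0.9
print(recall(tp=90, fn=30))                # 0.75
print(mean_average_precision([0.9, 0.8]))  # 0.85
```

In practice, average precision per class is obtained from the area under the precision-recall curve rather than a single precision/recall pair; the mean over the two classes then gives the mAP reported in the results.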
Besides measuring the mAP of each model, the detection time was measured as one of the criteria for evaluating the performance of an object detection model. The detection time was determined as the time needed for the model to perform all predictions on the testing set images. By evaluating the speed and accuracy of each model, the better-performing model was determined.
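As a rough arithmetic check on the reported figures (50 seconds for the 776 testing images, from the abstract), the implied per-image latency and throughput can be computed:

```python
# Throughput implied by the reported DarkNet53_YOLOv3 detection time.
images = 776
detection_time_s = 50

per_image_ms = detection_time_s / images * 1000
throughput = images / detection_time_s

print(round(per_image_ms, 1))  # ~64.4 ms per image
print(round(throughput, 1))    # ~15.5 images per second
```

Note this is batch detection time on stored images, not live-video FPS, which the conclusion identifies as future work.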

Performance evaluation
The models were tested on a testing dataset of 776 images, and the results are shown in Table 2. The ResNet50_YOLOv3 model obtained a higher precision of 0.99 compared to DarkNet53_YOLOv3, which obtained a precision of 0.96. This shows that the percentage of correct detections among all detections made is higher for the ResNet50_YOLOv3 model, indicating that it made fewer false positives. However, the ResNet50_YOLOv3 model showed a significantly lower recall than the DarkNet53_YOLOv3 model: DarkNet53_YOLOv3 achieved a recall of 0.96, while ResNet50_YOLOv3 achieved a recall of merely 0.58. The higher recall indicates that the DarkNet53_YOLOv3 model has a better ability to identify and detect the objects in an image [46], while the ResNet50_YOLOv3 model tends to miss detections. Based on the average IoU results, ResNet50_YOLOv3 shows better results, with an average IoU of 83.09%, over the DarkNet53_YOLOv3 model. This indicates that the ResNet50_YOLOv3 model performs better in localization, with predicted bounding boxes closer to the ground truth than those of the DarkNet53_YOLOv3 model. On the other hand, the average precisions of both classes, mask and no mask, are higher for DarkNet53_YOLOv3, resulting in a higher mAP of 95.94% for the detection of mask and no mask. Meanwhile, the average precision of the no-mask class in the ResNet50_YOLOv3 model is significantly lower at 78.95%, resulting in a lower mAP of 84.40%. Although DarkNet53_YOLOv3 has lower precision, it has a much higher recall, a stronger ability to detect persons with or without masks, and a better mAP. Looking into the architectures of the feature extractors, Darknet-53 has a deeper and more complex network with over 40 million trainable parameters [47], which supports its accuracy compared to ResNet-50, which has over 23 million trainable parameters [48].
Interestingly, although DarkNet-53 has a deeper network, it still runs much faster than ResNet-50 and gives a promising balance of speed and accuracy. To sum up, for this application of face and face mask detection, despite the better localization of the ResNet50_YOLOv3 model, it compromises on detection speed. Therefore, the DarkNet53_YOLOv3 model is better, as it outperforms the ResNet50_YOLOv3 model in detection speed, detection ability, and accuracy, making it more suitable for use in commercial cameras in the future.

Face and face mask detection
This section shows example output images from the detections of the ResNet50_YOLOv3 model, shown in Figure 5(a), and the DarkNet53_YOLOv3 model, shown in Figure 5(b). The green bounding box refers to the ground truth, while the magenta box refers to the bounding box predicted by the model. Overall, the models can classify the classes properly. The ResNet50_YOLOv3 model tends to have lower confidence when predicting the objects, and there are some missed detections by this model.

CONCLUSION
In summary, YOLOv3-based deep learning object detection was applied in this study to develop automated face and face mask detection for multiple people that can be employed in commercial cameras in the future. The project started with data acquisition, where the dataset was collected from online resources, followed by filtering and labelling of the images. The feature extraction layers from pretrained ResNet-50 and Darknet-53 were extracted and concatenated with the YOLOv3 detection layers, resulting in two models, ResNet50_YOLOv3 and DarkNet53_YOLOv3. Both models were trained on more than 4,000 images on Google Colab Pro using the Darknet framework. The models were evaluated through precision, recall, mean average precision, and detection time. Overall, the DarkNet53_YOLOv3 model outperforms the ResNet50_YOLOv3 model, showing well-balanced speed and accuracy, with a mean average precision of 95.94% and a much shorter detection time on the images. It also shows a better ability to detect faces in an image compared to ResNet50_YOLOv3, which has a lower recall. Thus, the DarkNet53_YOLOv3 model was chosen as the better model, as it is fast and reliable in detecting faces with or without masks. This can help the authorities monitor whether people are wearing face masks in public to minimize the transmission of COVID-19. Fast and accurate automated face mask detection is crucial during the pandemic to replace manual surveillance, which is inefficient for large groups of people, and to provide a contactless monitoring method. This project can be improved by training on additional classes, such as incorrectly worn masks where the nose or mouth is not covered, and the model can be integrated into webcam software or a commercial camera to develop a prototype. Evaluation can then be done based on frames per second on live video, with improvements made accordingly.