Backbone search for object detection for applications in intrusion warning systems

ABSTRACT


INTRODUCTION
In recent years, object detection algorithms based on deep learning have achieved remarkable results. They have been widely used in practical systems such as intrusion warning systems [1], [2]. In these systems, object detection algorithms detect human intrusions early by processing images from surveillance cameras [3]. In large systems, local processing is essential because it keeps latency low and reduces the load on the central server [4]. However, implementing deep learning algorithms on such surveillance camera devices is generally difficult because of hardware constraints [5]. Therefore, researchers have proposed methods to compact deep learning models so that they can be deployed efficiently in practice.
In general, there are two approaches to deep learning model compaction: network architecture search (NAS) and model compression [6]. In the first approach, an optimal network architecture is searched for in a large space of candidate architectures. This approach has proven effective in cases where the designer has almost no prior knowledge of the data [7]. Popular methods for object detection include detection NAS (DetNAS) [8], the method in [9], feature pyramid network NAS (FPN-NAS) [10], auto feature pyramid network (Auto-FPN) [11], and structural-to-modular NAS (SM-NAS) [12]. The common property of these methods is that they give relatively good performance because the preset criteria are optimized during the search. In return, their computational cost, including training time and power consumption, is often very high, and the search typically has to run on powerful hardware. In the second approach, the design of the neural network requires the programmer to have knowledge of the data, such as its variety, based on previous studies [13]. A regular neural network model, called the base model, is then established without regard to the compaction criteria. To optimize the network, model compression methods such as pruning and quantization are applied. These methods inspect the base model to remove unnecessary parameters or to approximate parameters with fewer-byte representations. Methods of this type are usually not specific to any task but are rather general in machine learning. Well-known methods include filter pruning via geometric median (FPGM) [14], the methods in [15] and [16], quantization-aware training (QAT) [17], and binarized neural networks (BNN) [18]. Compared with neural architecture search, model compression methods significantly reduce the computational cost while maintaining the optimization criteria.
In this work, to save computational resources, we propose a simple neural architecture search method for object detection. Drawing on previous literature, we design an object detection model based on the faster region-based convolutional neural network (Faster R-CNN) [19] and ResNet50 [20]. Inspired by EfficientNet [21], we search for a width scale for the base ResNet50 model so that the scaled model meets criteria on mAP, the number of parameters, and the number of multiply-accumulate (MAC) operations. The reason we search only for the width scale is that many efficient object detection models use the feature pyramid network (FPN) structure [22], so scaling only the width should not affect it. Our main motivation is to propose a simple, low-cost method that still meets the preset requirements.

RELATED WORKS
2.1. Faster R-CNN
The premier model within the R-CNN family is Faster R-CNN, introduced in 2015 [19]. Successive versions of the R-CNN family have undergone enhancements primarily focused on computational efficiency, incorporating different training phases, reducing inference time, and boosting overall performance measured by mean average precision (mAP). These networks typically consist of: a) an algorithm to find "bounding boxes", or possible object positions, in the image; b) a stage that extracts the features of the object, usually using a convolutional neural network (CNN); c) a classification network to predict the class of the object; and d) a regression layer to make the coordinates of the bounding boxes more accurate.
Faster R-CNN combines two modules, as shown in Figure 1. The first module uses a deep neural network (DNN) to propose regions, called the region proposal network (RPN), and the second module is the Fast R-CNN model that uses the proposed regions. Fast R-CNN takes the suggested regions from the RPN to determine the object corresponding to each anchor. Faster R-CNN also has a CNN backbone network, a region of interest (ROI) pooling layer, and a fully-connected layer, followed by two sub-branches that perform the two tasks of classifying objects and finding the best bounding box based on regression.
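As a concrete illustration of this two-module structure, the sketch below assembles a Faster R-CNN from a backbone, an RPN anchor generator, and an ROI pooler using torchvision's detection API. It is a minimal sketch under stated assumptions: the anchor sizes and the number of classes are illustrative placeholders, not necessarily the exact configuration used in this paper.

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Module 1: a CNN backbone that produces the shared feature map
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc
backbone.out_channels = 2048  # FasterRCNN reads this attribute

# Module 2: the RPN (region proposals) plus the Fast R-CNN head (ROI pooling + classifier)
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

model = FasterRCNN(backbone,
                   num_classes=6,  # e.g. 5 intrusion poses + background (illustrative)
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
```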
The RPN is trained with the multi-task loss

L(\{p_i\}, \{t_i\}) = \frac{1}{N}\sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N}\sum_i p_i^* L_{reg}(t_i, t_i^*)

where i is the index of the anchor in the mini-batch and p_i is the predicted probability of anchor i being an object. The ground-truth label p_i^* is 1 if the anchor is positive and 0 if the anchor is negative. t_i is a 4-dimensional vector representing the predicted bounding-box coordinates, and t_i^* is a 4-dimensional vector representing the coordinates of the ground-truth box corresponding to a positive anchor. N is the number of anchor boxes taken into consideration, and \lambda is the balancing coefficient (\lambda = 1). L_{cls} is the cross-entropy loss over the two classes (object and non-object), and L_{reg} uses the smooth L1 loss.
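The following is a minimal PyTorch sketch of this loss, assuming the anchors have already been sampled and their regression targets precomputed; the function name and tensor layout are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rpn_loss(obj_logits, box_deltas, labels, box_targets, lam=1.0):
    """Sketch of the multi-task loss above.

    obj_logits:  (N,) objectness scores for the sampled anchors
    box_deltas:  (N, 4) predicted box regression offsets t_i
    labels:      (N,) ground-truth labels p_i* (1 = positive, 0 = negative)
    box_targets: (N, 4) ground-truth regression targets t_i*
    """
    n = labels.numel()
    # Classification term: cross-entropy over object / non-object
    l_cls = F.binary_cross_entropy_with_logits(
        obj_logits, labels.float(), reduction="sum") / n
    # Regression term: smooth L1, counted only for positive anchors (p_i* = 1)
    pos = labels == 1
    l_reg = F.smooth_l1_loss(box_deltas[pos], box_targets[pos], reduction="sum") / n
    return l_cls + lam * l_reg
```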

PDIWS dataset
PDIWS [2] is a rare thermal image dataset for human detection in intrusion warning systems. The special feature of this dataset is that it is artificial data, synthesized by an image-editing method in the gradient domain. Each image in the dataset is composed of a human subject with an intrusive pose and a background. The detailed specification of the dataset is described in Table 1. Accordingly, the dataset includes 2,000 training images and 500 test images, a ratio of 80:20. Note that the subjects in these two subsets are completely distinct, which ensures randomness when testing. In addition, the subjects are also scaled in different proportions before being blended into the background, so the dataset reflects various distances from the camera to the subject. Figure 3 shows examples from the dataset: Figure 3(a) shows an example of a creeping subject, Figure 3(b) a crawling subject, Figure 3(c) a stooping subject, Figure 3(d) a climbing subject, and Figure 3(e) a subject of the other class. In this paper, we divide the training set of the PDIWS dataset into two parts: training (1,750 images) and validation (250 images), a ratio of 7:1, evenly distributed across classes. The purpose of this split is to observe and evaluate the training process and to stop at the right time to avoid overfitting.

Object detection architecture
The proposed object detection architecture is based on Faster R-CNN. We divide this architecture into three parts for convenience of description: feature extractor, region proposal network, and output classifier (including region of interest (ROI) pooling and classifier). First, the input image is passed through the feature extractor to discover the features of the image. In the proposed architecture, the feature extractor is chosen as ResNet50 because it is popular in the object detection task, along with visual geometry group (VGG) networks. In the region proposal network, the feature map at each pixel is further fed into a convolution layer to generate feature vectors of length 256. The number of feature vectors is equal to the number of predefined anchor boxes. Then, a fully-connected layer is used to predict the object's presence and bounding box for each anchor box. These proposals are then passed to the output stage, where overlapping regions of the same object are removed. An important point in the RPN is the anchor boxes. They are usually preselected according to the sizes and proportions of the objects contained in the training data. In the proposed method, 15 anchor box types are used, with aspect ratios of 0.5, 1.0, and 2.0.
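The per-location prediction just described (a 256-dimensional feature per pixel, followed by objectness and box outputs for each anchor) can be sketched as a small convolutional head, where 1×1 convolutions play the role of the per-location fully-connected layer. This is a minimal sketch with assumed shapes; only the 256 intermediate channels and 15 anchors per location follow the text.

```python
import torch
import torch.nn as nn

class SimpleRPNHead(nn.Module):
    """Sketch of the RPN head described above: a 3x3 conv produces a 256-d feature
    per location, then two 1x1 convs predict objectness and box offsets per anchor."""
    def __init__(self, in_channels=2048, mid_channels=256, num_anchors=15):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(mid_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        t = torch.relu(self.conv(feature_map))
        return self.objectness(t), self.bbox_deltas(t)

# Illustrative use on a 2,048-channel feature map (e.g. 7x7 for a 224x224 input)
head = SimpleRPNHead()
scores, deltas = head(torch.randn(1, 2048, 7, 7))
```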
Finally, the output stage with ROI pooling and the classifier is activated. The predicted bounding boxes capture the corresponding regions in the 2,048 feature maps. The non-maximum suppression (NMS) algorithm is then used to remove boxes containing the same object. The filtered boxes are then reshaped to 7×7 by the ROI pooling layer before passing through the classifier. The classifier is a fully-connected layer that predicts the class of the object. The loss functions for the proposed method are the same as in the original Faster R-CNN, described in section 2.1.
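A minimal sketch of this output stage (NMS filtering, 7×7 ROI pooling, then classification) using torchvision's operators is shown below; the IoU threshold, number of classes, and tensor shapes are illustrative assumptions rather than the exact configuration of the proposed model.

```python
import torch
from torchvision.ops import nms, roi_align

def output_stage(feature_map, boxes, scores, num_classes=6, iou_thr=0.7):
    """Sketch: filter proposals with NMS, pool 7x7 ROI features, then classify."""
    keep = nms(boxes, scores, iou_threshold=iou_thr)   # drop overlapping boxes
    kept_boxes = boxes[keep]
    # roi_align expects rows of [batch_index, x1, y1, x2, y2] in image coordinates
    rois = torch.cat([torch.zeros(len(kept_boxes), 1), kept_boxes], dim=1)
    pooled = roi_align(feature_map, rois, output_size=(7, 7),
                       spatial_scale=1.0 / 32)          # feature map is 1/32 of the input
    flat = pooled.flatten(start_dim=1)                  # (num_boxes, 2048*7*7)
    classifier = torch.nn.Linear(flat.shape[1], num_classes)  # illustrative head
    return classifier(flat), kept_boxes
```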

Backbone width search algorithm
This is the most important part of the proposed method. As argued in the introduction, scaling the neural network in width is advantageous for inserting feature pyramid networks, which have recently been widely adopted. Therefore, we propose a simple solution to scale the width, which essentially changes the number of output channels of each hidden layer in the neural network. Our algorithm is described in Algorithm 1. Accordingly, we predetermine the number of trials (num_of_trial) and select width scales following the uniform distribution U(0,1). At each value of the width scale, we rebuild the backbone by multiplying the number of output channels of each hidden layer by the width scale. The resulting network architecture is saved to check for duplicates in subsequent attempts. The model is then trained from scratch with the new architecture, after which the model, together with the loss dictionary, mAP, number of parameters, and number of MACs, is saved. In this paper, we choose 10 trials for simplicity. The configuration parameters for the training process are presented in the results section.
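The following is a minimal sketch of this search loop. The helper names (build_backbone, train_and_eval) are illustrative placeholders for the backbone construction and training/evaluation routines described in the text; they are not part of the paper's code.

```python
import random
import torch

def search_backbone_width(num_of_trial, build_backbone, train_and_eval, seed=0):
    """Sketch of Algorithm 1.

    build_backbone(width_scale) -> backbone whose per-layer output channels are
                                   multiplied by width_scale
    train_and_eval(model)       -> dict with loss history, mAP, #params, #MACs
    """
    random.seed(seed)
    tried_scales, results = set(), []
    while len(results) < num_of_trial:
        w = round(random.uniform(0.0, 1.0), 2)   # width scale drawn from U(0, 1)
        if w in tried_scales:                    # skip duplicate architectures
            continue
        tried_scales.add(w)
        model = build_backbone(w)                # rebuild backbone with scaled channels
        metrics = train_and_eval(model)          # train from scratch, then evaluate
        results.append({"width_scale": w, **metrics})
        torch.save(model.state_dict(), f"model_w{w:.2f}.pth")
    return results
```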

Evaluation criteria
3.3.1. Mean average precision (mAP)
In binary classification, the model assigns prediction scores to samples, and categorization into classes depends on a set threshold: if the score meets or exceeds the threshold, the sample is assigned to the positive class; otherwise, it is assigned to the negative class. We use the terms precision and recall to evaluate the performance of the binary classification for each class, as shown in Figure 5, where precision = TP/(TP + FP) and recall = TP/(TP + FN), with TP, FP, and FN denoting true positives, false positives, and false negatives, respectively.

Figure 5. Illustration of binary classification predictions
Of course, we want a model to have both high precision and high recall. To take both metrics into account, we use the precision-recall curve, which plots precision (y-axis) against recall (x-axis) for different probability thresholds. We then compute the average precision (AP) by averaging the precision values on the precision-recall curve for recall in the range [0, 1], i.e., AP = \frac{1}{N}\sum_{r} p(r), where N denotes the number of thresholds and p(r) denotes the precision when the recall is equal to r. The mAP is calculated as the mean of the AP values over all classes.
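A short sketch of this AP definition is given below. It assumes 11 evenly spaced recall levels with interpolated precision, which is one common convention; the paper does not specify the exact interpolation scheme.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Sketch of the AP above: mean precision over recall levels in [0, 1]
    (11-point interpolation assumed)."""
    precisions = np.asarray(precisions)
    recalls = np.asarray(recalls)
    levels = np.linspace(0.0, 1.0, 11)          # recall levels r in [0, 1]
    interpolated = []
    for r in levels:
        mask = recalls >= r
        # p(r): highest precision achieved at recall >= r, 0 if none
        interpolated.append(precisions[mask].max() if mask.any() else 0.0)
    return float(np.mean(interpolated))
```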

Number of parameters and multiply-accumulate operations (MACs)
For a fully-connected layer with a input neurons and b output neurons, the number of parameters and MACs is determined by

params = a × b + b,  MACs = a × b.

For a 2D convolutional layer with input of size a×a×n, output of size b×b×m, and kernel size k×k, the two terms are defined by

params = k × k × n × m + m,  MACs = k × k × n × m × b × b.
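The two counting rules above can be written as small helper functions; the numerical example at the end (the first 7×7 convolution of ResNet50 on a 224×224 RGB input) is purely illustrative.

```python
def fc_counts(a, b):
    """Parameter and MAC counts for a fully-connected layer (a inputs, b outputs)."""
    params = a * b + b          # weights + biases
    macs = a * b
    return params, macs

def conv2d_counts(a, n, b, m, k):
    """Counts for a 2D convolution: input a*a*n, output b*b*m, kernel k*k."""
    params = k * k * n * m + m  # weights + biases
    macs = k * k * n * m * b * b
    return params, macs

# Illustrative example: 7x7 conv, 3 -> 64 channels, 224x224 input, 112x112 output
print(conv2d_counts(a=224, n=3, b=112, m=64, k=7))   # (9472, 118013952)
```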

RESULTS AND DISCUSSION
Our experiments are performed on the PDIWS dataset with the following training parameters: number of epochs (100), batch size (2), learning rate (0.005), momentum (0.9), and optimizer (Adam). They are executed using the PyTorch framework on a computer with an Intel i7 12700F CPU, 16 GB RAM, and a 24 GB NVIDIA GeForce RTX 3090 GPU.

We first present the results of the backbone architecture search in Table 3, where mAP_50 is the mAP at IoU=0.5, mAP_75 is the mAP at IoU=0.75, and mAP_coco is the average mAP over IoU values from 0.5 to 0.95 with a step of 0.05. As expected, architectures with small width scales perform worse than larger ones. This can be explained by the small size of the model: the number of neurons is not sufficient to learn the necessary features. Meanwhile, network architectures larger than a certain threshold do not improve performance. Specifically, width scale values greater than 0.5 result in mAP_50 values that are not very different. The mAP value increases gradually as the width scale increases but does not exceed 47.55%; the increase is only approximately 2%. This trend is also present in mAP_75 and mAP_coco. Meanwhile, for width scales below 0.5, the surveyed mAP values drop sharply as the width scale decreases from 0.5 to 0.1. At a scale of 0.1, mAP_50 is only 18.17%, mAP_75 is 16.05%, and mAP_coco is 13.54%, all very poor. In addition, the number of parameters in millions (M) of the searched architectures grows steeply with the width scale (0.55M at a width scale of 0.1 and 55M at 1.0). The number of MAC operations follows a similar trend and varies approximately as an exponential function. A visualization of the tabulated results can be found in Figure 6, where we can easily observe the trend of the criteria as the width scale changes.

In order to choose a suitable scale, some threshold values must be predefined. In this paper, we require the mAP_50 value to be above 45% for a model to be considered good, because this is the near-convergence value of the model. Therefore, from Figure 6, we choose a width scale of 0.5 as the best value, with mAP_50 reaching 45.42%, mAP_75 reaching 43.32%, mAP_coco reaching 40.51%, 13.79M parameters, and 1.89 billion (B) MAC operations. Alternatively, other threshold criteria can be chosen, such as model size if the hardware is constrained or inference time if real-time operation is required.

To increase the credibility of the obtained model and of the proposed method, comparative experiments were conducted. Table 4 summarizes the mAP values (at IoU=0.5), the number of parameters, MACs, and inference time of various methods/models with different backbones. Specifically, the methods/models selected for the survey are well-known ones such as the single shot detector (SSD) [24], fully convolutional one-stage (FCOS) [25], and you only look once version 3 (YOLOv3) [26], combined with popular backbones such as VGG-16 [27], ResNet50 [20], and MobileNet-V2 [28]. The results show that the criteria involve trade-offs, so a metric is needed to compare the methods and choose the best one. Figure 7 shows some example results on the test set with labeled bounding boxes over all classes.

Figure 2. RPN architecture with anchor boxes [19]

Figure 4. The architecture of the proposed backbone width search for object detection, based on Faster R-CNN and ResNet50


Figure 6. Visualization of search results

Figure 7. Examples of predicted bounding boxes over five classes in the test set

Table 2. Proposed ResNet50 architecture with width scale

We add a factor called width scale (w) to modify the ResNet50 architecture as shown in Table 2, with the note that each scaled number of output channels is rounded up to the nearest multiple of 8. The output of the feature extractor is 2,048 feature maps, with the size of each map reduced by 32 times in each dimension compared to the input. These 2,048 feature maps go in two directions: one goes into the region proposal network and the other goes into the ROI pooling layer.
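A small sketch of this channel scaling rule is given below; the example widths are the standard ResNet50 stage widths and are shown purely for illustration.

```python
import math

def scaled_channels(base_channels, width_scale, divisor=8):
    """Sketch of the width scaling rule above: multiply the number of output
    channels by the width scale, then round up to the nearest multiple of 8."""
    return int(math.ceil(base_channels * width_scale / divisor) * divisor)

# Illustrative example on ResNet50 stage widths with width scale 0.5
for c in (64, 256, 512, 1024, 2048):
    print(c, "->", scaled_channels(c, width_scale=0.5))   # 32, 128, 256, 512, 1024
```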

Table 3. Performance comparison between different searched model widths

Table 4. Comparison between the obtained model and other object detection methods