Partial half fine-tuning for object detection with unmanned aerial vehicles

ABSTRACT


INTRODUCTION
Object detection on intelligent machines that integrate cameras with drones or unmanned aerial vehicles (UAVs) has been widely applied in various real-life domains, such as military [1], forest fire detection [2], agriculture [3], security [4], [5], and urban surveillance [6]. This demands that researchers in the field of object detection be able to analyze images obtained from UAVs. Currently, much research has focused on deep learning [7] to overcome various object detection challenges, particularly in UAVs [8].
Advances in deep learning are due to the availability of large-scale data, computing power such as the graphics processing unit (GPU), and continuous research. Since the deep convolutional neural network (deep CNN) was proposed by [9], deep learning methods have outperformed traditional machine learning in the ImageNet large scale visual recognition challenge (ILSVRC) 2010 [10]. These results also encouraged many other proposed architectures, such as the visual geometry group (VGG) network [11], GoogLeNet [12], residual networks (ResNets) [13], [14], cross stage partial network (CSPNet) [15], and EfficientNet [16], which are widely used as backbones for feature extraction in classification and object detection tasks. In object detection, the region-based convolutional neural network (R-CNN) was the first deep learning-based object detection method [17], outperforming traditional detectors such as the deformable parts model (DPM) [18] and SegDPM [19] in the PASCAL visual object classes (PASCAL VOC) 2010 challenge [20]. Other popular methods, such as the two-stage detectors Fast R-CNN [21] and Faster R-CNN [22], and the one-stage detectors you only look once (YOLO) [23]-[28], RetinaNet [29], and single shot multibox detector (SSD) [30], have demonstrated strong performance on benchmarks such as COCO [31], PASCAL VOC, and ImageNet. On average, however, these methods perform poorly when detecting small objects, which are characteristic of the data captured by UAVs. This calls for the right adjustment strategy in the model architecture, or the right training strategy, for object detection tasks in UAVs. Much research has recently been proposed to overcome the challenges of object detection in UAVs. For example, [32] used deformable convolutional layers [33] in the last three stages of ResNet50, and also adopted a cascade architecture and data augmentation to improve model performance when detecting small and dense objects. [34] proposed RRNet with adaptive resampling as an augmentation technique, followed by a regression module to improve bounding-box prediction accuracy. Recent work also involves the transformer architecture, which is widely used in natural language processing tasks [35]. For example, [36] replaced the YOLOv5 detection layer with a transformer-based detection layer, and [37] proposed ViT-YOLO, which combines multi-head self-attention [38] with the original CSP-Darknet backbone of YOLOv4-P7 [27], a bidirectional feature pyramid network (BiFPN) [39], and YOLOv3 head detection layers. ViT-YOLO obtained a mean average precision (mAP) of 39.41% and was among the top results in the VisDrone2021-Det challenge [40]. These results cannot be separated from the availability of annotated data, which is the key to the training process and benchmarking. However, problems arise when the amount of annotated data is insufficient for training: the model can easily overfit, which in turn degrades the performance of the object detection method. In practical situations, annotated data is generally difficult to obtain.
The currently popular solution is the fine-tuning technique [41]. Fine-tuning is a common technique that takes advantage of a model pre-trained on large-scale data and transfers its features to a new task with less data. This technique has been shown to increase the generalization of the model and avoid overfitting [42]-[47], as has been done in [17], [21]-[23], [29], [30] for general object detection tasks and in [34], [36], [37], [48] for object detection tasks with UAVs. However, despite the immense popularity of fine-tuning, no research has focused on studying the precise effects of fine-tuning for object detection tasks with UAVs, for several reasons: i) current work uses fine-tuning without observing the impact of the transferred features or building an efficient fine-tuning strategy; ii) the differences in data sizes and features are crucial to consider in the fine-tuning process. For example, the COCO dataset [31] and the VisDrone dataset [40] differ, as described in Figure 1. We show that partial half fine-tuning can outperform one of the state-of-the-art methods in object detection tasks with UAVs (Section 3.4.3).

RESEARCH METHOD
In this section, we set up several fine-tuning procedures along with a partial half fine-tuning strategy to transfer a set of features learned on a base-task and use them for a target-task, as shown in Figure 2. This consists of three concepts: base-task, target-task, and transfer or fine-tuning. For the base-task, shown in Figure 2, we leverage YOLOv5 [49] pre-trained on the COCO dataset [31] as the base network. The backbone of YOLOv5 consists of 10 blocks, which we denote as l0-l9. A fine-tuning process is then carried out on the target network by training it on the VisDrone dataset [50] to complete the target-task of object detection on UAVs. The details of each fine-tuning strategy, including the partial half fine-tuning techniques that we propose, First half F-T and Final half F-T, are illustrated in Figure 2.
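The layer-freezing mechanics behind these strategies can be sketched in PyTorch. This is a minimal illustration, not the actual training code: `partial_half_finetune` is a hypothetical helper, and the `nn.Linear` blocks are toy stand-ins for the ten YOLOv5 backbone blocks l0-l9.

```python
import torch.nn as nn

def partial_half_finetune(blocks, freeze_first_half=True):
    """Freeze one half of the transferred backbone blocks and randomly
    re-initialize the other half so it is re-learned on the target-task."""
    half = len(blocks) // 2
    frozen = set(range(half)) if freeze_first_half else set(range(half, len(blocks)))
    for i, block in enumerate(blocks):
        if i in frozen:
            for p in block.parameters():
                p.requires_grad = False           # keep the transferred features
        else:
            for p in block.parameters():
                p.requires_grad = True
            for m in block.modules():
                if isinstance(m, (nn.Conv2d, nn.Linear)):
                    m.reset_parameters()          # discard the transferred features

# Toy stand-in for the 10 backbone blocks l0-l9 (the real blocks are YOLOv5 modules)
blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(10)])
partial_half_finetune(blocks, freeze_first_half=True)   # First half F-T
```

Passing `freeze_first_half=False` would give the Final half F-T variant, where l5-l9 keep the transferred features and l0-l4 are re-learned.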

EXPERIMENTAL

Dataset
To validate the effectiveness of the fine-tuning strategies, we used the VisDrone2019-Det dataset [50]. The VisDrone2019-Det dataset is the same as VisDrone2021-Det [40]. The dataset consists of ten object categories (pedestrian, person, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motorcycle) with a total of 10,209 images: 6,471 for training, 548 for validation, and 3,190 for testing. For all the fine-tuning techniques in this research, we train on the VisDrone training set and evaluate on the validation set.
The evaluation uses precision (P) and recall (R):

P = TP / (TP + FP) (1)
R = TP / (TP + FN) (2)

where true positive (TP) is a correct detection of a ground-truth bounding box, false positive (FP) is a detected object that does not match the ground truth, and false negative (FN) is a ground-truth bounding box that is not detected. AP summarizes P and R over the precision-recall curve, as shown in (3):

AP = Σn (Rn+1 − Rn) P̃(Rn+1) (3)

where P̃(Rn+1) is the measured P at recall Rn+1. Then mAP, the average of AP over all class categories in the dataset, is the metric used to measure object detection accuracy, as shown in (4):

mAP = (1/N) Σi APi (4)

where APi is the AP of class i and N is the total number of classes evaluated.
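The metrics above can be computed with a short sketch. The helper names are hypothetical, and the interpolated summation form of AP is assumed for (3):

```python
def precision_recall(tp, fp, fn):
    """Precision (1) and recall (2) from detection counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r

def average_precision(recalls, precisions):
    """Interpolated AP (3): sum of (R_{n+1} - R_n) * P~(R_{n+1})
    over recall levels given in increasing order."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(aps):
    """mAP (4): mean of per-class AP values."""
    return sum(aps) / len(aps)
```

For example, 8 correct detections with 2 false positives and 2 missed ground truths give P = R = 0.8.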

Experimental details
We use the YOLOv5x model [49] pre-trained on the COCO dataset [31] as the base-task, the same as that used in [36]. We transfer features from the pre-trained model to each fine-tuning technique: Common F-T, Frozen F-T, First half F-T, and Final half F-T. We then train each of these techniques on the VisDrone target dataset [50] as an object detection task in UAVs. In the training phase, we use stochastic gradient descent as the optimizer. The framework for the training and validation process uses PyTorch with a Tesla T4 GPU.
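The optimizer setup can be sketched as follows. This is a minimal illustration: the `Linear` model is a stand-in for YOLOv5x, and filtering on `requires_grad` is what keeps frozen blocks out of the update step.

```python
import torch

# Stand-in model; in the paper this would be the fine-tuned YOLOv5x network
model = torch.nn.Linear(8, 4)

# SGD with the momentum and learning rate reported in the experimental setup
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),  # skip frozen parameters
    lr=0.01,
    momentum=0.9,
)
```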

Results and discussion
To reflect the scalability of each fine-tuning strategy, we analyze each fine-tuning approach with an input scale of 640×640 and also include traditional training (Traditional-T), as shown in Section 3.4.1. We then analyze the training results with different input scales, as shown in Section 3.4.2. Finally, we compare the best results in this study with one of the state-of-the-art methods that used the VisDrone validation set for evaluation at the same input scale, as shown in Section 3.4.3.

Analysis of the effect of each fine-tuning strategy
Table 1 presents the validation results for each fine-tuning strategy with an input scale of 640×640. Common F-T achieves an mAP 8.9% greater than Traditional-T. This 8.9% difference indicates that Common F-T increases the generalization of the model more than Frozen F-T, which improves mAP by only 3.5% compared to Traditional-T. The low improvement of Frozen F-T is because the features transferred from the COCO base-task were applied directly to the VisDrone target-task without randomly initialized parameters, and no feature updates occurred during training on the target-task. For partial half fine-tuning, Final half F-T achieves an mAP 9.3% greater than Traditional-T and 5.8% greater than Frozen F-T, and at the same time outperforms Common F-T by 0.4%; these results show that Final half F-T is the best fine-tuning strategy at an input scale of 640×640. However, the mAP of First half F-T is 5% lower than that of Common F-T and 5.4% lower than that of Final half F-T. Research conducted by Yosinski et al. [41] showed a transition process when the features of the base-task are transferred to the target-task. In our study, the low improvement of First half F-T is because the first half of the layers in the target network learns general features that are nonetheless more specific to the COCO dataset than to the VisDrone dataset of the target-task, while Final half F-T shows that the last half of the layers is more general, matching the features of both the COCO base-task and the VisDrone target-task. The details of the detections in Table 1 are described in Table 2, while Figure 3 shows one visualization of detection results on the validation set.

Analysis on different scales
To obtain a more in-depth analysis, we evaluated each fine-tuning technique with different input scales, namely 416×416, 608×608, 832×832, and 960×960, as described in Table 3. Common F-T outperforms Traditional-T, Frozen F-T, First half F-T, and in particular Final half F-T by a margin of 0.1% at the 416×416 and 608×608 scales. However, at the 832×832 and 960×960 scales, Final half F-T is superior to Traditional-T, Frozen F-T, First half F-T, and especially Common F-T by a margin of 0.5% at both scales. These results indicate that Common F-T is slightly better than Final half F-T at smaller input scales; however, in our experiments, Final half F-T is more robust at higher input scales than Common F-T, Traditional-T, Frozen F-T, and First half F-T.

Comparison with the state-of-the-art
In this section, we compare our proposed partial half fine-tuning technique, Final half F-T, with one of the state-of-the-art methods, employed by [48]. We compare against only one previous work because its authors used the same input scale and the VisDrone validation set for evaluation. We focus on the 832×832 scale used in [48]. As described in Table 4, the results of Final half F-T are 20.3% greater than SlimYOLOv3-SPP3-50 and 19.7% greater than YOLOv3-SPP3. These results indicate that our proposed Final half F-T is better than this previous study.

CONCLUSION
In this study, we conducted an experimental analysis of existing fine-tuning approaches and proposed a partial half fine-tuning strategy consisting of two techniques: First half F-T and Final half F-T. In the evaluation, we used the VisDrone validation set. We show that Final half F-T achieves an mAP 9.3% greater than Traditional-T, 5.8% greater than Frozen F-T, and 0.4% greater than Common F-T at an input scale of 640×640, and that it is also more accurate at higher scales, such as 832×832 and 960×960. We then compared Final half F-T with one of the state-of-the-art methods, based on mAP at IoU 0.5 and the same 832×832 input scale, and showed that its results are 20.3% greater than SlimYOLOv3-SPP3-50 and 19.7% greater than YOLOv3-SPP3. This means our technique is better than the other fine-tuning techniques and also better than one of the state-of-the-art methods for object detection with UAVs.

Figure 1.
Figure 1. Images from the datasets, (a) VisDrone dataset and (b) COCO dataset

Figure 2. Fine-tuning strategies, (a) the base model in the base-task uses a pre-trained model, (b) Common F-T, (c) Frozen F-T, (d) First half F-T, and (e) Final half F-T. With the following details:
- Common fine-tuning (Common F-T): transfers features from the pre-trained model to the target network, with the parameters of blocks l0-l9 in the target network updated when trained on the target dataset for the target-task, and the remaining parameters initialized randomly. The details of the illustration are explained in Figure 2(b).
- Frozen fine-tuning (Frozen F-T): transfers features from the pre-trained model to the target network with blocks l0-l9 frozen, meaning that no parameter changes occur in the target network during the training process on the target dataset for the target-task. The details of the illustration are explained in Figure 2(c).
- First half fine-tuning (First half F-T): transfers features from the pre-trained model to the target network with blocks l0-l4 frozen and blocks l5-l9 initialized randomly, meaning that during training on the target dataset no parameter changes occur in l0-l4, while l5-l9 is updated. The details of the illustration are explained in Figure 2(d).
- Final half fine-tuning (Final half F-T): the opposite of First half F-T; features are transferred from the pre-trained model to the target network with blocks l5-l9 frozen and l0-l4 initialized randomly, so that during training on the target dataset parameter changes occur only in l0-l4. The details of the illustration are explained in Figure 2(e).
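As a compact summary of the strategies above, the freezing scheme can be written as a small lookup. This is a sketch: the block indices follow the l0-l9 notation, and which of the non-frozen blocks are randomly re-initialized differs per strategy as described above.

```python
# Which backbone blocks (l0-l9) are kept frozen under each fine-tuning strategy,
# summarizing Figure 2 (block indices refer to the YOLOv5 backbone).
FROZEN_BLOCKS = {
    "Common F-T":     set(),              # all blocks updated during training
    "Frozen F-T":     set(range(0, 10)),  # l0-l9 kept fixed
    "First half F-T": set(range(0, 5)),   # l0-l4 kept fixed, l5-l9 re-learned
    "Final half F-T": set(range(5, 10)),  # l5-l9 kept fixed, l0-l4 re-learned
}

def trainable_blocks(strategy, n_blocks=10):
    """Indices of the backbone blocks that are updated during training."""
    return [i for i in range(n_blocks) if i not in FROZEN_BLOCKS[strategy]]
```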

Partial half fine-tuning for object detection with unmanned aerial vehicles (Wahyu Pebrianto)

Training uses momentum 0.9, batch size 16, a learning rate of 0.01, and 50 training epochs with 640×640 input.

Figure 3.
Figure 3. Some visualization results from our research

Table 1 .
Evaluation results with VisDrone validation set

Table 2 .
Detection results with VisDrone validation set

Table 3 .
Validation result at different scales using the VisDrone validation set

Table 4 .
Results of comparison with one of the state-of-the-art methods