Performance analysis of different DCNN models in remote sensing image object detection

In recent years, deep learning, especially deep convolutional neural networks (DCNN), has made great progress. Many researchers use different DCNN models to detect remote sensing targets, and each model has its own advantages and disadvantages. In this paper, we use YoloV4 as the detector and “fine-tune” various mainstream deep convolutional neural networks on two large public remote sensing data sets, the LEVIR data set and the DOTA data set, to compare the networks. We analyze the reasons why “fine-tuned” convolutional neural networks sometimes perform poorly, and point out the difficulties of object detection in optical remote sensing images. To improve the detection accuracy of optical remote sensing targets, in addition to “fine-tuning” the convolutional neural network, we also provide a variety of adaptive multi-scale feature fusion methods. In addition, to address the large number of parameters generated by deep convolutional neural networks, we provide a method to save storage space.


Development of deep convolutional neural network
As a feature extractor, the backbone network plays an important role in the performance of the detection model. With the development of network architecture in recent years, there are many excellent backbone networks. Therefore, it is of great significance to study the influence of different deep convolutional neural networks on target detection of optical remote sensing images. In the following, we will summarize the popular backbone networks in recent years, including most of the mainstream deep convolutional neural networks.

YoloV4
YoloV4 is an efficient target detection method. It mainly consists of four parts: Input, Backbone, Neck and Prediction. Backbone is used for feature extraction, Neck is used for multi-scale feature fusion, and Prediction is used for classification and bounding box prediction. YoloV4 adopts CSPDarknet53 as the feature extraction network, and Neck combines spatial pyramid pooling (SPP) [29], the feature pyramid network (FPN) [19] and the path aggregation network (PAN) [20] for feature fusion. Prediction generates anchor boxes through a clustering method, uses binary cross-entropy loss for category prediction, and uses dimension clusters to predict bounding boxes. The network flow chart of YoloV4 is shown in Fig. 1.

FPN
FPN fuses the deep feature information with the shallow feature information through upsampling, thereby constructing the feature pyramid structure of different sizes.
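The top-down fusion described above can be illustrated with a minimal sketch. It assumes single-channel feature maps stored as nested lists, nearest-neighbour 2× upsampling, and element-wise addition as the fusion operator; the function names are hypothetical, and an actual Neck may use concatenation and extra convolutions instead.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def top_down_fuse(deep, shallow):
    """Upsample the deep (coarse) map and add it to the shallow (fine) map.

    Assumption: the shallow map is exactly twice the size of the deep map.
    """
    up = upsample2x(deep)
    return [[u + s for u, s in zip(up_row, sh_row)]
            for up_row, sh_row in zip(up, shallow)]


deep = [[1, 2], [3, 4]]                 # coarse, semantically strong level
shallow = [[1] * 4 for _ in range(4)]   # fine, high-resolution level
fused = top_down_fuse(deep, shallow)
```

Repeating this step from the deepest level upward yields one fused map per scale, which is the pyramid that the detection heads consume.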

Dimensional clusters
YOLO uses dimension clusters to predict the bounding box, as shown in Fig. 3. First, the YOLO model divides the image into an S × S grid, and each grid cell is assigned three bounding boxes. Then, four coordinate values are predicted for each bounding box: t_x, t_y, t_w, t_h, where (t_x, t_y) is the predicted coordinate offset and (t_w, t_h) is the scale. The central coordinates (b_x, b_y) and the width and height (b_w, b_h) of the prediction box can be calculated according to Equations (1)-(4), where p_w and p_h are the width and height of the prior (anchor) box, and (c_x, c_y) is the offset of the grid cell in which the bounding box is located. Finally, the confidence is obtained from the intersection over union (IoU) between the prediction box and the ground-truth box, and low-confidence or redundant prediction boxes are eliminated by non-maximum suppression (NMS).
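The decoding and suppression steps can be sketched as follows. The sketch assumes that Equations (1)-(4) take the standard YOLOv3 form, b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·exp(t_w), b_h = p_h·exp(t_h), which matches the symbol definitions in the text; the function names and the greedy NMS variant are illustrative choices, not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Decode raw outputs (tx, ty, tw, th) into (bx, by, bw, bh) in grid units.

    Assumed form (standard YOLOv3): the centre is a sigmoid offset inside the
    grid cell, and the size is an exponential scaling of the prior box.
    """
    tx, ty, tw, th = t
    cx, cy = cell    # offset of the grid cell
    pw, ph = prior   # width and height of the prior (anchor) box
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 5.0))
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
           [0.9, 0.8, 0.7])
```

With all raw outputs equal to zero, the decoded box sits at the centre of its cell and has exactly the prior's size, which is why well-chosen priors make the regression easier.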

Fine-tuning the backbone structure in YoloV4
With the development of DCNNs, new architectures have emerged in recent years, and we try to use them for feature extraction in YoloV4. At present, new and mainstream DCNN architectures, such as Inception, SENet, MobileNet and EfficientNet, cannot be directly applied to YoloV4. Their structural parameters differ, which makes their network outputs unsuitable for multi-scale feature fusion in the Neck stage, so we need to adjust these DCNN frameworks. The fine-tuning of the network structure also differs from one DCNN to another. For VGG16, VGG19, ResNet50, and ResNet101, we removed the last feature pooling layer and the fully connected layer, and then connected the network directly to the neck network. VGG networks mainly increase the network depth by stacking convolutional layers to improve detection accuracy. For example, the backbone network of VGG16 consists of 5 convolutional blocks and 5 max-pooling layers. The convolutional blocks contain (2, 2, 3, 3, 3) convolutional layers respectively, each with a 3 × 3 kernel and stride 1, as shown in Fig. 4a. The backbone network of VGG19 is also composed of 5 convolutional blocks and 5 max-pooling layers, and its blocks contain (2, 2, 4, 4, 4) convolutional layers respectively, each with a 3 × 3 kernel and stride 1. ResNet networks use residual modules to fuse shallow information with deep information to solve the degradation problem of deep networks. For example, the backbone network of ResNet50 consists of a 7 × 7 convolutional layer with stride 2 and padding 3, a max-pooling layer and 4 residual blocks, where the 4 residual blocks contain (3, 4, 6, 3) resBottleneck modules respectively, as shown in Fig. 4b. The resBottleneck module uses convolution with stride 2 for downsampling, and performs residual learning across three convolutional layers, whose kernels are 1 × 1, 3 × 3 and 1 × 1 respectively.
For InceptionV3 and InceptionV4, to connect the neck network for feature fusion, we changed the valid convolution of the Inception module to same convolution. Valid convolution is a convolution without padding, and same convolution is a convolution with zero padding that preserves the spatial size. The convolutional layers of the Inception backbone network do not have zero padding, so the output feature maps are not suitable for multi-scale feature fusion networks. Therefore, we use zero padding for the 3 × 3 convolutional layers in the Inception backbone. The backbone network of InceptionV3 consists of 5 convolutional layers, 2 max-pooling layers, 10 Inception blocks and 2 Reduction blocks, as shown in Fig. 4c. The backbone network of InceptionV4 consists of 3 convolutional layers, 17 Inception blocks and 2 Reduction blocks. The Inception block uses convolution kernels of different sizes to extract features from the upper layer separately, and then concatenates them to obtain better results. The Reduction block uses a 3 × 3 convolutional layer with stride 2 and a max-pooling layer to downsample the feature maps of the upper layer. For the ResNet evolution series, the ResNet modules in the four residual blocks of the ResNet backbone network are replaced by the ResNeXt module, SENet module, SKNet module and Res2Net module, respectively. Figure 4d shows the backbone network of SENet, which consists of a 7 × 7 convolutional layer with stride 2 and padding 3, a max-pooling layer and 4 residual blocks. The four residual blocks contain (3, 4, 6, 3) seBottleneck modules, and each seBottleneck module embeds an SE block in the resBottleneck module of ResNet to model the interdependence between channels. The SE block contains a global average pooling layer and two fully connected layers.
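The effect of padding on feature-map size can be checked with the usual output-size formula, out = floor((n + 2p − k) / s) + 1. The sketch below is only an illustration: the 80-pixel input is a hypothetical value, and the helper names are not from the paper.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Zero padding that preserves the size for an odd kernel and stride 1."""
    return (kernel - 1) // 2


n = 80  # hypothetical feature-map width

# Unpadded 3x3 convolution: every layer trims two pixels.
unpadded = conv_out_size(n, kernel=3)

# Zero-padded 3x3 convolution: the size is preserved.
padded = conv_out_size(n, kernel=3, padding=same_padding(3))

# 7x7 stem convolution with stride 2 and padding 3 on a 640-pixel input,
# as used by the ResNet-style backbones above: the size is exactly halved.
stem = conv_out_size(640, kernel=7, stride=2, padding=3)
```

Because unpadded layers shrink the map by a few pixels each, the backbone outputs stop being exact 1/8, 1/16 and 1/32 fractions of the input, which is what prevents them from lining up with the upsampled maps in the Neck.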
For the lightweight network models SqueezeNet, ShuffleNetV2, MobileNetV2-V3 and GhostNet, we remove the last GlobalPool, Conv2d and FC layers. Figure 4e shows the backbone network of MobileNetV2, which mainly consists of a 3 × 3 convolutional layer with stride 2, 7 mbBottleneck blocks and a 1 × 1 convolutional layer. The mbBottleneck block reduces network parameters and improves network speed by splitting the 3 × 3 standard convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution is a grouped convolution in which the number of groups equals the number of channels, and the pointwise convolution is a 1 × 1 convolution.
Some other networks are designed by combining the advantages of the branches above. For example, EfficientNet improves detection accuracy by scaling up network depth, network width, and input image resolution. As shown in Fig. 4f, EfficientNetB0 consists of a 3 × 3 convolutional layer with stride 2, 16 effBottleneck modules and a 1 × 1 convolutional layer. The effBottleneck module mainly consists of an SE block and a depthwise separable convolutional block, and is used to scale the depth and width of the network model.

Adaptive multi-scale feature fusion method
YoloV3 uses the top-down FPN structure for feature fusion, as shown in Fig. 5a. YoloV4 adds a bottom-up PAN structure on top of FPN for feature fusion, as shown in Fig. 5b. To better fuse features of different scales, we design a new adaptive spatial feature fusion module (ASFF for short) inspired by spatial feature fusion [30]. The ASFF module allows the network to learn how to spatially filter out the useless information of other layers and retain only the useful information for fusion. We first use the proposed ASFF module after FPN+PAN to further fuse features, as shown in Fig. 5c. We also use the ASFF module directly after the FPN to fuse the features, as shown in Fig. 5d. Experiments show that placing the ASFF module after the FPN improves the detection accuracy more than placing it after the PAN, which indicates that the combination of the FPN and ASFF modules fuses features more effectively.
The proposed adaptive spatial feature fusion module can be represented by Equations (5) and (6). Equation (5) means that, for each level, the features of all other levels are adjusted to the same shape, and feature fusion is performed according to learnable weight parameters. Specifically: 1) for the level-l feature map of shape (c, h, w), we first upsample or downsample the feature maps of the remaining levels to resize them to the level-l output size; 2) then, the three adjusted feature maps are concatenated and a 1 × 1 convolutional layer is used for dimensionality reduction to obtain a 3 × h × w feature map, which is normalized by the softmax activation function to obtain the weight maps α, β and γ; 3) finally, the weight maps α, β and γ are multiplied with the three feature maps respectively and the products are summed to obtain the fused feature map.
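Steps 2) and 3) can be illustrated with a minimal sketch. It assumes single-channel maps that have already been resized to the level-l size, and it takes the three weight logits as an input rather than computing them with a 1 × 1 convolution; `asff_fuse` and `softmax3` are hypothetical names, and the weighted-sum form is inferred from the description of step 3).

```python
import math


def softmax3(a, b, c):
    """Numerically stable softmax over three logits."""
    m = max(a, b, c)
    ea, eb, ec = math.exp(a - m), math.exp(b - m), math.exp(c - m)
    s = ea + eb + ec
    return ea / s, eb / s, ec / s


def asff_fuse(x1, x2, x3, logits):
    """Fuse three h x w maps with per-pixel softmax weights.

    x1, x2, x3: feature maps already resized to the same h x w.
    logits:     3 x h x w weight logits (in the real module these would come
                from a 1 x 1 convolution over the concatenated maps).
    """
    h, w = len(x1), len(x1[0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            alpha, beta, gamma = softmax3(
                logits[0][i][j], logits[1][i][j], logits[2][i][j])
            fused[i][j] = (alpha * x1[i][j] + beta * x2[i][j]
                           + gamma * x3[i][j])
    return fused


# With equal logits the weights are all 1/3, so the fusion is a plain mean.
mean_like = asff_fuse([[3.0]], [[6.0]], [[9.0]], [[[0.0]], [[0.0]], [[0.0]]])

# A strongly dominant logit makes the fusion select a single level.
select_like = asff_fuse([[3.0]], [[6.0]], [[9.0]], [[[50.0]], [[0.0]], [[0.0]]])
```

Because the weights are computed per position and sum to one, the module can suppress a level at locations where it carries conflicting information while still using it elsewhere.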

Receptive field improvement method based on dilated convolution
Optical remote sensing images contain many small targets. As the neural network deepens, the features of small targets are easily lost. To improve the accuracy of small target detection, we replace the standard convolution of the fifth stage of CSPDarknet53 with dilated convolution. Dilated convolution can maintain a higher feature-map resolution without increasing the number of parameters. For an input image of 640 × 640, the output feature layer of the fifth stage of CSPDarknet53 is reduced to 1/32 of the original image size in each dimension. After we replaced the standard convolution in the fifth stage of CSPDarknet53 with dilated convolution, the resolution was only reduced to 1/16 of the original image size in each dimension, and the process of feature learning was deepened at the same time. The modified structure of the network is shown in Fig. 6. To simplify understanding, only the modification to the basic network part is drawn, and the structure of the entire detector is not drawn again.
Figure 5a is the pyramid structure, Fig. 5b is the combination structure of pyramid and path aggregation, Fig. 5c is the combination structure of pyramid, path aggregation and adaptive spatial feature fusion, Fig. 5d is the combination structure of pyramid and adaptive spatial feature fusion.
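The resolution figures in this section follow from simple arithmetic, sketched below. The dilation rate of 2 is an assumption for illustration, since the text does not state the rate used; the helper names are hypothetical.

```python
def feature_map_size(input_size, output_stride):
    """Side length of the feature map for a given total downsampling factor."""
    return input_size // output_stride


def effective_kernel(kernel, dilation):
    """Extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def kernel_weights(kernel):
    """Weights per input/output channel pair; independent of the dilation."""
    return kernel * kernel


standard_size = feature_map_size(640, 32)  # stage 5 with standard convolution
dilated_size = feature_map_size(640, 16)   # stage 5 with dilated convolution

# Assumed dilation rate of 2: a 3x3 kernel covers a 5x5 window
# while still holding only 9 weights.
window = effective_kernel(3, dilation=2)
weights = kernel_weights(3)
```

Doubling the side length of the last feature map gives each small target four times as many feature cells, while the enlarged window compensates for the receptive field that the removed downsampling step would have provided.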

Model parameter reduction method based on grouped convolution
The storage space required by the fine-tuned deep convolutional neural network model affects its applicability. To reduce the storage space, we follow the idea of the MobileNet module and replace the traditional convolution with Depthwise convolution [8], where the number of groups is equal to the greatest common divisor of the number of input channels and the number of output channels. Compared with standard convolution, Depthwise convolution can substantially reduce the amount of computation (roughly in proportion to the number of groups) without affecting the accuracy, which reduces the number of model parameters and improves the operation speed. Finally, Pointwise convolution is used to solve the problem of "non-flow of information" between groups in Depthwise convolution. This operation is equivalent to a regularization of the features extracted by grouped convolution, which is more conducive to the flow of information. The modified structure of the backbone network is shown in Fig. 7. To simplify understanding, the structure of the entire detector is not drawn again.
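A rough parameter count shows where the saving comes from. The sketch below ignores biases and normalization layers, assumes that the pointwise convolution maps the output channels back onto the same number of channels, and uses hypothetical channel counts (256 to 512); none of these specifics are taken from the paper.

```python
import math


def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a (grouped) convolution, biases ignored."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


def grouped_plus_pointwise_params(c_in, c_out, kernel):
    """Grouped convolution with groups = gcd(c_in, c_out), then a 1x1 convolution.

    Assumption: the pointwise layer keeps c_out channels.
    """
    groups = math.gcd(c_in, c_out)
    grouped = conv_params(c_in, c_out, kernel, groups=groups)
    pointwise = conv_params(c_out, c_out, 1)
    return grouped + pointwise


c_in, c_out, k = 256, 512, 3  # hypothetical layer
standard = conv_params(c_in, c_out, k)
reduced = grouped_plus_pointwise_params(c_in, c_out, k)
ratio = standard / reduced
```

For this layer the grouped part alone shrinks by a factor equal to the number of groups, and the 1 × 1 convolution that restores cross-group mixing then dominates the remaining cost.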

Experimental results and discussion
All experiments in this section use a Linux 18.04 system, an RTX 3090 graphics card, an Intel(R) Core(TM) i7-10700K (3.8 GHz) CPU, 64 GB of memory, the PyTorch [31] deep learning framework, and software including Python 3.8, CUDA 11.0, and Torch 1.7. This section compares the detection performance of different DCNNs on optical remote sensing images. The selected DCNN models include the VGG series, the Inception series, ResNet and its improved series, the lightweight series, EfficientNet, Darknet53, and CSPDarknet53. The network parameters were set as follows: the input image size is 640, the number of iteration cycles is 100, the batch size is 16, the optimizer is SGD, the learning rate is 0.01, and the momentum is 0.937. The data set

Qualitative analysis
First of all, we qualitatively analyze the performance of various deep convolutional neural networks on the optical remote sensing LEVIR data set, which consists of more than 22,000 images of 800 × 600 pixels, covering most types of ground features of the human living environment. There are three target types in the data set: airplanes, ships and oil tanks, including 4724 airplanes, 3025 ships and 3279 oil tanks. The average number of objects per image is 0.5. Since many networks are compared in the experiments, the test results of some DCNN models are randomly selected here for display and qualitative analysis. For the detection model parameters, we set the confidence threshold at 0.001 and the IoU threshold at 0.5. The detection results of different DCNN models on the LEVIR data set are shown in Fig. 8. The test results of the randomly selected DCNN models show that DCNNs obtain good detection results on the LEVIR data set, with accurate localization. When the target size changes to some extent, YOLO detectors with different DCNNs can still obtain good detection results. However, some DCNNs produce missed detections and false detections, which indicates that the corresponding DCNNs need to further improve their classification ability.
Second, we qualitatively analyze the performance of each DCNN on the optical remote sensing DOTA data set. The DOTA data set contains a total of 21,046 images of 15 target types, with approximately 188,000 targets, and the image size is 800 × 800 pixels. The detection model parameters were set to a confidence threshold of 0.001 and an IoU threshold of 0.5. The detection results of different DCNN models on the DOTA data set are shown in Fig. 9.
The test results of the randomly selected DCNN models on the DOTA data set show that it is sometimes difficult to detect targets accurately by directly using an existing deep convolutional neural network, and there are many missed and false detections. The main reasons for the poor detection results are as follows: 1. After a small target passes through the multi-layer convolutional neural network, the effective positioning information is seriously lost, so it is difficult to obtain accurate results simply by deepening and widening the network. 2. The contrast between some targets and the surrounding environment is relatively low, and the classification ability of some deep convolutional neural networks is insufficient, so it is difficult to classify the targets well. 3. The DOTA data set contains many relatively dense small targets, and the dense distribution poses a challenge to the accuracy of target detection.

Quantitative analysis
Qualitative detection results only give an intuitive impression and are not sufficiently convincing. Therefore, quantitative analysis must be conducted to judge the advantages and disadvantages of each deep convolutional neural network. Mean average precision (mAP) is a common criterion for target detection in optical remote sensing images. AP is the area under the curve with precision on the vertical axis and recall on the horizontal axis. mAP is the average of the APs of all categories. The test time and storage space are very important for the real-time performance and practical application of the target detector, so we also use them as performance indicators. The quantitative results of each DCNN model on the LEVIR and DOTA data sets are shown in Tables 1 and 2, where mAP@0.5 means that, with the IoU threshold set to 0.5, the AP of each category is calculated and then averaged over all categories to obtain the mAP. mAP@[.5:.95] represents the average mAP over different IoU thresholds (from 0.5 to 0.95 with a step size of 0.05). As can be seen from Table 1, the mAP of most DCNNs on the LEVIR data set exceeds 80%. In the LEVIR data set, the target types are few, the contrast with the background is relatively high, and the interference from the surrounding environment is relatively low, so it is not difficult to distinguish the background from the target. At the same time, the number of targets in LEVIR is small and the target distribution is sparse, which is conducive to target detection. In addition, we found that increasing the depth or width of a DCNN does not necessarily improve the accuracy of target detection, as shown by VGG19, InceptionV4 and ResNet101. Meanwhile, for some improved residual convolutional networks, such as ResNeXt50 and SK-ResNet50, the target detection accuracy is not improved much.
In addition, some of the latest DCNN models, such as GhostNet and Res2Net50, do not necessarily improve the target detection accuracy. Lightweight DCNN models, such as ShuffleNet and MobileNet, effectively improve the detection speed, but their detection accuracy is not as good as that of CSPDarknet. It can be seen from Table 2 that, on the DOTA data set, many DCNN models do not achieve ideal results, and some networks achieve poor results. On mAP@.5, none of the DCNN models except CSPDarknet53 exceeded 70%, and some of the DCNN models have an mAP@.5 of less than 65%. The main reasons for this result are as follows: 1. Compared with the LEVIR data, the DOTA data are more complex, with a larger number of targets and smaller target sizes. 2. The contrast between DOTA targets and the surrounding environment is relatively low and the targets are more densely distributed, making it more difficult to distinguish them. Similarly, increasing the depth or width of the convolutional neural network does not necessarily improve the accuracy of target detection, as shown by InceptionV4 and ResNet101. Meanwhile, for some improved residual convolutional networks, such as ResNeXt50 and SK-ResNet50, the target detection accuracy is not improved much. Lightweight convolutional neural networks, such as ShuffleNet and MobileNet, effectively improve the detection speed, but their detection accuracy is not as good as that of CSPDarknet. In addition, some of the latest DCNN models, such as GhostNet and Res2Net50, do not necessarily improve the target detection accuracy. This further supports the conclusions drawn from Table 1.
It can be seen from Tables 1 and 2 that, to achieve high-precision performance on optical remote sensing data sets, it is not enough to increase the depth or width of the network or simply change the structure of the convolutional neural network. The detection accuracy that a convolutional neural network achieves on optical remote sensing data sets is related not only to the network depth, width and structure, but also to the network feature fusion mode.
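The metrics defined above can be sketched in a few lines. The sketch assumes that each detection has already been matched to the ground truth at a given IoU threshold, and it uses the all-point interpolation convention (precision made monotonically non-increasing before integrating), which is one common choice and not necessarily the one used for Tables 1 and 2; all function names are hypothetical.

```python
def average_precision(detections, num_gt):
    """Area under the precision-recall curve for one category.

    detections: list of (score, is_true_positive) pairs.
    num_gt:     number of ground-truth objects of this category (> 0).
    """
    dets = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in dets:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make the curve non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_ap(ap_per_class):
    """mAP: the average of the per-category APs."""
    return sum(ap_per_class) / len(ap_per_class)


def iou_thresholds():
    """IoU thresholds used by mAP@[.5:.95]: 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
thresholds = iou_thresholds()
```

mAP@[.5:.95] is then obtained by repeating the matching and the mAP computation at each of the ten thresholds and averaging the results, which rewards precise localization more than mAP@0.5 does.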

Analysis of the test results of the proposed method on the LEVIR data set
In this section, we compare each of the schemes we designed in detail. In the next section, we compare our method with other commonly used detection methods based on deep convolutional neural networks, such as Faster R-CNN and CenterNet. First, we conducted an experimental comparison of the three proposed multi-scale adaptive spatial feature fusion structures on the LEVIR data set, and the experimental results are shown in Table 3. The basic network of all the following comparative experiments was CSPDarknet53 and the detector was YoloV4. The network parameters were set to 100 iteration cycles, a batch size of 16, the SGD optimizer, and the network learning rate was [...] Table 4. The multi-scale feature fusion method uses the best-performing structure, which combines the feature pyramid network with adaptive spatial feature fusion. For comparison, we use CSPDarknet53 as the baseline, which obtains an mAP@0.5 of 0.882 with a pre-trained model size of 420.8 M. As can be seen from the second row of Table 4, replacing the standard convolution of the fifth stage of CSPDarknet53 with dilated convolution improves the detection accuracy by 1.2%. This is because dilated convolution can maintain a higher feature-map resolution without increasing the number of parameters, thereby improving the accuracy of small target detection. Then, as can be seen from the third row of Table 4, replacing the standard convolution of CSPDarknet53 with grouped convolution greatly reduces the storage space. This is because, compared with standard convolution, grouped convolution substantially reduces the number of parameters and the amount of calculation without affecting accuracy. Finally, after the various schemes are integrated, the accuracy is improved by 2.6% and the storage space is reduced by 111.7 M compared with the original method.

Performance comparison with other DCNN-based detection methods
In this section, we compare our proposed method with several popular DCNN-based object detection methods. Our proposed method for optical remote sensing image object detection uses YoloV4 as the detector, fine-tunes the CSPDarknet53 backbone network and adopts the FASN structure for multi-scale spatial feature fusion. The detectors selected for comparison include two-stage and single-stage detectors. The two-stage detector is Faster R-CNN [10], and the single-stage detectors include RetinaNet [14], YoloV3 [15], YoloV4 [16], and the anchor-free CenterNet [17]. Each comparison detector is paired with the deep convolutional neural network that it commonly uses. The detection results of each detection method are shown in Table 5. The detection results of Faster R-CNN, CenterNet and YoloV3 are relatively poor, while our method achieves good results in both accuracy and efficiency and is lightweight.

Conclusions
Target detection in optical remote sensing images based on DCNNs generally involves two choices: the deep convolutional neural network and the detector. In this paper, the single-stage detector YoloV4 is chosen as the detector because of the real-time requirements of the project. Although the single-stage detector is not good at detecting small targets in optical remote sensing images, it can meet the requirements of both accuracy and efficiency once the network structure is fine-tuned. As research on detectors advances, DCNN models are also developing continuously, so it is important to study the influence of the DCNN model on target detection in optical remote sensing images. Our study finds that a deep convolutional neural network with multi-scale feature fusion is suitable for the single-stage detector YoloV4. We fine-tuned the network structure based on YoloV4 and tested it on two optical remote sensing data sets. The experimental results show that the proposed method can achieve good detection results in both simple and complex cases.