Anchor-free object detection with mask attention

Anchor-free methods based on key point detection have made great progress. However, these methods depend too heavily on a convolutional network to generate a rough heatmap, which makes it difficult to detect objects with large size variation and objects that are dense or overlapping. To solve this problem, we first propose a mask attention mechanism for object detection that exploits the advantages of attention to improve the accuracy of the heatmaps generated by the network. Second, we design an optimized fire module to reduce the size of the model. The fire module is an extension of grouped convolution: through purposeful grouping, it allows each group of convolutional features to learn the same semantic feature. In this paper, the mask attention mechanism uses object segmentation images to guide the generation of corner heatmaps. Our approach achieved an accuracy of 91.84% and a recall of 89.83% on the Tsinghua-Tencent 100K dataset. Compared with popular object detection methods, the proposed method has advantages in model size and accuracy.


Introduction
In recent years, applications based on object detection technology have become more widespread [1]. Common applications include pedestrian detection [2], vehicle detection [3], image retrieval [4], and traffic sign detection [5]. These applications impose higher requirements on both the detection performance and the model size of object detection methods. Anchor-based detection is the typical approach in common application scenarios, and anchor-based methods have made great progress in localization and recognition performance [6,7]. These methods regress the boundaries of objects by generating dense box coordinates on the feature map. To obtain higher-quality boxes, non-maximum suppression is used to filter out most of the overlapping bounding boxes. However, computing and filtering anchor boxes requires a large amount of computing resources. At the same time, the pre-designed anchor boxes are not aligned with the real boundaries of the objects. As a result, anchor-based methods have shortcomings in both computation and accuracy.
Recently, the best anchor-based results have been obtained by using multiple receptive-field branches with shared weights [24].
Anchor-based methods constrain the regression results by placing dense anchors on the image. This simplifies the learning task of the convolutional network but makes the detection method less flexible: the predicted boxes are limited in size, shape, and aspect ratio, and adapt poorly to unusual objects. The anchor boxes are frequently misaligned with the regression results, which leads to a small intersection over union (IOU). In reality, the aspect ratio, size, and shape of objects are highly uncertain, and manually set anchors cannot match them. In the training phase, the regression task requires computing the IOU between each result and the anchors, so the time and memory costs are high. For these reasons, this article uses an anchor-free object detection method, which has no fixed anchor constraints and can more flexibly match objects of various proportions.

Anchor-free detectors
Anchor-free detection is another popular approach, and YOLO has been considered its representative. YOLO was designed to get rid of the two-stage inference of the R-CNN series of methods: it localizes and classifies objects directly from the image. Regressing the box coordinates directly, without using anchors, greatly improves the speed of object detection. However, this coarse detection scheme results in low accuracy and an inability to detect small objects. DenseBox is considered to be the earliest anchor-free method [10]. DenseBox uses five heatmaps to classify faces and predict two corner coordinates. Because of its shortcomings in handling overlapping heatmaps, the recall rate of DenseBox in general object detection is low.
CornerNet removes the anchor box setting and uses the top-left corner and the bottom-right corner to represent the boundaries of a box [25]. The main task of CornerNet is to predict the positions of the two corners, and classification is carried out as part of the corner detection tasks. To improve the recall rate, corner pooling takes the maximum response along the two boundary directions of each corner, which increases the probability that an object is detected. Inspired by CornerNet, CenterNet uses the midpoint of the object as the detection center [8] and detects the object by adding its boundary and size information. FCOS [26] uses center-ness to suppress excessive and inappropriate box boundaries and assigns objects to multiple feature levels for classification. ExtremeNet [27] uses standard key point estimation to detect four extreme points (top, left, bottom, right) and a center point of each object. Compared with other anchor-free methods, CornerNet makes full use of the edge information of objects. In particular, the corner embedding vectors make the pairing of corners unambiguous.
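The corner pooling step mentioned above can be illustrated with a small sketch. It follows the top-left variant as described for CornerNet [25]: one feature map is max-pooled downwards, another is max-pooled to the right, and the two results are added. The function name and the use of plain nested lists are assumptions made for illustration; this is not the implementation used in this paper.

```python
def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling on two H x W feature maps given as nested lists."""
    h, w = len(f_t), len(f_t[0])
    # Vertical pass: each cell keeps the max of f_t over itself and all rows below it.
    t = [row[:] for row in f_t]
    for i in range(h - 2, -1, -1):
        for j in range(w):
            t[i][j] = max(t[i][j], t[i + 1][j])
    # Horizontal pass: each cell keeps the max of f_l over itself and all columns to its right.
    l = [row[:] for row in f_l]
    for i in range(h):
        for j in range(w - 2, -1, -1):
            l[i][j] = max(l[i][j], l[i][j + 1])
    # The pooled response is the sum of the two directional maxima.
    return [[t[i][j] + l[i][j] for j in range(w)] for i in range(h)]


features = [[0, 0, 0],
            [0, 0, 1],
            [0, 2, 0]]
pooled = top_left_corner_pool(features, features)
```

A location that has strong responses both below it and to its right receives a high pooled value, which is the evidence expected at the top-left corner of a box.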

Object segmentation
The performance of object segmentation methods based on convolutional networks keeps improving. FCN [28] directly uses a convolutional network to make binary predictions for each category, extracting the semantic features of the image with a convolutional neural network. This segmentation method dispenses with traditional color and shape features, thereby reducing the difficulty of image segmentation. Mask R-CNN adds the segmentation task directly to the object recognition task, which provides a new idea for object detection. After that, the encoder-decoder networks U-Net [29] and V-Net [30] made full use of skip connections for feature fusion and made great progress in the fields of medicine and 3D detection. Our approach differs from the above: we use the results of the segmentation module to guide the convolution to generate new features.

The proposed method
Our approach is based on CornerNet-Squeeze. This method uses corner points to represent the bounding box of the object, regresses the coordinates directly from the heatmap, and uses an offset to correct the deviation of the coordinates. By analyzing the network structure and overall method of CornerNet, we found that the heatmap is the most critical part of the whole method, especially the positions of its peaks and their values. However, this feature is not really exploited in the network model, even though corner detection depends completely on the heatmap. The accuracy of the corner heatmap depends entirely on the training data and the feature extraction ability of the convolutional neural network, and it is difficult to rely on the convolutional network alone to generate an accurate corner heatmap. Object segmentation can highlight the area containing the target. Inspired by this, this paper uses a mask module to enhance the representation of objects. The main function of the mask module is to generate a complete segmentation image from the original image. The segmentation result is then fused with the internal feature map of the convolutional network to generate enhanced features.

Base framework
This section describes the overall framework and process of the proposed method, which is illustrated in Figure 2. Our method uses two hourglass networks in tandem, which effectively exploits the contextual features of the image. Thanks to the symmetrical nature of the hourglass network, the backbone network has an elegant structure. The hourglass network is followed by the mask module. Our mask module is more than just an additional separate module; it is a feature-enhancement module. It makes the convolutional network focus on the critical areas containing objects, thereby improving the detection accuracy of the network. To fully exploit the contextual features of the feature maps, we combine the outputs of the two hourglass networks element by element.
The Squeeze method substantially reduces the number of channels in the convolutional network. However, because coarse grouping does not highlight the relationship between channels, the relationship between groups cannot be mined directly. The Spatial Group-wise Enhance (SGE) module generates an attention factor for each spatial position in each convolutional semantic group [31]. Therefore, we use SGE to adjust the importance of each channel group so that each group can autonomously enhance its learned representation and suppress possible noise. In this way, we optimized the fire module with almost no additional learnable parameters. Group convolution reduces both the computational cost and the number of parameters. By comparing local and global features within each group of convolutions, the spatial distribution of semantic features is enhanced and a clear division of labor among the channels is achieved.
The coordinate regression and classification of the network are realized by detecting the top-left and bottom-right corners. The final detection stage includes two branches, the top-left and bottom-right detection modules. Each detection module outputs heatmaps, offsets, and embeddings.
The heatmap contains C channels (C is the number of object categories, with no background category), and each channel is a binary mask indicating the corner positions of the corresponding category. For each corner, there is only one ground-truth location, and all other positions are negative samples. During training, the model reduces the penalty on negative samples and treats locations within a radius r of each ground-truth corner as soft positive samples. This is because a pair of corners falling within the radius-r region can still generate a valid bounding box for the object.
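The radius-r region around a ground-truth corner can be sketched as a soft target for one heatmap channel. The snippet below uses an unnormalized Gaussian whose width is tied to the radius (sigma = r / 3), which is the choice reported for CornerNet [25]; the function name, the circular support, and the pure-Python representation are assumptions for illustration only.

```python
import math


def corner_heatmap(height, width, corner, radius):
    """Return a height x width map with a soft peak around one ground-truth corner."""
    cx, cy = corner
    sigma = max(radius, 1) / 3.0  # assumption: sigma tied to the radius, as in CornerNet
    heat = [[0.0] * width for _ in range(height)]
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= radius ** 2:
                # Value 1 at the corner itself, decaying towards the edge of the region.
                heat[y][x] = math.exp(-d2 / (2 * sigma ** 2))
    return heat


heat = corner_heatmap(7, 7, corner=(3, 3), radius=2)
```

Locations outside the radius keep the value 0 and are treated as ordinary negatives, while locations inside it receive a reduced penalty during training.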
Offsets form a correction module that is specifically designed to correct errors in the heatmap corner positions. Because of down-sampling, the heatmap generated by the model is smaller than the original input image. With a down-sampling factor of n, a point (x, y) in the original image is mapped to (⌊x/n⌋, ⌊y/n⌋) on the heatmap. Re-mapping points on the heatmap back to the original image therefore introduces a quantization error, which shifts the corner positions. This shift can seriously affect the IOU of small objects. The corrected detection result is obtained by adding the offset to the heatmap prediction.
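The quantization error and its correction can be made concrete with a short worked example. The helper names `to_heatmap` and `to_image` are hypothetical; the arithmetic follows the mapping (⌊x/n⌋, ⌊y/n⌋) described above.

```python
import math


def to_heatmap(x, y, n):
    """Map an image point to its heatmap cell and the fractional offset lost by flooring."""
    hx, hy = math.floor(x / n), math.floor(y / n)
    return (hx, hy), (x / n - hx, y / n - hy)


def to_image(cell, offset, n):
    """Map a heatmap cell back to image coordinates, adding the offset before rescaling."""
    return ((cell[0] + offset[0]) * n, (cell[1] + offset[1]) * n)


cell, offset = to_heatmap(101, 46, n=4)
uncorrected = to_image(cell, (0.0, 0.0), n=4)  # (100.0, 44.0): up to n - 1 pixels off
corrected = to_image(cell, offset, n=4)        # (101.0, 46.0): original position restored
```

For a sign that is only 20 pixels wide, an error of two or three pixels per corner already changes the IOU noticeably, which is why the offset branch matters most for small objects.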
Embedding vectors are the key to pairing corners. Each corner is encoded as an embedding vector, which captures the relationship between top-left and bottom-right corners. The distance between the embedding vectors of two corners is used to determine whether they form a pair belonging to the same object.
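A minimal sketch of corner pairing by embedding distance is given below. It assumes one-dimensional embeddings, a distance threshold of 0.5, and a box score equal to the mean of the two corner scores, as reported for CornerNet [25]; the tuple layout and the function name are hypothetical and chosen only for illustration.

```python
def match_corners(top_left, bottom_right, max_dist=0.5):
    """Pair corners given as (x, y, score, embedding) tuples into scored boxes."""
    boxes = []
    for x1, y1, s1, e1 in top_left:
        for x2, y2, s2, e2 in bottom_right:
            # A bottom-right corner must lie below and to the right of the top-left corner.
            if x2 <= x1 or y2 <= y1:
                continue
            # Corners of the same object are expected to have nearby embeddings.
            if abs(e1 - e2) > max_dist:
                continue
            boxes.append((x1, y1, x2, y2, (s1 + s2) / 2))
    # Highest-scoring boxes first.
    boxes.sort(key=lambda b: b[4], reverse=True)
    return boxes


tl = [(10, 10, 0.9, 0.1), (50, 50, 0.8, 2.0)]
br = [(40, 40, 0.7, 0.2), (90, 80, 0.6, 2.1)]
boxes = match_corners(tl, br)
```

In this toy input the pair with embeddings 0.1 and 2.1 is rejected even though its geometry is valid, because the embeddings are far apart.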

Fire module
Group convolution reduces the number of parameters, but it cannot improve the feature representation of the convolutional network. To enhance the feature representation of the model, we drew inspiration from the SGE module [31]. The fire module is shown in Fig. 3. Each SGE module increases the number of parameters by about 2 times the number of groups, and the number of groups is usually 32 or 64.
The fire module groups features along the channel dimension. Group convolution divides the channels into multiple sub-features that represent different semantics. For a particular semantic group, it is reasonable and beneficial to generate the corresponding semantic features at the correct spatial locations of the original image. In the fire module of this paper, the convolutional feature maps within each group are first squeezed to obtain the global semantic vector of the group, as in Eq. (1). The global semantic vector is then multiplied with the original features to obtain the initial attention feature c_i. Next, c_i is normalized within each group; this normalization introduces only two learnable parameters, which allow SGE to rescale the normalized result automatically. Finally, an activation function turns the normalized value into a weight a_i, and a_i is multiplied element by element with the original feature.
First, the features are grouped. The spatial averaging function F_gp yields the global semantic feature g, which gives the global semantics of each group at low cost.
Within each convolution group, the global feature g is multiplied with the original feature x_i at each position to obtain the initial attention feature c_i.
The mean μ_c within the group is subtracted from each c_i, and the result is divided by the standard deviation σ_c within the group. Two scale and offset parameters allow the normalization to be undone if needed; a sigmoid then produces the final attention mask, which scales the feature at each location in the original feature group.
As in batch normalization (BN), two additional parameters (λ, β) are added to represent the scale and offset, which makes the normalization more general. These are the only two sets of non-convolutional parameters that need to be learned in the fire module.
To obtain the enhanced feature x̂_i, the attention weight a_i of each position is multiplied element-wise with the original feature x_i.
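The display equations for these steps are not reproduced in this text. One consistent way to write them, following the SGE formulation [31] and using the symbols introduced above, is sketched below. The group size m (number of spatial positions), the small constant ε for numerical stability, the use of the standard deviation in the normalization, and the folding of the sigmoid into the last line are assumptions of this sketch, not statements taken from the original equations.

```latex
\begin{align}
g &= F_{gp}(X) = \frac{1}{m}\sum_{i=1}^{m} x_i \\
c_i &= g \cdot x_i \\
\hat{c}_i &= \frac{c_i - \mu_c}{\sigma_c + \epsilon},
\qquad \mu_c = \frac{1}{m}\sum_{j=1}^{m} c_j,
\qquad \sigma_c^2 = \frac{1}{m}\sum_{j=1}^{m} \left(c_j - \mu_c\right)^2 \\
a_i &= \lambda\,\hat{c}_i + \beta \\
\hat{x}_i &= x_i \cdot \operatorname{sigmoid}(a_i)
\end{align}
```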
The spatial average is regarded as a global representation at the channel scale. Each group of feature maps captures a specific semantic during learning, so we use the global average feature to represent what each group has learned. Enhanced features are obtained by fusing global and local features. Our improved fire module learns only the two parameters (λ, β) and adjusts the mask for feature enhancement.
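The squeeze, normalize, and re-weight steps of one feature group can be traced on a toy input. The sketch below assumes that the global feature and each local feature are combined by a dot product and that the group standard deviation is used for normalization, as in SGE [31]; the function name and the default values of `lam`, `beta`, and `eps` are hypothetical.

```python
import math


def sge_group(x, lam=1.0, beta=0.0, eps=1e-5):
    """Enhance one group; x is a list of m position vectors with C channels each."""
    m, c = len(x), len(x[0])
    # Squeeze: global average of the group over all spatial positions.
    g = [sum(v[k] for v in x) / m for k in range(c)]
    # Similarity between the global feature and each local feature (dot product).
    coef = [sum(gk * vk for gk, vk in zip(g, v)) for v in x]
    # Normalize the similarities within the group.
    mu = sum(coef) / m
    std = math.sqrt(sum((ci - mu) ** 2 for ci in coef) / m)
    norm = [(ci - mu) / (std + eps) for ci in coef]
    # Scale, shift, and squash into an attention weight per position.
    att = [1.0 / (1.0 + math.exp(-(lam * n + beta))) for n in norm]
    # Re-weight the original features.
    enhanced = [[a * vk for vk in v] for a, v in zip(att, x)]
    return enhanced, att


x = [[1.0, 1.0], [3.0, 3.0], [1.0, 1.0], [3.0, 3.0]]
enhanced, att = sge_group(x)
```

Positions whose features agree with the group average (the vectors of 3.0 here) receive weights above 0.5, while the others are suppressed, using only the two scalars `lam` and `beta` per group.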

Mask module
Our mask is inspired by image segmentation. Since CornerNet relies heavily on the generation of heatmaps, it is difficult to make accurate predictions directly from the heatmap. We add a mask module to generate the segmentation image directly. As shown in Fig. 4, the output of the mask module is merged with the feature map to obtain spatially enhanced features. The mask acts as a constraint that increases the attention paid to the target object area. To simplify feature extraction, we use an algorithm similar to expectation-maximization, which obtains the mask by maximum likelihood estimation. The internal features of the mask are low rank, which allows the mask to reduce intra-class differences while maintaining inter-class differences. The mask module rearranges the original features, so that their spatial distribution changes in a way that facilitates heatmap detection.
To preserve the simplicity of the original method, the mask module must not be complicated. The Expectation-Maximization Attention (EMA) method avoids computing the attention map over the full feature map [32]. Instead, the hidden variable is computed iteratively with the expectation-maximization algorithm, and the attention mechanism runs on the hidden variable, which greatly reduces the complexity of the algorithm.
Inside the mask module, a 1 × 1 convolution kernel is used to squeeze the input feature channels, which reduces the amount of calculation and uses resources efficiently. The feature map is then upsampled to the output size, which is half of the original image size. EMA is applied to the feature map to obtain the output mask and the bases μ. The output mask can be reconstructed from the converged bases μ and the latent variables Z. Finally, the output mask and the original feature map are multiplied element-wise to obtain the final feature. The output mask is the result of instance segmentation. Figure 5 shows object segmentation results on some data.
The EMA module is composed of three parts: responsibility estimation (E), likelihood maximization (M), and data re-estimation (R). E and M are the E step and the M step of the EM algorithm. The size of the feature map X is C × H × W. To simplify the notation, X is reshaped to N × C, where N = H × W. Then the feature map is X ∈ R^(N×C), the initial bases are μ ∈ R^(K×C), and Z ∈ R^(N×K) is the hidden variable. A_E estimates Z, and A_M updates μ. A_E and A_M are executed alternately for T steps; we set T to 3 by default. A_R then computes the approximate reconstruction X̃ of the features from μ and Z.
Step E calculates the posterior distribution of the hidden variable Z.
Step M updates μ by maximizing the likelihood function.
After the E and M steps have alternated T times, μ and Z are used to reconstruct the feature map.
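The three steps can be sketched on small matrices as follows. This is a simplified reading of EMA [32] under stated assumptions: responsibilities are a softmax over dot products between features and bases, each base is updated to the responsibility-weighted mean of the features, and the reconstruction is the product Z μ. The temperature and the normalization of the bases used in practice are omitted, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def ema(x, mu, steps=3):
    """x: N x C features, mu: K x C initial bases. Returns (reconstruction, mu, z)."""
    assert steps >= 1
    n, k, c = len(x), len(mu), len(x[0])
    for _ in range(steps):
        # E step: responsibility of each base for each feature vector.
        z = [softmax([sum(a * b for a, b in zip(xn, mk)) for mk in mu]) for xn in x]
        # M step: each base becomes the responsibility-weighted mean of the features.
        new_mu = []
        for j in range(k):
            weight = sum(z[i][j] for i in range(n))
            new_mu.append([sum(z[i][j] * x[i][d] for i in range(n)) / weight
                           for d in range(c)])
        mu = new_mu
    # R step: rebuild every feature vector as a mixture of the bases.
    x_rec = [[sum(z[i][j] * mu[j][d] for j in range(k)) for d in range(c)]
             for i in range(n)]
    return x_rec, mu, z


x = [[5.0, 0.0], [5.0, 0.1], [0.0, 5.0], [0.1, 5.0]]
x_rec, mu, z = ema(x, mu=[[1.0, 0.0], [0.0, 1.0]], steps=3)
```

Because the reconstruction is built from only K bases, it is low rank by construction, which matches the low-rank property of the mask discussed above.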

Training
L_det is the loss function for detecting the two diagonal corners of an object. L_det uses a focal loss to constrain the positions of the corner points. The category of the object is predicted with unnormalized Gaussian heatmaps, and a pair of corner points of the object is located through the top-left and bottom-right heatmaps. L_mask is the loss function of the mask module. The mask module is an instance segmentation network, which is trained with a multi-class cross-entropy loss. The parameters of the mask module are updated using the image segmentation labels.
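A sketch of a focal loss on corner heatmaps is given below. It follows the variant described for CornerNet [25], in which positives are the locations with target value 1 and the penalty on the remaining locations is reduced by (1 - y)^β; the exponents α = 2 and β = 4 are the values reported there and are assumptions with respect to this paper, as are the function name and the clamping constant `eps`.

```python
import math


def corner_focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Focal loss over one heatmap channel; pred and target are nested lists in [0, 1]."""
    loss, positives = 0.0, 0
    for p_row, y_row in zip(pred, target):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
            if y == 1.0:
                # Positive location: penalize low confidence.
                loss -= (1.0 - p) ** alpha * math.log(p)
                positives += 1
            else:
                # Other locations: penalty reduced near a ground-truth corner (y close to 1).
                loss -= (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(positives, 1)


target = [[0.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 0.0]]
good = [[0.01, 0.3, 0.01], [0.3, 0.95, 0.3], [0.01, 0.3, 0.01]]
bad = [[0.9, 0.3, 0.01], [0.3, 0.05, 0.3], [0.01, 0.3, 0.01]]
loss_good = corner_focal_loss(good, target)
loss_bad = corner_focal_loss(bad, target)
```

The prediction that peaks at the true corner yields a much smaller loss than the one that misses the corner and fires at a far-away location.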
L_push and L_pull constrain the relationship between corners of the same object and of different objects. L_pull constrains the embedding distance between the pair of corner points of the same object, so that this distance is minimized. L_push constrains the corner points of different objects, so that the embedding distance between them is maximized.
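The two embedding losses can be sketched as follows, using the form described for CornerNet [25]: the pull term draws both corner embeddings of an object towards their mean, and the push term separates the means of different objects by a margin. One-dimensional embeddings and a margin of 1 are assumptions taken from that description, and the function name is hypothetical.

```python
def pull_push_loss(tl_emb, br_emb, margin=1.0):
    """Return (pull, push) for per-object top-left and bottom-right embeddings."""
    n = len(tl_emb)
    if n == 0:
        return 0.0, 0.0
    # Mean embedding of each object.
    centers = [(t + b) / 2 for t, b in zip(tl_emb, br_emb)]
    # Pull: both corners of an object should sit close to the object's mean.
    pull = sum((t - c) ** 2 + (b - c) ** 2
               for t, b, c in zip(tl_emb, br_emb, centers)) / n
    # Push: means of different objects should be at least `margin` apart.
    push = 0.0
    for k in range(n):
        for j in range(n):
            if j != k:
                push += max(0.0, margin - abs(centers[k] - centers[j]))
    if n > 1:
        push /= n * (n - 1)
    return pull, push


separated = pull_push_loss([0.0, 3.0], [0.0, 3.0])
tangled = pull_push_loss([0.0, 0.5], [1.0, 0.5])
```

Both terms vanish for well-separated, tightly grouped embeddings, and both grow when the corners of one object drift apart or two objects share the same mean embedding.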
L_off compensates for the deviation between the predicted and true corner positions and uses the smooth L1 loss as the penalty function. Because of the large amount of down-sampling in the convolutional network, the final regressed coordinates are shifted relative to the coordinates in the original image, so we additionally predict an offset value to adjust the position of each corner.
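For reference, the smooth L1 penalty applied to offsets can be written in a few lines. The transition point of 1 between the quadratic and linear regimes is the common default and is an assumption here, as are the function names and the averaging over corners.

```python
def smooth_l1(diff):
    """Quadratic for small errors, linear for large ones (transition at |diff| = 1)."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def offset_loss(pred, target):
    """Average smooth L1 penalty over a list of predicted and true (dx, dy) offsets."""
    total = sum(smooth_l1(px - tx) + smooth_l1(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / max(len(target), 1)


loss = offset_loss(pred=[(0.75, 0.5)], target=[(0.25, 0.5)])
```

Since true offsets lie in [0, 1), most errors fall in the quadratic regime, while the linear regime limits the influence of occasional large errors.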
The loss function for joint training of all tasks is as follows. α, β, λ, and γ denote the weights of the individual tasks and are set to 0.1, 0.1, 1, and 1, respectively.
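The display equation is not reproduced in this text. A plausible form, assuming the CornerNet convention [25] in which α, β, and γ weight the pull, push, and offset terms, and assuming that λ weights the mask loss, is sketched below; the assignment of λ to L_mask in particular is an assumption of this sketch.

```latex
L = L_{det} + \lambda\,L_{mask} + \alpha\,L_{pull} + \beta\,L_{push} + \gamma\,L_{off}
```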

Experimental results and discussions

Datasets and implementation details
In this section, the experimental process and materials are detailed. We validated our method on the Tsinghua-Tencent 100K dataset [33]. The dataset contains more than 10,000 images, of which 6000 are training images and 3000 are validation images. Objects are divided into three scales: small, medium, and large. Small objects have sizes in [0-30], large objects have sizes in [96-400], and medium objects lie in between.
We analyzed the number of labeled objects in several size ranges and the number of objects in each category. Detailed statistics are given in Fig. 6. The upper part of the figure shows the number of instances per category. There are a total of 182 label categories in the dataset; only the categories with more than 100 labels are shown. The lower part of the figure shows the statistics of object size. Objects with size < 50 account for 64.3% of all labeled objects, while objects with size ≥ 90 account for only 11.3%. The figure clearly shows that small objects make up a large proportion of the dataset, which makes detection on this dataset very challenging and is consistent with real driving images. To reduce the training burden, we used the first 42 categories, which account for 89.5% of all labels.
Our method is implemented in PyTorch. All network parameters are randomly initialized. The network input size is 512 × 512, the heatmap output size is 128 × 128, and the mask output size is 256 × 256. We set both α and β to 0.1 and γ to 1. Mask-CornerNet is trained with a batch size of 13 and a learning rate of 0.00025, which is multiplied by 0.1 every 180,000 iterations. Adam is used as the optimizer. All experiments were run on a workstation with an NVIDIA GTX 1080 Ti GPU and 32 GB of memory. In the model evaluation phase, the top 100 top-left corners and the top 100 bottom-right corners are selected from the heatmaps.
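The step schedule for the learning rate can be written out explicitly. The function below only restates the values given above (base rate 0.00025, factor 0.1 every 180,000 iterations); the function name is hypothetical, and the training code itself is not shown in this paper.

```python
def learning_rate(iteration, base_lr=0.00025, step=180_000, factor=0.1):
    """Step decay: multiply the base rate by `factor` once per completed `step` iterations."""
    return base_lr * factor ** (iteration // step)


rates = [learning_rate(i) for i in (0, 179_999, 180_000, 360_000)]
```

In PyTorch, a schedule of this shape is commonly obtained with a step-based learning rate scheduler attached to the Adam optimizer.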

Results
To verify our method from multiple aspects, we compared it with commonly used methods: the traffic sign detection method based on real environments [33], the pyramidal convolutional network for traffic sign detection [5], Faster R-CNN, YOLOv3 [12], CornerNet [25], CornerNet-Squeeze, and CornerNet-Saccade [14]. In addition, to verify the effectiveness of each module at multiple object sizes, we built several models that each add a single optimization to the baseline. CornerNet-fusion merges multiple features. CornerNet-fire uses the fire module of this article. CornerNet-mask contains the mask module. Mask-CornerNet is the full method proposed in this paper. All reference object detection methods use the same data augmentation. We study these methods at three object scales: large, medium, and small. The experimental results show that the proposed method has clear advantages at multiple object scales. The accuracy-recall curves of our method at multiple scales are shown in Fig. 7.
Our Mask-CornerNet method performs well on small objects and over all sizes. For small objects, the proposed method has an accuracy of 90.11% and a recall of 88.42%, while the baseline has an accuracy of 90.74% and a recall of 86.33%. Pyramidal convolutional networks do not have much advantage on small objects, with an accuracy of 90.23% and a recall of 85.25%. YOLOv3, which detects on multiple feature maps, reaches an accuracy of 91.51% and a recall of 82.22% on small objects. Over all sizes, the accuracy and recall of our method are 91.63% and 89.83%. CornerNet has an accuracy of 90.29% and a recall of 89.22%. CornerNet-Saccade has an accuracy of 90.33% and a recall of 84.14% over all scales. Our method improves the accuracy of detection over all sizes by increasing the detection accuracy of small objects.

Ablation study
We run experiments that add a single module at a time to explore how much improvement each module brings to object detection. All experiments were performed under the same conditions, environment, and hyper-parameters. This article uses the recall rate (R), accuracy rate (A), and mean average precision (mAP) as the evaluation criteria. In this paper, mAP is the area enclosed by the precision-recall curve and the X-axis. To take full advantage of contextual features, we use the fused output of the two hourglass networks. As can be seen from Table 1, the fusion features improve the recall rate of the network: the recall rate increased by 0.28%, while the accuracy rate decreased. Adding the fire module brought an increase of 0.67% over the baseline. After adding the mask module, the accuracy of the model increased by 0.71%. The experimental results show that every added module improves object classification and detection, and thereby the accuracy of the model. In terms of mAP, the overall performance of our model is improved: the mAP of the proposed model is 90.01%, compared with 89.84% for the baseline.
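The area under a precision-recall curve can be approximated from sampled points of the curve. The sketch below uses the trapezoidal rule; the interpolation scheme actually used for the reported mAP is not stated in this paper, so this choice, the function name, and the sample points are assumptions for illustration.

```python
def area_under_pr(recalls, precisions):
    """Trapezoidal area between the precision-recall curve and the recall axis."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        # Width of the recall interval times the average precision over it.
        area += (r1 - r0) * (p0 + p1) / 2
    return area


area = area_under_pr(recalls=[0.0, 0.5, 1.0], precisions=[1.0, 1.0, 0.5])
```

With these three hypothetical points the area is 0.5 × 1.0 + 0.5 × 0.75 = 0.875.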
Our final model uses all of these improvements: it fuses contextual features and optimizes the fire module to improve the segmentation network's ability to express features. Overall, we increased the recall rate by 1.04% over the baseline at a small cost.
To explore the specific influence of context on feature expression, we perform object detection directly on the output features of each hourglass network. The first hourglass is stack-1, the second is stack-2, and the fused feature is stack-1 + stack-2. Table 2 shows the experimental results. The recall rate using stack-1 alone was 82.34%, while the recall rate using the fused features was 89.83%, an increase of 7.49 percentage points. The trade-off between recall and accuracy was also clearly better balanced.

The effect of the number of proposals
Since Mask-CornerNet detects top-left and bottom-right corners separately, the top 50-200 corner points with a score above the threshold are generally used, and embedding vector similarity is used to match pairs of corners. We explored the influence of the number of corner points on object detection performance. The experimental results are shown in Fig. 8. By adjusting the number of detection points, we found that the number of detection points did not affect the coordinate regression. This rules out the number of detection points as a factor affecting our algorithm.
Table 1 Results of ablation experiments of various components

Accuracy and efficiency
Our proposed method, Mask-CornerNet, is compared with CornerNet, YOLOv3, and RetinaNet in terms of time consumption and accuracy. Figure 9 shows the results of this experiment. We explore the performance of all methods on 6 different input image sizes, including 128, 256, 448, 512, and 640. The relationship between model size and inference time is given in Table 3. Our fire module directly reduces the size of the entire CornerNet-Squeeze by 112 Mb, while the accuracy is not reduced too much. After adding the mask module, the size of our model is still only 112.8 Mb. Our anchor-free detector thus keeps its advantage as a small-size object detection model. In this paper, the fire module and the mask module optimize the baseline model with a small number of parameters. The increase in inference time is mainly caused by channel fusion and instance segmentation.
Figure 9 compares our model with other models in terms of GPU time. All experiments were performed in a consistent environment. Because of the characteristics of the dataset, the accuracy of all models fluctuates across input image sizes and offers no useful reference, so we only compare GPU time against recall. YOLOv3 takes 47 ms to reach a recall of 86.44%. CornerNet-Squeeze takes 32 ms to reach its highest recall of 88.79%. After adding the fire module, the size of the model is significantly reduced; it takes 41 ms to reach a recall of 85.82% and 47.2 ms to reach 89.45%. Mask-CornerNet takes 52 ms to reach 87.21% and 62 ms to reach 89.82%. Mask-CornerNet and CornerNet-fire are the same size. The CornerNet-Saccade and CornerNet models do not perform well on our dataset. CornerNet-Saccade takes 167 ms to reach a recall of 86.27% and 301 ms to reach 87.70%. CornerNet takes 127 ms to reach a recall of 86.09% and 195 ms to reach 89.20%.

Conclusion
This paper proposes an anchor-free object detection method with a mask attention mechanism. To make full use of the feature maps of convolutional networks, this paper fuses feature maps of multiple scales for object detection. The fire module proposed in this paper makes it easier to port the model to mobile devices. The fire module semantically groups the channels of the convolutional network to improve the distribution of functions among the convolution kernels. Because directly using the convolutional network to generate the corner heatmap is not ideal, this paper proposes to use the mask mechanism to guide the corner detection network. The instance segmentation results are fed back into the convolutional network to increase the diversity of its feature maps and enhance the network's expressive ability. Our method was evaluated on the Tsinghua-Tencent 100K dataset and compared with the commonly used object detection method YOLOv3 and with the recent CornerNet and CornerNet-Squeeze methods. Experimental results show that our method performs well on this dataset.
The detection speed and model size of our model have room for further improvement. Next, we will design a more concise feature extraction network to optimize model size and detection accuracy. The mask module generates the instance segmentation at a cost in time, so we will further study instance segmentation modules with better time performance. This article mainly uses a traffic sign dataset containing a large number of small objects; in future work we will extend the method to the detection of other natural objects. This article also provides a feasible idea for improving the detection performance on traffic sign objects.