Review of Deep-Learning-Based Object Detection
The two-stage detection method is also called region-based object detection, where each region serves as a candidate proposal. This framework matches the human visual mechanism to a certain extent: the entire image is first roughly scanned (the generation of candidate boxes), and attention then focuses on the regions of interest for category subdivision and location fine-tuning. Figure 1.6 lists some milestones in the development of the two-stage detection framework.
RCNN, proposed by Girshick et al. in 2014, is the most famous work in the two-stage detection family; subsequent two-stage detection algorithms build on it. As shown in Figure 1.6(a), the training of RCNN contains the following steps:
- 1. Generate category-independent candidate boxes with a region proposal algorithm;
- 2. Crop all the candidate boxes from the original image and warp them to a uniform size (e.g., 227×227). Use these samples to fine-tune a CNN (e.g., AlexNet), which is pre-trained on ImageNet, to obtain the convolutional features of each candidate box;
- 3. Use the CNN-extracted convolutional features to train a set of class-specific linear support vector machines (SVMs) for classification;
- 4. Use the CNN-extracted convolutional features to train category-specific bounding-box regressors to refine the locations of the candidate boxes.
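Step 2 above, cropping each proposal and warping it to a fixed network input size, can be sketched as follows. This is a minimal numpy illustration using nearest-neighbor resizing; the function name and box format are our own, and the original RCNN used anisotropic warping with context padding.

```python
import numpy as np

def warp_proposal(image, box, out_size=227):
    """Crop one candidate box (x1, y1, x2, y2) from the image and warp it
    to a fixed out_size x out_size input, as RCNN does before feature
    extraction. Nearest-neighbor resizing is used here for brevity."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Index rows/columns of the crop at evenly spaced source positions.
    ys = (np.arange(out_size) * h / out_size).astype(int)
    xs = (np.arange(out_size) * w / out_size).astype(int)
    return crop[ys][:, xs]

image = np.random.rand(480, 640, 3)               # stand-in input image
proposals = [(10, 20, 110, 180), (300, 50, 420, 200)]
batch = np.stack([warp_proposal(image, b) for b in proposals])
print(batch.shape)  # (2, 227, 227, 3)
```

Every warped proposal then passes through the CNN independently, which is exactly the per-box cost that later detectors remove.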
Despite its high detection performance, RCNN has three limitations. First, its offline training process consists of multiple stages that are optimized independently of each other. Second, its training requires high computational cost and GPU memory. Third, its inference is inefficient: features must be extracted separately for each candidate box, with no feature-sharing mechanism.
FIGURE 1.6 Two-stage detectors.
Girshick et al. further introduced a multi-task loss for classification and regression and proposed a new CNN detection structure called Fast RCNN. As shown in Figure 1.6(b), Fast RCNN allows joint optimization of the network through the multi-task loss, thereby simplifying the training process. Fast RCNN also adopts a shared convolutional computation mechanism, and an RoI pooling layer is designed to extract a fixed-length feature for each candidate box. This feature is then fed into a series of fully connected layers and split into two branches, i.e., softmax classification and bounding-box regression. Compared to RCNN, Fast RCNN greatly improves the efficiency of training and inference: training is 3 times faster and inference is 10 times faster. In short, Fast RCNN achieves better detection accuracy with a simpler learning process.
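The key operation here, RoI pooling, can be sketched as below: an RoI on the shared feature map is divided into a fixed grid, and each grid cell is max-pooled. This is a minimal single-RoI numpy sketch; real implementations also handle the image-to-feature-map stride and the quantization details.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Max-pool one RoI (x1, y1, x2, y2, in feature-map coordinates) into
    a fixed out_size x out_size grid, yielding a fixed-length feature
    regardless of the RoI's size."""
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, out_size + 1).astype(int)
    ys = np.linspace(y1, y2, out_size + 1).astype(int)
    pooled = np.zeros((feature_map.shape[0], out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Guarantee each cell spans at least one feature-map element.
            cell = feature_map[:, ys[i]:max(ys[i] + 1, ys[i + 1]),
                                  xs[j]:max(xs[j] + 1, xs[j + 1])]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled

fmap = np.random.rand(256, 38, 50)            # C x H x W backbone features
pooled = roi_pool(fmap, (5, 4, 30, 25))
print(pooled.shape)  # (256, 7, 7)
```

Because all RoIs read from the same backbone feature map, the expensive convolutions run once per image rather than once per box.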
Based on Fast RCNN, Ren et al. further proposed Faster RCNN. In two-stage detection methods before Faster RCNN, an extra algorithm is required to predict candidate boxes, which is usually time-consuming. Faster RCNN instead introduces a Region Proposal Network (RPN) to generate candidate boxes. In this manner, the RPN and Fast RCNN are unified into one network for joint optimization and share backbone features (as shown in Figure 1.6(c)), greatly reducing the time cost of candidate box prediction. At the same time, the region proposals generated by the RPN are further regressed and classified for more accurate localization and higher recall. With VGG16 as the backbone, Faster RCNN runs at 5 FPS on a GPU and obtained the best detection performance of its time on PASCAL VOC.
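The RPN predicts objectness scores and box refinements relative to a dense set of reference boxes (anchors) tiled over the feature map. The anchor tiling can be sketched as follows, using the scale and aspect-ratio defaults reported for Faster RCNN; the function name and the (cx, cy, w, h) output convention are our own.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Tile reference boxes over every feature-map location, as the RPN
    does. Each location gets len(scales) * len(ratios) anchors; ratios
    are interpreted as h/w. Returns an (N, 4) array of (cx, cy, w, h)
    boxes in image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Preserve area s*s while varying aspect ratio.
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

anchors = generate_anchors(38, 50)
print(anchors.shape)  # (17100, 4) -> 38 * 50 locations x 9 anchors
```

The RPN then classifies each anchor as object/background and regresses offsets from the anchor to the nearest ground-truth box, which is what makes proposal generation nearly free.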
Starting from Fast RCNN, the features of each candidate box are extracted with RoI pooling and then passed through an RoI-wise subnetwork to further strengthen the feature representation. Although the whole process benefits from the feature-sharing mechanism, the computation of the RoI-wise subnetwork is still performed independently for each candidate box. When the subnetwork or the number of candidate boxes is relatively large, the time spent in this part increases significantly. Therefore, Dai et al. proposed RFCN, whose main idea is to minimize the portion of the network that does not share computation. Specifically, RFCN builds the entire network from convolutional layers and replaces the RoI pooling layer with position-sensitive RoI pooling (PSRoI pooling). On top of the last convolutional layer, a set of dedicated convolutions generates position-sensitive score maps, and PSRoI pooling then extracts the features of the candidate boxes from them. Finally, weighted voting produces the detection confidence and location offsets. The structure of RFCN is shown in Figure 1.6(d). Compared to Faster RCNN, RFCN improves the speed of the detector while achieving comparable detection accuracy.
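The position-sensitive pooling idea can be sketched as below: with a k × k grid, the network produces k² groups of score maps, and grid cell (i, j) of an RoI pools only from its own group, so each cell responds to a specific relative position (top-left, center, etc.). This is a minimal single-class numpy sketch with simple average voting; the real RFCN has k² × (C + 1) maps for C classes plus background.

```python
import numpy as np

def psroi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling for one RoI and one class.
    score_maps: (k*k, H, W), one position-sensitive map per grid cell.
    Cell (i, j) average-pools only from map i*k + j, then the k*k cell
    responses are averaged (voting) into a single score."""
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, k + 1).astype(int)
    ys = np.linspace(y1, y2, k + 1).astype(int)
    votes = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            cell = score_maps[i * k + j,
                              ys[i]:max(ys[i] + 1, ys[i + 1]),
                              xs[j]:max(xs[j] + 1, xs[j + 1])]
            votes[i, j] = cell.mean()
    return votes.mean()

maps = np.random.rand(9, 38, 50)      # 3x3 position-sensitive score maps
score = psroi_pool(maps, (5, 4, 30, 25))
print(0.0 <= score <= 1.0)  # True for inputs in [0, 1]
```

Note that the per-RoI work is only pooling and averaging; all learned computation stays in the shared convolutional layers, which is the source of RFCN's speedup.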
Based on the two-stage detection method, a series of further improvements have been made. Yang et al. proposed a multi-stage cascading method, which cascades a binary Fast R-CNN after region proposal to filter out easy background proposals; the second stage then uses a cascaded Fast R-CNN network for more accurate multi-category prediction through multi-step classification. To deal with object occlusion, Wang et al. proposed the Adversarial Spatial Dropout Network (ASDN), which predicts an occlusion mask for each candidate box to simulate occlusion and then uses the occlusion-aware features for classification and regression; ASDN is analogous to data augmentation at the feature level. To describe the geometric deformation of objects, Dai et al. proposed Deformable Convolutional Networks (DCN): by learning an offset for each convolution sampling position, an irregular filter is constructed to capture the geometric deformation of the object. In addition, Zhu et al. further proposed DCNv2, which also predicts a modulation weight for each sampling position of the convolutional kernel to model the importance of different offsets and obtain a better ability to model deformation. Expanding the two-stage detection method into a multi-task framework, He et al. proposed Mask R-CNN, in which detection, keypoint prediction, and instance segmentation can be jointly optimized.
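The deformable convolution idea can be sketched for a single 3 × 3 tap as below: each kernel position samples the feature map at its regular grid location plus a learned fractional offset, using bilinear interpolation, and DCNv2 additionally scales each tap by a learned modulation weight. This is a minimal single-channel numpy sketch; in the real networks the offsets and modulation weights are predicted by extra convolutional layers, which we stand in for with fixed arrays.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly sample a 2-D feature map at a fractional location,
    clamping coordinates to the border."""
    h, w = feat.shape
    y, x = np.clip(y, 0, h - 1), np.clip(x, 0, w - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def deform_conv_at(feat, weight, py, px, offsets, mods=None):
    """One output value of a 3x3 deformable convolution centered at
    (py, px): tap k samples at its grid position plus offsets[k], and
    (for DCNv2) is scaled by the modulation weight mods[k]."""
    if mods is None:
        mods = np.ones(9)                 # DCNv1: no modulation
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        oy, ox = offsets[k]               # learned fractional offset
        out += weight[k] * mods[k] * bilinear(feat, py + gy + oy, px + gx + ox)
    return out

feat = np.random.rand(16, 16)
w = np.random.rand(9)
offs = np.random.randn(9, 2) * 0.5       # stands in for a predicted offset field
val = deform_conv_at(feat, w, 8, 8, offs)
print(np.isfinite(val))  # True
```

With all offsets zero and all modulation weights one, this reduces exactly to an ordinary 3 × 3 convolution, which is why deformable convolution can be dropped into existing backbones.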