Review of Deep Learning-Based Object Detection
CNN-Based Static Object Detection
Deep learning methods have recently dominated the field of object detection. Two-stage detectors [2, 3, 21, 22] usually detect objects in three steps: region proposal, localization, and classification. For example, inspired by Faster RCNN and RFCN, CoupleNet leveraged both region-level and part-level features to handle a variety of challenging object situations; it achieved considerable detection accuracy but ran at only 8.2 FPS. As groundbreaking works, YOLO and SSD localized and classified objects with a single-shot network for real-time detection. Recently, many revised single-stage detectors have emerged [6, 23-28]. To improve small object detection, Lin et al. developed RetinaNet, which builds the single-shot network in an FPN fashion so that information propagates in a top-down manner and enlarges the receptive field of shallow layers. Redmon and Farhadi proposed YOLOv3 with DarkNet53 and multi-scale anchors for fast, accurate detection. Zhang et al. designed RefineDet to introduce two-step regression into the single-stage pipeline. RefineDet adjusted predefined anchors for more precise localization. However, its detection features were still sampled at predefined positions and thus failed to precisely describe the refined anchor regions. In short, although single-stage methods are superior in speed, two-stage methods still dominate detection accuracy on generic benchmarks [7, 19, 20]. Hence, we are motivated to analyze the drawbacks of single-stage methods in light of the merits of two-stage methods (analyzed in Section 4.1) and to construct DRNet with both competitive accuracy and fast speed.
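The two-step regression idea mentioned above can be illustrated with a minimal sketch. It assumes the common center-size box parameterization (cx, cy, w, h) with offsets scaled by the anchor size and log-space size deltas; the function names `decode` and `two_step_regression` are hypothetical, and this is not the exact formulation used by RefineDet or DRNet.

```python
import math

def decode(anchor, deltas):
    """Apply regression deltas to a box given as (cx, cy, w, h).

    Assumed parameterization: center shifts are scaled by the box size,
    and size changes are predicted in log space.
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))

def two_step_regression(anchor, coarse_deltas, fine_deltas):
    """Refine a predefined anchor first, then regress from the refined anchor."""
    refined_anchor = decode(anchor, coarse_deltas)   # step 1: adjust the predefined anchor
    return decode(refined_anchor, fine_deltas)       # step 2: final box relative to the refined anchor

# Usage: the first step shifts the anchor to the right, the second step doubles its width.
box = two_step_regression((0.5, 0.5, 0.2, 0.2), (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, math.log(2.0), 0.0))
```

Note that in this sketch the second step only changes the box coordinates; where the detection features are sampled is a separate question, which is the limitation discussed above.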
Temporal Object Detection
To detect objects in temporal vision, post-processing methods were first investigated to merge multi-frame results; subsequently, tracker-based detection, motion-guided feature aggregation, RNN-based feature integration, and tubelet proposal have been studied by the research community. Han et al. proposed SeqNMS to discard temporally interrupted bounding boxes in the non-maximum suppression (NMS) phase; Feichtenhofer et al. combined RFCN and a correlation-filter-based tracker to boost the recall rate. Based on motion estimation with optical flow, Zhu et al. devised temporally adaptive key-frame scheduling for effective feature aggregation; Chen et al. and Liu et al. took advantage of long short-term memory to propagate CNN features across time [16, 17].
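For readers unfamiliar with the NMS phase referred to above, the following is a minimal sketch of standard greedy per-frame NMS. It is not SeqNMS itself, which additionally links boxes across frames; the corner box format (x1, y1, x2, y2) and the function names `iou` and `nms` are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap strongly with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Usage: the second box overlaps the first heavily and is suppressed.
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
```

Because this procedure treats each frame independently, a correct but low-scoring box in one frame can be suppressed even when neighboring frames support it, which is the kind of case that sequence-level post-processing targets.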
However, the temporal analysis capacity of the abovementioned methods is obtained from other temporal tools. Although some methods focused on how to construct superior temporal features, they still remained in a static detection mode that is ill-suited to video. As a typical offline detection mode, Kang et al. reported a TPN for tubelet proposal (i.e., temporally propagated boxes) so that multiple frames could be simultaneously processed to improve temporal consistency. However, this batch-frame mode struggles to meet the requirements of real-world tasks. In contrast, without the aid of any other temporal tools, we develop a novel real-time online detection mode for videos using the idea of refinement. That is, refined anchors and refined feature sampling locations are generated at key frames and then temporally propagated for detection. Compared to most video detectors, our method has a concise training process without the need for sequential images.
Sampling for Object Detection
It is widely accepted that spatial sampling is important for constructing robust features. For example, Peng et al. detected objects with an improved multistage particle window that samples a small number of key features for detection. In CNNs, canonical convolution is based on a square kernel that is poorly suited to variform objects. To augment the spatial sampling locations, Dai et al. proposed deformable convolutional networks to overcome the fixed geometric structures of traditional convolution. Deformable convolution significantly boosted the detection accuracy of RFCN. For video detection, Bertasius et al. applied deformable convolutions across time and constructed robust features for temporally describing objects. Zhang et al. designed a feature consistency module with deformable convolution to reduce inconsistency in the single-stage pipeline. Wang et al. proposed guided anchoring for RPN, Faster RCNN, and RetinaNet to achieve higher-quality region proposals. In contrast, we aim to capture accurate single-stage features; more specifically, refined feature locations are produced based on refined anchors. Moreover, we propagate refinement information across time for video detection.
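The difference between a fixed square kernel and offset-based sampling can be illustrated with a minimal single-channel sketch. It assumes a 3x3 kernel, zero padding outside the feature map, and bilinear interpolation at fractional positions; the names `bilinear_sample` and `deformable_sample_3x3` are hypothetical, and the offsets are passed in directly rather than predicted by a learned layer as in deformable convolutional networks.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional position (zero padding)."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value

def deformable_sample_3x3(feature, weights, offsets, cy, cx):
    """Weighted sum over a 3x3 grid centered at (cy, cx), each tap shifted by its own offset.

    weights: 9 kernel weights in row-major order.
    offsets: 9 (dy, dx) pairs; all zeros reproduces an ordinary 3x3 convolution tap.
    """
    grid = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    out = 0.0
    for k, (ky, kx) in enumerate(grid):
        dy, dx = offsets[k]
        out += weights[k] * bilinear_sample(feature, cy + ky + dy, cx + kx + dx)
    return out

# Usage: with zero offsets, the result equals the plain 3x3 sum around the center.
feature = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
plain = deformable_sample_3x3(feature, [1.0] * 9, [(0.0, 0.0)] * 9, 1, 1)
```

With nonzero offsets, the nine sampling points are no longer tied to the square grid, so they can follow the shape of the object or, in a refinement setting, the region of a refined anchor.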