Joint Anchor-Feature Refinement for Real-Time Accurate Object Detection in Images and Videos


Object detection is one of the fundamental and challenging areas of research in computer vision. With rapid advances in deep learning, convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in this task. Zhao et al. presented an overview of modern object detection approaches [1]. From this review, we can see that two-stage detectors, represented by the RCNN family [2] and RFCN [3], are usually accurate but comparatively slow. By contrast, single-stage detectors [4, 5] detect objects in one step and can therefore run in real time, but with only modest accuracy. Detection that is both fast and accurate thus remains a challenging problem for real-world applications.

It is instructive that two-stage methods achieve high accuracy while single-stage detectors offer a desirable inference speed, and this motivates us to investigate the reasons. In our opinion, the high accuracy of two-stage approaches comes from two advantages: (1) two-step regression and (2) relatively accurate features for detection. In detail, two-stage detectors first regress predefined anchors with the aid of region proposal [2], and this operation significantly eases the difficulty of final localization. Besides, an RoI-wise subnetwork [2] is appended to the region proposal part, so features in the region of interest can be leveraged for final detection. By contrast, the single-stage paradigm has two drawbacks: (1) the detection head directly regresses coordinates from predefined anchors, but most anchors are far from matching object regions, and (2) classification information comes from possibly inaccurate locations, where features may not be precise enough to describe objects. Referring to Figure 4.1a, it is relatively difficult to regress predefined anchors so that they precisely surround the object (e.g., the dog in Figure 4.1). Moreover, as feature sampling locations follow predefined anchor regions, detection features for small-scale anchors cannot cover the entire object region, while those for large-scale anchors dilute the object with background. On the contrary, owing to region proposal, the two-stage methods detect the dog from a better initialization (see Figure 4.1b). Thus, the strengths of two-stage methods mirror exactly the single-stage drawbacks that lead to relatively lower detection accuracy. Although Zhang et al. developed RefineDet [6] to introduce two-step regression to the single-stage detector, it still failed to capture accurate detection features. That is, predefined feature sampling locations are not precise enough for describing refined anchor regions. (Note that a detailed comparison between RefineDet and our approach will be presented in Section 4.3.2.) Thus, there is an imperative need to overcome these single-stage limitations further for real-time accurate object detection.

FIGURE 4.1 Comparison of single-stage anchors and RPN outputs. For better visualization, only several key boxes are shown. (a) Multi-scale SSD anchors. (b) RPN outputs in Faster RCNN.
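The benefit of two-step regression discussed above can be illustrated with a small numerical sketch. It is not the chapter's implementation: the box parametrization is the common SSD-style center/size encoding, the `VARIANCES` values are assumed, and `imperfect_step` is a hypothetical stand-in for a learned regression head that recovers only part of the ideal offset. The anchor and target boxes are made-up values.

```python
import numpy as np

# Assumed SSD-style scaling factors; not taken from the chapter.
VARIANCES = (0.1, 0.2)


def encode(anchors, boxes):
    """Deltas that map (cx, cy, w, h) anchors onto target boxes."""
    d_xy = (boxes[:, :2] - anchors[:, :2]) / (VARIANCES[0] * anchors[:, 2:])
    d_wh = np.log(boxes[:, 2:] / anchors[:, 2:]) / VARIANCES[1]
    return np.concatenate([d_xy, d_wh], axis=1)


def decode(anchors, deltas):
    """Apply deltas to (cx, cy, w, h) anchors; inverse of encode."""
    xy = anchors[:, :2] + deltas[:, :2] * VARIANCES[0] * anchors[:, 2:]
    wh = anchors[:, 2:] * np.exp(deltas[:, 2:] * VARIANCES[1])
    return np.concatenate([xy, wh], axis=1)


def imperfect_step(anchors, targets, quality=0.7):
    """Hypothetical regression head: recovers only a fraction
    `quality` of the ideal deltas from the given anchors."""
    return decode(anchors, quality * encode(anchors, targets))


def iou(a, b):
    """IoU of two single boxes given as (cx, cy, w, h)."""
    a_min, a_max = a[:2] - a[2:] / 2, a[:2] + a[2:] / 2
    b_min, b_max = b[:2] - b[2:] / 2, b[:2] + b[2:] / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = overlap[0] * overlap[1]
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


# Made-up example: a small predefined anchor and a larger object box.
anchor = np.array([[0.5, 0.5, 0.2, 0.2]])
target = np.array([[0.6, 0.55, 0.5, 0.3]])

refined = imperfect_step(anchor, target)   # step 1: anchor refinement
final = imperfect_step(refined, target)    # step 2: regress from refined anchor

for name, box in [("predefined anchor", anchor),
                  ("after step 1", refined),
                  ("after step 2", final)]:
    print(f"{name}: IoU = {iou(box[0], target[0]):.3f}")
```

Under these assumptions, the second step starts from a box that already overlaps the object well, so the same imperfect regressor leaves a smaller residual error than a single step from the predefined anchor.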

In addition, most research has focused on detecting objects statically, ignoring temporal coherence in real-world applications. Detection in real-world scenes was introduced by the ImageNet video detection (VID) dataset [7]. To the best of our knowledge, the main ideas of temporal detection include (1) post-processing [8], (2) tracking-based location [9, 10], (3) feature aggregation with motion information [9, 11-14], (4) RNN-based feature propagation [13, 15-17], and (5) batch-frame processing (i.e., tubelet proposal) [18]. All these ideas are attractive in that they are able to leverage temporal information for detection, but each has its limitations. In brief, methods (1)-(4) borrow other tools (e.g., tracker, optical flow, and LSTM) for temporal analysis. Methods (3) and (4) focus on constructing superior temporal features, but they still detect objects in the static mode. Method (5) works in a non-causal offline mode, which rules it out for real-world tasks. Furthermore, most recent works pay so much attention to accuracy that high computational costs can harm time efficiency. Thus, a novel temporal detection mode should be developed for videos.

To overcome the aforementioned single-stage drawbacks, this chapter proposes a dual refinement mechanism for static and temporal visual detection, namely, anchor-offset detection. This joint anchor-feature refinement includes an anchor refinement, a feature location refinement, and a deformable detection head. The anchor refinement is developed for two-step regression, while the feature location refinement is proposed to capture accurate single-stage features. Besides, a deformable detection head is designed to leverage this dual refinement information. Based on the anchor-offset detection, we propose three approaches for object detection in images and videos. Firstly, a dual refinement network (DRNet) is proposed. DRNet is designed for static detection, where a multi-deformable head is developed to diversify detection receptive fields for more contextual information. Secondly, temporal refinement networks (TRNet) are designed, which perform anchor refinement across time for video detection. Thirdly, temporal dual refinement networks (TDRNet) are developed that extend the anchor-offset detection toward temporal tasks.

Additionally, for the temporal detection task, we propose a soft refinement strategy to match object motion with previous refinement information. Our proposed DRNet, TRNet, and TDRNet are validated on the PASCAL VOC [19], COCO [20], and ImageNet VID [7] datasets. As a result, our methods achieve real-time inference speed and considerably improved detection accuracy. Contributions are summarized as follows:

  • Starting with the drawbacks of single-stage detectors, an anchor-offset detection is proposed to perform two-step regression and capture accurate object features. The anchor-offset detection includes an anchor refinement, a feature-offset refinement, and a deformable detection head. Academically, without region-level processing, this joint anchor-feature refinement achieves single-stage region proposal. Thus, the anchor-offset detection bridges single-stage and two-stage detection and is thereby able to induce a new detection mode.
  • A DRNet based on the anchor-offset detection and a multi-deformable head is developed to elevate static detection accuracy while maintaining real-time inference speed for the image detection task.
  • As a new temporal detection mode for the video detection task, a TRNet and a TDRNet are proposed based on the anchor-offset detection without the aid of any other temporal modules. They are characterized by a better accuracy vs. speed trade-off and have a concise training process that does not require sequential data. In addition, a soft refinement strategy is designed to enhance the effectiveness of refinement information across time.
  • The single-stage DRNet maintains fast speed while acquiring significant improvements in accuracy, i.e., 84.4% mean average precision (mAP) on the VOC 2007 test set, 83.6% mAP on the VOC 2012 test set, and 42.4% AP on COCO test-dev. On the VID 2017 validation set, DRNet achieves 69.4% mAP, TRNet achieves 66.5% mAP, and TDRNet obtains 67.3% mAP.

The remainder of this chapter is organized as follows. Section 4.2 reviews related work. DRNet, including the anchor-offset detection and the multi-deformable head, is elaborated in Section 4.3. Section 4.4 presents TRNet and TDRNet in detail, and Section 4.5 provides the experimental results and discussion. Conclusions are summarized in Section 4.6.
