Benchmarks of Object Detection

PASCAL VOC

PASCAL VOC is one of the representative datasets in the field of object detection [6]. It consists of two sub-datasets, VOC2007 and VOC2012, both of which require the algorithm to predict objects among 20 categories (e.g., airplanes, trains, cats, dogs, and birds). VOC2007 contains 9,963 images, split into a training set, a validation set, and a test set of 2,501, 2,510, and 4,952 images, respectively. VOC2012 contains 22,531 images, likewise split into a training set, a validation set, and a test set of 5,717, 5,823, and 10,991 images, respectively.

In the common VOC2007 setting, the training and validation sets of VOC2007 and VOC2012 (called 07+12, 16,551 images in total) are combined for training, and the VOC2007 test set is used for evaluation. In the VOC2012 setting, the training and validation sets of VOC2007 and VOC2012 together with the VOC2007 test set (called 07++12, 21,503 images in total) are combined for training, and the VOC2012 test set is used for evaluation.
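As a concrete illustration, the 07+12 split can be assembled with torchvision's built-in VOC loader. This is a minimal sketch assuming both datasets have already been downloaded under a local data/VOC directory (the path is an assumption):

```python
# Minimal sketch: assemble the 07+12 training split with torchvision.
# Assumes VOC2007 and VOC2012 already exist under data/VOC.
from torch.utils.data import ConcatDataset
from torchvision.datasets import VOCDetection

voc07_trainval = VOCDetection("data/VOC", year="2007", image_set="trainval")
voc12_trainval = VOCDetection("data/VOC", year="2012", image_set="trainval")

train_07_12 = ConcatDataset([voc07_trainval, voc12_trainval])  # 16,551 images
voc07_test = VOCDetection("data/VOC", year="2007", image_set="test")  # 4,952 images
```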

MS COCO

MS COCO is a recent dataset for object detection and segmentation tasks [7], and it is one of the most popular datasets in the field of object detection. It requires the algorithm to predict objects among 80 categories (e.g., people, bottles, and chairs). In 2014, the dataset released 82,783 training samples, 40,504 validation samples, and 40,775 test samples. Although there were adjustments later, the training and validation sets remained almost unchanged. Annotations of the training and validation sets are available, whereas annotations of the test set are unpublished; one needs to submit results online for evaluation. In addition, MS COCO divides objects into three types according to object size: small objects have an area below 32² pixels, medium objects have an area between 32² and 96² pixels, and large objects have an area above 96² pixels.
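The size buckets can be expressed as a small helper; the function below is illustrative (its name is ours, not part of any official API), with areas measured in squared pixels:

```python
# Illustrative helper (not an official COCO API): assign an object to the
# small/medium/large bucket by its area in squared pixels.
def coco_size_bucket(area: float) -> str:
    if area < 32 ** 2:   # small: area < 32^2
        return "small"
    if area < 96 ** 2:   # medium: 32^2 <= area < 96^2
        return "medium"
    return "large"       # large: area >= 96^2
```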

Usually, only 5,000 samples of the validation set are used for validation (called minival2014), while the training set and the remaining validation samples (called trainval35k, 118,287 images in total) are combined for training. Evaluation is performed on the test set (called test-dev, 20,288 images in total).
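For validation splits whose annotations are public, evaluation can be run offline with the official pycocotools package; test-dev results must instead be submitted to the evaluation server. A minimal sketch, assuming ground-truth and detection files at the paths shown (the file paths are assumptions):

```python
# Minimal sketch of offline COCO evaluation with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2014.json")  # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")          # model predictions
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP averaged over IoUs, plus AP_S/AP_M/AP_L
```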

ImageNet VID

Unlike PASCAL VOC and MS COCO, the samples of the ImageNet VID dataset are videos. This dataset requires the algorithm to predict objects among 30 categories (e.g., sheep, dogs, airplanes, and ships) in videos. The training set of ImageNet VID contains 4,000 video samples, and the validation set contains 1,314 video samples.

Although video data provide abundant frames, their disadvantage is a lack of diversity, since adjacent frames are highly similar, which harms model generalization. Therefore, the ImageNet DET dataset is usually used to train static detectors. DET covers 200 object categories, and the 30 VID categories are a subset of them, so DET (restricted to the VID-consistent categories) and VID are combined for training. Specifically, at most 2,000 samples per category are selected from DET and 10 samples per category are selected from VID to form the training set, as sketched below. Model evaluation is conducted on the VID validation set.
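The sampling rule above can be sketched as follows; det_images_by_class and vid_frames_by_class are hypothetical dictionaries mapping each of the 30 VID categories to its candidate samples:

```python
# Illustrative sketch of the DET+VID sampling rule: at most 2,000 samples
# per category from DET and 10 per category from VID.
# det_images_by_class / vid_frames_by_class are hypothetical inputs.
import random

def build_vid_training_set(det_images_by_class, vid_frames_by_class, seed=0):
    rng = random.Random(seed)
    train = []
    for samples in det_images_by_class.values():
        train += rng.sample(samples, min(2000, len(samples)))
    for samples in vid_frames_by_class.values():
        train += rng.sample(samples, min(10, len(samples)))
    return train
```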

Evaluation Metrics

Object detection is evaluated with average precision (AP). First, the IoU between each prediction and the ground truth is calculated, and each prediction is then assigned to one of three categories (a matching sketch follows the list):

  • 1) True positive (TP): Prediction is an object, and it matches a ground truth.
  • 2) False positive (FP): Prediction is an object, but it does not match any ground truth.
  • 3) False negative (FN): Under the given IoU threshold, no prediction matches the ground truth.
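A minimal sketch of this matching procedure for a single image and class, assuming axis-aligned boxes in (x1, y1, x2, y2) format and a fixed IoU threshold (the function names are ours):

```python
# Illustrative greedy matching of predictions to ground truths at a fixed
# IoU threshold; each ground truth may be matched at most once.
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, iou_thr=0.5):
    """preds: list of {'box': ..., 'score': ...}; gts: list of boxes.
    Returns (TP, FP, FN) counts for one image and one class."""
    matched = set()
    tp = fp = 0
    for p in sorted(preds, key=lambda d: d["score"], reverse=True):
        best_i, best_iou = None, iou_thr
        for i, g in enumerate(gts):
            if i in matched:
                continue
            ov = iou(p["box"], g)
            if ov >= best_iou:
                best_i, best_iou = i, ov
        if best_i is None:
            fp += 1              # no unmatched ground truth overlaps enough
        else:
            matched.add(best_i)  # true positive
            tp += 1
    fn = len(gts) - len(matched) # ground truths left unmatched
    return tp, fp, fn
```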

Therefore, the recall rate can be formulated as:

$$R = \frac{TP}{TP + FN}$$

Precision can be formulated as:

$$P = \frac{TP}{TP + FP}$$

Detection results can be sorted by confidence. Different recall rates are then obtained as the confidence threshold varies, and P can be described as a function of R, yielding the P-R curve. As a result, AP is the area under the P-R curve:

$$AP = \int_0^1 P(R)\,dR$$

Averaging AP over multiple classes yields the mean average precision (mAP):

$$mAP = \frac{1}{C}\sum_{c=1}^{C} AP_c$$

where C is the number of classes.
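Putting the pieces together, AP can be computed from the confidence-sorted TP/FP labels (e.g., those produced by the matching sketch above). This is a minimal all-point-interpolation sketch, and the function name is ours:

```python
# Illustrative all-point-interpolated AP for one class; mAP is then the
# mean of per-class APs.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores/is_tp: one entry per detection of this class; num_gt: total
    ground-truth boxes of this class (TP + FN)."""
    if len(scores) == 0 or num_gt == 0:
        return 0.0
    order = np.argsort(-np.asarray(scores))            # sort by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt                           # R = TP / (TP + FN)
    precision = cum_tp / (cum_tp + cum_fp)             # P = TP / (TP + FP)
    # monotonically decreasing precision envelope, then integrate over R
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    deltas = np.diff(np.concatenate(([0.0], recall)))  # recall increments
    return float(np.sum(deltas * precision))
```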

In most benchmarks, the IoU threshold is fixed, e.g., 0.5 in PASCAL VOC. In contrast, MS COCO evaluation uses uniformly varying IoU thresholds, i.e., IoU ∈ [0.5:0.05:0.95]. In this manner, 10 mAP values are obtained, and their average describes overall detection performance. In addition, MS COCO evaluates size-related AP, producing AP_S, AP_M, and AP_L for small, medium, and large objects, respectively.
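COCO-style AP then simply averages mAP over the ten thresholds; in the sketch below, map_at_iou is a hypothetical evaluator returning mAP at a single IoU threshold:

```python
# Hedged sketch of COCO-style AP: average mAP over IoU in 0.50:0.05:0.95.
# map_at_iou(...) is a hypothetical single-threshold evaluator.
import numpy as np

iou_thresholds = np.linspace(0.5, 0.95, 10)  # 0.50, 0.55, ..., 0.95
coco_ap = float(np.mean([map_at_iou(preds, gts, t) for t in iou_thresholds]))
```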
