The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.
9,882 PAPERS • 96 BENCHMARKS
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
7,062 PAPERS • 83 BENCHMARKS
The Charades dataset is composed of 9,848 videos of daily indoors activities with an average length of 30 seconds, involving interactions with 46 objects classes in 15 types of indoor scenes and containing a vocabulary of 30 verbs leading to 157 action classes. Each video in this dataset is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacting objects. 267 different users were presented with a sentence, which includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence. In total, the dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. In the standard split there are7,986 training video and 1,863 validation video.
301 PAPERS • 5 BENCHMARKS
The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. Each image in this dataset has pixel-level segmentation annotations, bounding box annotations, and object class annotations. This dataset has been widely used as a benchmark for object detection, semantic segmentation, and classification tasks. The PASCAL VOC dataset is split into three subsets: 1,464 images for training, 1,449 images for validation and a private testing set.
269 PAPERS • 50 BENCHMARKS
Foggy Cityscapes is a synthetic foggy dataset which simulates fog on real scenes. Each foggy image is rendered with a clear image and depth map from Cityscapes. Thus the annotations and data split in Foggy Cityscapes are inherited from Cityscapes.
147 PAPERS • 3 BENCHMARKS
HICO-DET is a dataset for detecting human-object interactions (HOI) in images. It contains 47,776 images (38,118 in train set and 9,658 in test set), 600 HOI categories constructed by 80 object categories and 117 verb classes. HICO-DET provides more than 150k annotated human-object pairs. V-COCO provides 10,346 images (2,533 for training, 2,867 for validating and 4,946 for testing) and 16,199 person instances. Each person has annotations for 29 action categories and there are no interaction labels including objects.
112 PAPERS • 5 BENCHMARKS
PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are:
98 PAPERS • 11 BENCHMARKS
In Clipart1k, the target domain classes to be detected are the same as those in the source domain. All the images for a clipart domain were collected from one dataset (i.e., CMPlaces) and two image search engines (i.e., Openclipart2 and Pixabay3). Search queries used are 205 scene classes (e.g., pasture) used in CMPlaces to collect various objects and scenes with complex backgrounds.
29 PAPERS • NO BENCHMARKS YET
Watercolor2k is a dataset used for cross-domain object detection which contains 2k watercolor images with image and instance-level annotations.
27 PAPERS • 2 BENCHMARKS
Comic2k is a dataset used for cross-domain object detection which contains 2k comic images with image and instance-level annotations. Image Source: https://naoto0804.github.io/cross_domain_detection/
20 PAPERS • 4 BENCHMARKS
People-Art is an object detection dataset which consists of people in 43 different styles. People contained in this dataset are quite different from those in common photographs. There are 42 categories of art styles and movements including Naturalism, Cubism, Socialist Realism, Impressionism, and Suprematism
10 PAPERS • 2 BENCHMARKS
This dataset contains 5955 painting images (from WikiCommons) : a train set of 2978 images and a test set of 2977 images (for classification task). 1480 of the 2977 images are annotated with bounding boxes for 7 iconographic classes : ‘angel’,‘Child_Jesus’,‘crucifixion_of_Jesus’,‘Mary’,‘nudity’, ‘ruins’,‘Saint_Sebastien’.
6 PAPERS • 1 BENCHMARK