The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.
8,119 PAPERS • 55 BENCHMARKS
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
5,725 PAPERS • 57 BENCHMARKS
The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The images are labelled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). There are 6000 images per class with 5000 training and 1000 testing images per class.
5,658 PAPERS • 44 BENCHMARKS
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students) which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
4,076 PAPERS • 36 BENCHMARKS
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. There are 600 images per class. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are 500 training images and 100 testing images per class.
2,581 PAPERS • 27 BENCHMARKS
Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 fine annotated images and 20000 coarse annotated ones. Data was captured in 50 cities during several months, daytimes, and good weather conditions. It was originally recorded as video so the frames were manually selected to have the following features: large number of dynamic objects, varying scene layout, and varying background.
1,954 PAPERS • 27 BENCHMARKS
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odome
1,751 PAPERS • 87 BENCHMARKS
CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels indicating facial attributes like hair color, gender and age.
1,609 PAPERS • 10 BENCHMARKS
The Street View House Number (SVHN) is a digit classification benchmark dataset that contains 600000 32×32 RGB images of printed digits (from 0 to 9) cropped from pictures of house number plates. The cropped images are centered in the digit of interest, but nearby digits and other distractors are kept in the image. SVHN has three sets: training, testing sets and an extra set with 530000 images that are less difficult and can be used for helping with the training process
1,523 PAPERS • 10 BENCHMARKS
Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format and the structure of training and testing splits with the original MNIST.
1,363 PAPERS • 9 BENCHMARKS
Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 million QA pairs, 17 questions per image on average. Compared to the Visual Question Answering dataset, Visual Genome represents a more balanced distribution over 6 question types: What, Where, When, Who, Why and How. The Visual Genome dataset also presents 108K images with densely annotated objects, attributes and relationships.
903 PAPERS • 11 BENCHMARKS
The LFW dataset contains 13,233 images of faces collected from the web. This dataset consists of the 5749 identities with 1680 people with two or more images. In the standard LFW evaluation protocol the verification accuracies are reported on 6000 face pairs.
645 PAPERS • 5 BENCHMARKS
The miniImageNet dataset contains 100 classes randomly chosen from ImageNet ILSVRC-2012 challenge with 600 images of size 84×84 pixels per class. It is split into 64 base classes, 16 validation classes and 20 novel classes
556 PAPERS • 15 BENCHMARKS
The Places dataset is proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene categories with more than 5,000 images per category.
538 PAPERS • 3 BENCHMARKS
CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; each image comes with a number of highly compositional questions that fall into different categories. Those categories fall into 5 classes of tasks: Exist, Count, Compare Integer, Query Attribute and Compare Attribute. The CLEVR dataset consists of: a training set of 70k images and 700k questions, a validation set of 15k images and 150k questions, A test set of 15k images and 150k questions about objects, answers, scene graphs and functional programs for all train and validation images and questions. Each object present in the scene, aside of position, is characterized by a set of four attributes: 2 sizes: large, small, 3 shapes: square, cylinder, sphere, 2 material types: rubber, metal, 8 color types: gray, blue, brown, yellow, red, green, purple, cyan, resulting in 96 unique combinations.
528 PAPERS • 2 BENCHMARKS
Market-1501 is a large-scale public benchmark dataset for person re-identification. It contains 1501 identities which are captured by six different cameras, and 32,668 pedestrian image bounding-boxes obtained using the Deformable Part Models pedestrian detector. Each person has 3.6 images on average at each viewpoint. The dataset is split into two parts: 750 identities are utilized for training and the remaining 751 identities are used for testing. In the official testing protocol 3,368 query images are selected as probe set to find the correct match across 19,732 reference gallery images.
512 PAPERS • 9 BENCHMARKS
The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of 200 subcategories belonging to birds, 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes and 1 bounding box. The textual information comes from Reed et al.. They expand the CUB-200-2011 dataset by collecting fine-grained natural language descriptions. Ten single-sentence descriptions are collected for each image. The natural language descriptions are collected through the Amazon Mechanical Turk (AMT) platform, and are required at least 10 words, without any information of subcategories and actions.
506 PAPERS • 26 BENCHMARKS
The SYNTHIA dataset is a synthetic dataset that consists of 9400 multi-viewpoint photo-realistic frames rendered from a virtual city and comes with pixel-level semantic annotations for 13 classes. Each frame has resolution of 1280 × 960.
450 PAPERS • 6 BENCHMARKS
Visual Question Answering (VQA) is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. The first version of the dataset was released in October 2015. VQA v2.0 was released in April 2017.
440 PAPERS • 2 BENCHMARKS
ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. Up to now, ScanNet v2, the newest version of ScanNet, has collected 1513 annotated scans with an approximate 90% surface coverage. In the semantic segmentation task, this dataset is marked in 20 classes of annotated 3D voxelized objects.
424 PAPERS • 8 BENCHMARKS
The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding images captured by a high-speed motion capture system. There are 4 high-resolution progressive scan cameras to acquire video data at 50 Hz. The dataset contains activities by 11 professional actors in 17 scenarios: discussion, smoking, taking photo, talking on the phone, etc., as well as provides accurate 3D joint positions and high-resolution videos.
420 PAPERS • 9 BENCHMARKS
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object parts labels. There are totally 150 semantic categories, which include stuffs like sky, road, grass, and discrete objects like person, car, bed.
416 PAPERS • 9 BENCHMARKS
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions are be provided for each image.
415 PAPERS • 5 BENCHMARKS
The MPII Human Pose Dataset for single person pose estimation is composed of about 25K images of which 15K are training samples, 3K are validation samples and 7K are testing samples (which labels are withheld by the authors). The images are taken from YouTube videos covering 410 different human activities and the poses are manually annotated with up to 16 body joints.
411 PAPERS • 3 BENCHMARKS
The Large-scale Scene Understanding (LSUN) challenge aims to provide a different benchmark for large-scale scene classification and understanding. The LSUN classification dataset contains 10 scene categories, such as dining room, bedroom, chicken, outdoor church, and so on. For training data, each category contains a huge number of images, ranging from around 120,000 to 3,000,000. The validation data includes 300 images, and the test data has 1000 images for each category.
410 PAPERS • 9 BENCHMARKS
Caltech-256 is an object recognition dataset containing 30,607 real-world images, of different sizes, spanning 257 classes (256 object classes and an additional clutter class). Each class is represented by at least 80 images. The dataset is a superset of the Caltech-101 dataset.
404 PAPERS • 3 BENCHMARKS
The VGGFace2 dataset is made of around 3.31 million images divided into 9131 classes, each representing a different person identity. The dataset is divided into two splits, one for the training and one for test. The latter contains around 170000 images divided into 500 identities while all the other images belong to the remaining 8631 classes available for training. While constructing the datasets, the authors focused their efforts on reaching a very low label noise and a high pose and age diversity thus, making the VGGFace2 dataset a suitable choice to train state-of-the-art deep learning models on face-related tasks. The images of the training set have an average resolution of 137x180 pixels, with less than 1% at a resolution below 32 pixels (considering the shortest side).
404 PAPERS • 2 BENCHMARKS
The Multi-PIE (Multi Pose, Illumination, Expressions) dataset consists of face images of 337 subjects taken under different pose, illumination and expressions. The pose range contains 15 discrete views, capturing a face profile-to-profile. Illumination changes were modeled using 19 flashlights located in different places of the room.
400 PAPERS • 1 BENCHMARK
The Flickr30k dataset contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators.
380 PAPERS • 6 BENCHMARKS
The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or self-taught learning. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes (such as birds, cats, trucks), among which 5,000 images are partitioned for training while the remaining 8,000 images for testing. All the images are color images with 96×96 pixels in size.
369 PAPERS • 11 BENCHMARKS
BSD is a dataset used frequently for image denoising and super-resolution. Of the subdatasets, BSD100 is aclassical image dataset having 100 test images proposed by Martin et al.. The dataset is composed of a large variety of images ranging from natural images to object-specific such as plants, people, food etc. BSD100 is the testing set of the Berkeley segmentation dataset BSD300.
368 PAPERS • 42 BENCHMARKS
Omniglot is a large dataset of hand-written characters with 1623 characters and 20 examples for each character. These characters are collected based upon 50 alphabets from different countries. It contains both images and strokes data. Stroke data are coordinates with time in miliseconds.
366 PAPERS • 14 BENCHMARKS
The MS-Celeb-1M dataset is a large-scale face recognition dataset consists of 100K identities, and each identity has about 100 facial images. The original identity labels are obtained automatically from webpages.
348 PAPERS • NO BENCHMARKS YET
DeepFashion is a dataset containing around 800K diverse fashion images with their rich annotations (46 categories, 1,000 descriptive attributes, bounding boxes and landmark information) ranging from well-posed product images to real-world-like consumer photos.
343 PAPERS • 4 BENCHMARKS
The CUHK03 consists of 14,097 images of 1,467 different identities, where 6 campus cameras were deployed for image collection and each identity is captured by 2 campus cameras. This dataset provides two types of annotations, one by manually labelled bounding boxes and the other by bounding boxes produced by an automatic detector. The dataset also provides 20 random train/test splits in which 100 identities are selected for testing and the rest for training
318 PAPERS • 6 BENCHMARKS
The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant” and “chair” etc.) and a background category that contains the images not from the 101 object categories. For each object category, there are about 40 to 800 images, while most classes have about 50 images. The resolution of the image is roughly about 300×200 pixels.
318 PAPERS • 6 BENCHMARKS
The SUN RGBD dataset contains 10335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and segmentation map. As many as 700 object categories are labeled. The training and testing sets contain 5285 and 5050 images, respectively.
300 PAPERS • 6 BENCHMARKS
LabelMe database is a large collection of images with ground truth labels for object detection and recognition. The annotations come from two different sources, including the LabelMe online annotation tool.
292 PAPERS • 1 BENCHMARK
The Make3D dataset is a monocular Depth Estimation dataset that contains 400 single training RGB and depth map pairs, and 134 test samples. The RGB images have high resolution, while the depth maps are provided at low resolution.
289 PAPERS • 1 BENCHMARK
ChestX-ray8 is a medical imaging dataset which comprises 108,948 frontal-view X-ray images of 32,717 (collected from the year of 1992 to 2015) unique patients with the text-mined eight common disease labels, mined from the text radiological reports via NLP techniques.
284 PAPERS • NO BENCHMARKS YET
The image dataset TinyImages contains 80 million images of size 32×32 collected from the Internet, crawling the words in WordNet.
279 PAPERS • NO BENCHMARKS YET
The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution.
277 PAPERS • 8 BENCHMARKS
The CASIA-WebFace dataset is used for face verification and face identification tasks. The dataset contains 494,414 face images of 10,575 real identities collected from the web.
272 PAPERS • 2 BENCHMARKS
The Office dataset contains 31 object categories in three domains: Amazon, DSLR and Webcam. The 31 categories in the dataset consist of objects commonly encountered in office settings, such as keyboards, file cabinets, and laptops. The Amazon domain contains on average 90 images per class and 2817 images in total. As these images were captured from a website of online merchants, they are captured against clean background and at a unified scale. The DSLR domain contains 498 low-noise high resolution images (4288×2848). There are 5 objects per category. Each object was captured from different viewpoints on average 3 times. For Webcam, the 795 images of low resolution (640×480) exhibit significant noise and color as well as white balance artifacts.
260 PAPERS • 7 BENCHMARKS
Visual7W is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers. Each question starts with one of the seven Ws, what, where, when, who, why, how and which. It is collected from 47,300 COCO iamges and it has 327,929 QA pairs, together with 1,311,756 human-generated multiple-choices and 561,459 object groundings from 36,579 categories.
255 PAPERS • 1 BENCHMARK
The ICDAR 2013 dataset consists of 229 training images and 233 testing images, with word-level annotations provided. It is the standard benchmark dataset for evaluating near-horizontal text detection.
253 PAPERS • 3 BENCHMARKS
Taskonomy provides a large and high-quality dataset of varied indoor scenes.
253 PAPERS • 2 BENCHMARKS