PatchCamelyon (PCam) is an image classification dataset. It consists of 327,680 color images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. PCam provides a new benchmark for machine learning models: bigger than CIFAR10, smaller than ImageNet, trainable on a single GPU.
84 PAPERS • 3 BENCHMARKS
The goal of the SUN360 panorama database is to provide academic researchers in computer vision, computer graphics and computational photography, cognition and neuroscience, human perception, machine learning and data mining with a comprehensive collection of annotated panoramas covering a 360x180-degree full view for a large variety of environmental scenes, places and the objects within. To build the core of the dataset, the authors downloaded a large number of high-resolution panorama images from the Internet and grouped them into different place categories. They then designed a WebGL annotation tool for annotating the polygons and cuboids of objects in the scene.
84 PAPERS • 1 BENCHMARK
OCR is inevitably linked to NLP since its final output is text. Advances in document intelligence are driving the need for a unified technology that integrates OCR with various NLP tasks, especially semantic parsing. Since OCR and semantic parsing have so far been studied as separate tasks, datasets for each task on its own are plentiful, while those for integrated post-OCR parsing tasks are relatively scarce. In this study, we publish a consolidated dataset for receipt parsing as the first step towards post-OCR parsing tasks. The dataset consists of thousands of Indonesian receipts, with images and box/text annotations for OCR and multi-level semantic labels for parsing. The proposed dataset can be used to address various OCR and parsing tasks.
83 PAPERS • 1 BENCHMARK
ETH3D is a multi-view stereo / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground truth geometry has been obtained using a high-precision laser scanner. A DSLR camera as well as a synchronized multi-camera rig with varying field-of-view were used to capture images.
82 PAPERS • 1 BENCHMARK
The Synthetic Rain Datasets consists of 13,712 clean-rain image pairs gathered from multiple datasets (Rain14000, Rain1800, Rain800, Rain12). With a single trained model, evaluation could be performed on various test sets, including Rain100H, Rain100L, Test100, Test2800, and Test1200.
82 PAPERS • 5 BENCHMARKS
T-LESS is a dataset for estimating the 6D pose, i.e. translation and rotation, of texture-less rigid objects. The dataset features thirty industry-relevant objects with no significant texture and no discriminative color or reflectance properties. The objects exhibit symmetries and mutual similarities in shape and/or size. Compared to other datasets, a unique property is that some of the objects are parts of others. The dataset includes training and test images that were captured with three synchronized sensors, specifically a structured-light and a time-of-flight RGB-D sensor and a high-resolution RGB camera. There are approximately 39K training and 10K test images from each sensor. Additionally, two types of 3D models are provided for each object, i.e. a manually created CAD model and a semi-automatically reconstructed one. Training images depict individual objects against a black background. Test images originate from twenty test scenes having varying complexity, which increases from
82 PAPERS • 2 BENCHMARKS
VITON was a dataset for virtual try-on of clothing items. It consisted of 16,253 pairs of images of a person and a clothing item.
The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which contains images of complex everyday scenes. The COCO-Text dataset contains non-text images, legible text images and illegible text images. In total there are 22184 training images and 7026 validation images with at least one instance of legible text.
81 PAPERS • 2 BENCHMARKS
The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd density in the walkways was variable, ranging from sparse to very crowded. In the normal setting, the video contains only pedestrians. Abnormal events are due to either the circulation of non-pedestrian entities in the walkways or anomalous pedestrian motion patterns. Commonly occurring anomalies include bikers, skaters, small carts, and people walking across a walkway or in the grass that surrounds it. A few instances of people in wheelchairs were also recorded. All abnormalities are naturally occurring, i.e. they were not staged for the purposes of assembling the dataset. The data was split into 2 subsets, each corresponding to a different scene. The video footage recorded from each scene was split into various clips of around 200 frames.
81 PAPERS • 4 BENCHMARKS
The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images of 13,003 persons. Images are collected from five existing person re-identification datasets (CUHK03, Market-1501, SSM, VIPER, and CUHK01), and each image is annotated with 2 text descriptions by crowd-sourcing workers. Sentences incorporate rich details about person appearances, actions, and poses.
80 PAPERS • 4 BENCHMARKS
The Image Shadow Triplets dataset (ISTD) is a dataset for shadow understanding that contains 1870 image triplets of shadow image, shadow mask, and shadow-free image.
80 PAPERS • 2 BENCHMARKS
Indian Pines is a hyperspectral image segmentation dataset. The input data consists of hyperspectral bands over a single landscape in Indiana, US (the Indian Pines data set), with 145×145 pixels. For each pixel, the data set contains 220 spectral reflectance bands which represent different portions of the electromagnetic spectrum in the wavelength range (0.4–2.5) × 10⁻⁶ m.
RAVEN consists of 1,120,000 images and 70,000 RPM (Raven's Progressive Matrices) problems, equally distributed in 7 distinct figure configurations.
80 PAPERS • NO BENCHMARKS YET
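The RAVEN figures above can be sanity-checked with a little arithmetic. The sketch below only divides the counts quoted in the description; the variable names are illustrative and not part of any RAVEN tooling.

```python
# Figures quoted in the RAVEN description above.
total_images = 1_120_000
total_problems = 70_000
num_configurations = 7

# Images per RPM problem, and problems per figure configuration
# (the description says problems are equally distributed).
images_per_problem = total_images // total_problems
problems_per_configuration = total_problems // num_configurations

print(images_per_problem)          # 16
print(problems_per_configuration)  # 10000
```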
WikiArt contains paintings from 195 different artists. The dataset has 42129 images for training and 10628 images for testing.
xView is one of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes. It contains over 1M object instances from 60 different classes.
80 PAPERS • 1 BENCHMARK
ChestX-ray8 is a medical imaging dataset comprising 108,948 frontal-view X-ray images of 32,717 unique patients, collected from 1992 to 2015, with eight common disease labels text-mined from the radiological reports via NLP techniques.
78 PAPERS • NO BENCHMARKS YET
ImageNet-O consists of images from classes that are not found in the ImageNet-1k dataset. It is used to test the robustness of vision models to out-of-distribution samples. It's reported using the AUPR metric.
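As a rough illustration of the AUPR metric mentioned above, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. It assumes that out-of-distribution images are the positive class and that a higher score means "more anomalous"; the function name and the toy scores are hypothetical, and tied scores are not treated specially.

```python
def average_precision(scores, labels):
    """Step-wise estimate of the area under the precision-recall curve.

    scores: anomaly scores, higher = more likely out-of-distribution (assumed).
    labels: 1 for out-of-distribution (positive), 0 for in-distribution.
    """
    num_positive = sum(labels)
    if num_positive == 0:
        raise ValueError("at least one positive label is required")

    # Rank samples from most to least anomalous.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this positive is retrieved.
            precision_sum += hits / rank
    return precision_sum / num_positive


# Toy example with made-up scores: positives land at ranks 1 and 3.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
ap = average_precision(toy_scores, toy_labels)  # (1/1 + 2/3) / 2
```

In practice a library routine such as scikit-learn's `average_precision_score` would normally be used instead of a hand-written loop.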
The Light Field Saliency Database (LFSD) contains 100 light fields with 360×360 spatial resolution. A rough focal stack and an all-focus image are provided for each light field. The images in this dataset usually have one salient foreground object and a background with good color contrast.
78 PAPERS • 1 BENCHMARK
The Oulu-CASIA NIR&VIS facial expression database consists of six expressions (surprise, happiness, sadness, anger, fear and disgust) from 80 people between 23 and 58 years old. 73.8% of the subjects are male. The subjects were asked to sit on a chair in the observation room, facing the camera. The camera-face distance is about 60 cm. Subjects were asked to make a facial expression according to an expression example shown in picture sequences. The imaging hardware works at a rate of 25 frames per second and the image resolution is 320 × 240 pixels.
78 PAPERS • 4 BENCHMARKS
VQG is a collection of datasets for visual question generation. VQG questions were collected by crowdsourcing the task on Amazon Mechanical Turk (AMT). The authors give the prompt and the specific instructions for all crowdsourcing tasks in the supplementary material of the paper. The prompt was successful at capturing nonliteral questions. Images were taken from the MSCOCO dataset.
CULane is a large scale challenging dataset for academic research on traffic lane detection. It is collected by cameras mounted on six different vehicles driven by different drivers in Beijing. More than 55 hours of videos were collected and 133,235 frames were extracted. The dataset is divided into 88880 images for training set, 9675 for validation set, and 34680 for test set. The test set is divided into normal and 8 challenging categories.
77 PAPERS • 1 BENCHMARK
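The CULane split sizes quoted above add up to the total number of extracted frames. The short sketch below checks that and derives the share of each split; the dictionary and variable names are illustrative only.

```python
# Split sizes and frame total quoted in the CULane description above.
splits = {"train": 88_880, "validation": 9_675, "test": 34_680}
total_frames = 133_235

# The three splits should account for every extracted frame.
assert sum(splits.values()) == total_frames

# Share of the dataset taken by each split.
fractions = {name: count / total_frames for name, count in splits.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```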
MPIIGaze is a dataset for appearance-based gaze estimation in the wild. It contains 213,659 images collected from 15 participants during natural everyday laptop use over more than three months. It has a large variability in appearance and illumination.
NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset can be used for semantic parsing.
77 PAPERS • 3 BENCHMARKS
The Radboud Faces Database (RaFD) is a set of pictures of 67 models (adults and children, males and females) displaying 8 emotional expressions.
77 PAPERS • 2 BENCHMARKS
ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.
77 PAPERS • NO BENCHMARKS YET
Volleyball is a video action recognition dataset. It has 4830 annotated frames that were handpicked from 55 videos with 9 player action labels and 8 team activity labels. It contains group activity annotations as well as individual activity annotations.
The Parkinson’s Progression Markers Initiative (PPMI) dataset originates from an observational, longitudinal clinical study comprising evaluations of people with Parkinson’s disease (PD), people at high risk, and healthy people.
76 PAPERS • 3 BENCHMARKS
AbstractReasoning is a dataset for abstract reasoning, where the goal is to infer the correct answer from the context panels based on abstract reasoning.
75 PAPERS • NO BENCHMARKS YET
Consists of 8,422 blurry and sharp image pairs with 65,784 densely annotated FG human bounding boxes.
74 PAPERS • 4 BENCHMARKS
The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed, head ROI, and pixel-level trimap segmentation.
74 PAPERS • 5 BENCHMARKS
The PROMISE12 dataset was made available for the MICCAI 2012 prostate segmentation challenge. Magnetic Resonance (MR) images (T2-weighted) of 50 patients with various diseases were acquired at different locations with several MRI vendors and scanning protocols.
74 PAPERS • 2 BENCHMARKS
Places-LT has an imbalanced training set with 62,500 images for 365 classes from Places-2. The class frequencies follow a natural power law distribution with a maximum number of 4,980 images per class and a minimum number of 5 images per class. The validation and testing sets are balanced and contain 20 and 100 images per class respectively.
74 PAPERS • 1 BENCHMARK
VQA-RAD consists of 3,515 question–answer pairs on 315 radiology images.
The BRATS2017 dataset contains 285 brain tumor MRI scans, with four MRI modalities (T1, T1ce, T2, and Flair) for each scan. The dataset also provides full masks for brain tumors, with labels for ED, ET, and NET/NCR. The segmentation evaluation is based on three tasks: WT, TC and ET segmentation.
73 PAPERS • 1 BENCHMARK
GuessWhat?! is a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images.
73 PAPERS • NO BENCHMARKS YET
SIM10k is a synthetic dataset containing 10,000 images rendered from the video game Grand Theft Auto V (GTA5).
73 PAPERS • 3 BENCHMARKS
The ReferIt dataset contains 130,525 expressions for referring to 96,654 objects in 19,894 images of natural scenes.
72 PAPERS • NO BENCHMARKS YET
Contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
71 PAPERS • 1 BENCHMARK
The TuSimple dataset consists of 6,408 road images on US highways. The image resolution is 1280×720. The dataset is composed of 3,626 images for training, 358 for validation, and 2,782 for testing (the TuSimple test set), whose images were captured under different weather conditions.
ONCE (One millioN sCenEs) is a dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, which is 20x longer than other 3D autonomous driving datasets available like nuScenes and Waymo, and it is collected across a range of different areas, periods and weather conditions.
70 PAPERS • NO BENCHMARKS YET
The UT-Kinect dataset is a dataset for action recognition from depth sequences. The videos were captured using a single stationary Kinect. There are 10 action types: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, clap hands. There are 10 subjects, and each subject performs each action twice. Three channels were recorded: RGB, depth and skeleton joint locations. The three channels are synchronized. The frame rate is 30 fps.
70 PAPERS • 2 BENCHMARKS
BigEarthNet consists of 590,326 Sentinel-2 image patches, each of which is a section of i) 120x120 pixels for 10m bands; ii) 60x60 pixels for 20m bands; and iii) 20x20 pixels for 60m bands.
69 PAPERS • 3 BENCHMARKS
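The three BigEarthNet patch sizes above describe the same ground area at different resolutions. The sketch below multiplies pixels per side by the band resolution, assuming that "10m", "20m" and "60m" denote the ground sampling distance in metres per pixel; the names are illustrative.

```python
# (pixels per side, ground sampling distance in metres per pixel),
# taken from the BigEarthNet description above.
band_groups = {
    "10m": (120, 10),
    "20m": (60, 20),
    "60m": (20, 60),
}

# Side length of the ground footprint covered by each patch, in metres.
footprints_m = {
    name: pixels * gsd for name, (pixels, gsd) in band_groups.items()
}

print(footprints_m)  # {'10m': 1200, '20m': 1200, '60m': 1200}
```

Under that assumption, every patch covers a 1.2 km × 1.2 km area regardless of band group.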
ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China under varying weather conditions. Pixel-wise semantic annotation of the recorded data is provided in 2D, with point-wise semantic annotation in 3D for 28 classes. In addition, the dataset contains lane marking annotations in 2D.
68 PAPERS • 5 BENCHMARKS
A large-scale multi-object tracking dataset for human tracking under occlusion, frequent crossover, uniform appearance and diverse body gestures. It is proposed to emphasize the importance of motion analysis in multi-object tracking, rather than relying mainly on an appearance-matching-based paradigm.
68 PAPERS • 1 BENCHMARK
The MIT-States dataset has 245 object classes, 115 attribute classes and ∼53K images. There is a wide range of objects (e.g., fish, persimmon, room) and attributes (e.g., mossy, deflated, dirty). On average, each object instance is modified by one of the 9 attributes it affords.
68 PAPERS • 4 BENCHMARKS
SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (~6× larger than existing benchmarks), spanning 12 evaluation dimensions that include the comprehension of both the image and video modalities.
68 PAPERS • NO BENCHMARKS YET
The Comprehensive Cars (CompCars) dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models. There are a total of 136,726 images capturing the entire cars and 27,618 images capturing the car parts. The full car images are labeled with bounding boxes and viewpoints. Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car. The surveillance-nature data contains 50,000 car images captured in the front view.
67 PAPERS • 1 BENCHMARK
DICM is a dataset for low-light enhancement which consists of 69 images collected with commercial digital cameras.