Perceptual Similarity is a dataset of human perceptual similarity judgments.
223 PAPERS • NO BENCHMARKS YET
WMT 2014 is a collection of datasets used in shared tasks of the Ninth Workshop on Statistical Machine Translation. The workshop featured four tasks:
223 PAPERS • 12 BENCHMARKS
The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a system’s ability to do commonsense reasoning. A Winograd schema is a pair of sentences differing in one or two words with a highly ambiguous pronoun, resolved differently in the two sentences, that appears to require commonsense knowledge to be resolved correctly. The examples were designed to be easily solvable by humans but difficult for machines, in principle requiring a deep understanding of the content of the text and the situation it describes.
220 PAPERS • 4 BENCHMARKS
The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs.
219 PAPERS • 1 BENCHMARK
The Shanghaitech dataset is a large-scale crowd counting dataset. It consists of 1198 annotated crowd images. The dataset is divided into two parts, Part-A containing 482 images and Part-B containing 716 images. Part-A is split into train and test subsets consisting of 300 and 182 images, respectively. Part-B is split into train and test subsets consisting of 400 and 316 images. Each person in a crowd image is annotated with one point close to the center of the head. In total, the dataset consists of 330,165 annotated people. Images from Part-A were collected from the Internet, while images from Part-B were collected on the busy streets of Shanghai.
218 PAPERS • 3 BENCHMARKS
The DeepMind Control Suite (DMCS) is a set of simulated continuous control environments with a standardized structure and interpretable rewards. The tasks are written and powered by the MuJoCo physics engine, making them easy to identify. Control Suite tasks include Pendulum, Acrobot, Cart-pole, Cart-k-pole, Ball in cup, Point-mass, Reacher, Finger, Hooper, Fish, Cheetah, Walker, Manipulator, Manipulator extra, Stacker, Swimmer, Humanoid, Humanoid_CMU and LQR.
217 PAPERS • 11 BENCHMARKS
FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures. The data has been sourced from 977 youtube videos and all videos contain a trackable mostly frontal face without occlusions which enables automated tampering methods to generate realistic forgeries.
216 PAPERS • 2 BENCHMARKS
The ATIS (Airline Travel Information Systems) is a dataset consisting of audio recordings and corresponding manual transcripts about humans asking for flight information on automated airline travel inquiry systems. The data consists of 17 unique intent categories. The original split contains 4478, 500 and 893 intent-labeled reference utterances in train, development and test set respectively.
214 PAPERS • 5 BENCHMARKS
Berkeley Segmentation Data Set 500 (BSDS500) is a standard benchmark for contour detection. This dataset is designed for evaluating natural edge detection that includes not only object contours but also object interior boundaries and background boundaries. It includes 500 natural images with carefully annotated boundaries collected from multiple users. The dataset is divided into three parts: 200 for training, 100 for validation and the rest 200 for test.
210 PAPERS • 8 BENCHMARKS
The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver in a wide variety of conditions.
210 PAPERS • 9 BENCHMARKS
The Digital Retinal Images for Vessel Extraction (DRIVE) dataset is a dataset for retinal vessel segmentation. It consists of a total of JPEG 40 color fundus images; including 7 abnormal pathology cases. The images were obtained from a diabetic retinopathy screening program in the Netherlands. The images were acquired using Canon CR5 non-mydriatic 3CCD camera with FOV equals to 45 degrees. Each image resolution is 584*565 pixels with eight bits per color channel (3 channels).
208 PAPERS • 2 BENCHMARKS
IMDB-BINARY is a movie collaboration dataset that consists of the ego-networks of 1,000 actors/actresses who played roles in movies in IMDB. In each graph, nodes represent actors/actress, and there is an edge between them if they appear in the same movie. These graphs are derived from the Action and Romance genres.
208 PAPERS • 2 BENCHMARKS
The MS-Celeb-1M dataset is a large-scale face recognition dataset consists of 100K identities, and each identity has about 100 facial images. The original identity labels are obtained automatically from webpages.
208 PAPERS • NO BENCHMARKS YET
Discrete Reasoning Over Paragraphs DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. The questions consist of passages extracted from Wikipedia articles. The dataset is split into a training set of about 77,000 questions, a development set of around 9,500 questions and a hidden test set similar in size to the development set.
206 PAPERS • 2 BENCHMARKS
The Pascal3D+ multi-view dataset consists of images in the wild, i.e., images of object categories exhibiting high variability, captured under uncontrolled settings, in cluttered scenes and under many different poses. Pascal3D+ contains 12 categories of rigid objects selected from the PASCAL VOC 2012 dataset. These objects are annotated with pose information (azimuth, elevation and distance to camera). Pascal3D+ also adds pose annotated images of these 12 categories from the ImageNet dataset.
204 PAPERS • 1 BENCHMARK
Argoverse is a tracking benchmark with over 30K scenarios collected in Pittsburgh and Miami. Each scenario is a sequence of frames sampled at 10 HZ. Each sequence has an interesting object called “agent”, and the task is to predict the future locations of agents in a 3 seconds future horizon. The sequences are split into training, validation and test sets, which have 205,942, 39,472 and 78,143 sequences respectively. These splits have no geographical overlap.
203 PAPERS • 5 BENCHMARKS
The ICDAR 2013 dataset consists of 229 training images and 233 testing images, with word-level annotations provided. It is the standard benchmark dataset for evaluating near-horizontal text detection.
203 PAPERS • 3 BENCHMARKS
The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers on Mechanical Turk were paid to watch a short video snippet and then summarize the action in a single sentence. The result is a set of roughly parallel descriptions of more than 2,000 video snippets. Because the workers were urged to complete the task in the language of their choice, both paraphrase and bilingual alternations are captured in the data.
203 PAPERS • 5 BENCHMARKS
D4RL is a collection of environments for offline reinforcement learning. These environments include Maze2D, AntMaze, Adroit, Gym, Flow, FrankKitchen and CARLA.
202 PAPERS • 14 BENCHMARKS
The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 training images and 1,111 test images. The dataset consists of pairs of a realistic blurry image and the corresponding ground truth shapr image that are obtained by a high-speed camera.
202 PAPERS • 2 BENCHMARKS
JHMDB is an action recognition dataset that consists of 960 video sequences belonging to 21 actions. It is a subset of the larger HMDB51 dataset collected from digitized movies and YouTube videos. The dataset contains video and annotation for puppet flow per frame (approximated optimal flow on the person), puppet mask per frame, joint positions per frame, action label per clip and meta label per clip (camera motion, visible body parts, camera viewpoint, number of people, video quality).
202 PAPERS • 8 BENCHMARKS
Clothing1M contains 1M clothing images in 14 classes. It is a dataset with noisy labels, since the data is collected from several online shopping websites and include many mislabelled samples. This dataset also contains 50k, 14k, and 10k images with clean labels for training, validation, and testing, respectively.
201 PAPERS • 3 BENCHMARKS
The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. While other datasets outdoors exist, they are all restricted to a small recording volume. 3DPW is the first one that includes video footage taken from a moving phone camera.
200 PAPERS • 4 BENCHMARKS
The Sentences Involving Compositional Knowledge (SICK) dataset is a dataset for compositional distributional semantics. It includes a large number of sentence pairs that are rich in the lexical, syntactic and semantic phenomena. Each pair of sentences is annotated in two dimensions: relatedness and entailment. The relatedness score ranges from 1 to 5, and Pearson’s r is used for evaluation; the entailment relation is categorical, consisting of entailment, contradiction, and neutral. There are 4439 pairs in the train split, 495 in the trial split used for development and 4906 in the test split. The sentence pairs are generated from image and video caption datasets before being paired up using some algorithm.
199 PAPERS • 3 BENCHMARKS
The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced question-answer pairs. Each training and validation image is also associated with scene graph annotations describing the classes and attributes of those objects in the scene, and their pairwise relations. Along with the images and question-answer pairs, the GQA dataset provides two types of pre-extracted visual features for each image – convolutional grid features of size 7×7×2048 extracted from a ResNet-101 network trained on ImageNet, and object detection features of size Ndet×2048 (where Ndet is the number of detected objects in each image with a maximum of 100 per image) from a Faster R-CNN detector.
195 PAPERS • 5 BENCHMARKS
The PASCAL Context dataset is an extension of the PASCAL VOC 2010 detection challenge, and it contains pixel-wise labels for all training images. It contains more than 400 classes (including the original 20 classes plus backgrounds from PASCAL VOC segmentation), divided into three categories (objects, stuff, and hybrids). Many of the object categories of this dataset are too sparse and; therefore, a subset of 59 frequent classes are usually selected for use.
195 PAPERS • 6 BENCHMARKS
CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with a 960×720 resolution camera mounted on the dashboard of a car. Those sequences were sampled (four of them at 1 fps and one at 15 fps) adding up to 701 frames. Those stills were manually annotated with 32 classes: void, building, wall, tree, vegetation, fence, sidewalk, parking block, column/pole, traffic cone, bridge, sign, miscellaneous text, traffic light, sky, tunnel, archway, road, road shoulder, lane markings (driving), lane markings (non-driving), animal, pedestrian, child, cart luggage, bicyclist, motorcycle, car, SUV/pickup/truck, truck/bus, train, and other moving object
194 PAPERS • 3 BENCHMARKS
LVIS is a dataset for long tail instance segmentation. It has annotations for over 1000 object categories in 164k images.
194 PAPERS • 7 BENCHMARKS
MSMT17 is a multi-scene multi-time person re-identification dataset. The dataset consists of 180 hours of videos, captured by 12 outdoor cameras, 3 indoor cameras, and during 12 time slots. The videos cover a long period of time and present complex lighting variations, and it contains a large number of annotated identities, i.e., 4,101 identities and 126,441 bounding boxes.
193 PAPERS • 6 BENCHMARKS
DUTS is a saliency detection dataset containing 10,553 training images and 5,019 test images. All training images are collected from the ImageNet DET training/val sets, while test images are collected from the ImageNet DET test set and the SUN data set. Both the training and test set contain very challenging scenarios for saliency detection. Accurate pixel-level ground truths are manually annotated by 50 subjects.
192 PAPERS • 5 BENCHMARKS
EMNIST (extended MNIST) has 4 times more data than MNIST. It is a set of handwritten digits with a 28 x 28 format.
191 PAPERS • 8 BENCHMARKS
The Extended Cohn-Kanade (CK+) dataset contains 593 video sequences from a total of 123 different subjects, ranging from 18 to 50 years of age with a variety of genders and heritage. Each video shows a facial shift from the neutral expression to a targeted peak expression, recorded at 30 frames per second (FPS) with a resolution of either 640x490 or 640x480 pixels. Out of these videos, 327 are labelled with one of seven expression classes: anger, contempt, disgust, fear, happiness, sadness, and surprise. The CK+ database is widely regarded as the most extensively used laboratory-controlled facial expression classification database available, and is used in the majority of facial expression classification methods.
190 PAPERS • 2 BENCHMARKS
The CommonsenseQA is a dataset for commonsense question answering task. The dataset consists of 12,247 questions with 5 choices each. The dataset was generated by Amazon Mechanical Turk workers in the following process (an example is provided in parentheses):
190 PAPERS • 2 BENCHMARKS
The SNIPS Natural Language Understanding benchmark is a dataset of over 16,000 crowdsourced queries distributed among 7 user intents of various complexity:
190 PAPERS • 4 BENCHMARKS
The Human Activity Recognition Dataset has been collected from 30 subjects performing six different activities (Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, Laying). It consists of inertial sensor data that was collected using a smartphone carried by the subjects.
189 PAPERS • 2 BENCHMARKS
AffectNet is a large facial expression dataset with around 0.4 million images manually labeled for the presence of eight (neutral, happy, angry, sad, fear, surprise, disgust, contempt) facial expressions along with the intensity of valence and arousal.
188 PAPERS • 3 BENCHMARKS
OpenSubtitles is collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
188 PAPERS • 1 BENCHMARK
In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. Input graphs are used to represent chemical compounds, where vertices stand for atoms and are labeled by the atom type (represented by one-hot encoding), while edges between vertices represent bonds between the corresponding atoms. It includes 188 samples of chemical compounds with 7 discrete node labels.
187 PAPERS • 2 BENCHMARKS
NAS-Bench-201 is a benchmark (and search space) for neural architecture search. Each architecture consists of a predefined skeleton with a stack of the searched cell. In this way, architecture search is transformed into the problem of searching a good cell.
187 PAPERS • 3 BENCHMARKS
The fastMRI dataset includes two types of MRI scans: knee MRIs and the brain (neuro) MRIs, and containing training, validation, and masked test sets. The deidentified imaging dataset provided by NYU Langone comprises raw k-space data in several sub-dataset groups. Curation of these data are part of an IRB approved study. Raw and DICOM data have been deidentified via conversion to the vendor-neutral ISMRMD format and the RSNA clinical trial processor, respectively. Also, each DICOM image is manually inspected for the presence of any unexpected protected health information (PHI), with spot checking of both metadata and image content. Knee MRI: Data from more than 1,500 fully sampled knee MRIs obtained on 3 and 1.5 Tesla magnets and DICOM images from 10,000 clinical knee MRIs also obtained at 3 or 1.5 Tesla. The raw dataset includes coronal proton density-weighted images with and without fat suppression. The DICOM dataset contains coronal proton density-weighted with and without fat suppr
187 PAPERS • 4 BENCHMARKS
MegaFace was a publicly available dataset which is used for evaluating the performance of face recognition algorithms with up to a million distractors (i.e., up to a million people who are not in the test set). MegaFace contains 1M images from 690K individuals with unconstrained pose, expression, lighting, and exposure. MegaFace captures many different subjects rather than many images of a small number of subjects. The gallery set of MegaFace is collected from a subset of Flickr. The probe set of MegaFace used in the challenge consists of two databases; Facescrub and FGNet. FGNet contains 975 images of 82 individuals, each with several images spanning ages from 0 to 69. Facescrub dataset contains more than 100K face images of 530 people. The MegaFace challenge evaluates performance of face recognition algorithms by increasing the numbers of “distractors” (going from 10 to 1M) in the gallery set. In order to evaluate the face recognition algorithms fairly, MegaFace challenge has two pro
186 PAPERS • 3 BENCHMARKS
The 100 Days Of Hands Dataset (100DOH) is a large-scale video dataset containing hands and hand-object interactions. It consists of 27.3K Youtube videos from 11 categories with nearly 131 days of footage of everyday interaction. The focus of the dataset is hand contact, and it includes both first-person and third-person perspectives. The videos in 100DOH are unconstrained and content-rich, ranging from records of daily life to specific instructional videos. To enforce diversity, the dataset contains no more than 20 videos from each uploader.
185 PAPERS • 3 BENCHMARKS
DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos - 60 for training, 30 for validation, 60 for testing
185 PAPERS • 9 BENCHMARKS
The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. It comprises 2000 5s-clips of 50 different classes across natural, human and domestic sounds, again, drawn from Freesound.org.
185 PAPERS • 6 BENCHMARKS
The Middlebury Stereo dataset consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data. The ground-truth disparities are acquired using a novel technique that employs structured lighting and does not require the calibration of the light projectors.
185 PAPERS • 5 BENCHMARKS
WebText is an internal OpenAI corpus created by scraping web pages with emphasis on document quality. The authors scraped all outbound links from Reddit which received at least 3 karma. The authors used the approach as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
184 PAPERS • NO BENCHMARKS YET
The HELEN dataset is composed of 2330 face images of 400×400 pixels with labeled facial components generated through manually-annotated contours along eyes, eyebrows, nose, lips and jawline.
183 PAPERS • 1 BENCHMARK