HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to require the introduction paragraphs of two Wikipedia articles to answer. Each question in the dataset comes with the two gold paragraphs, as well as a list of sentences in these paragraphs that crowdworkers identify as supporting facts necessary to answer the question.
325 PAPERS • 5 BENCHMARKS
The Set5 dataset is a dataset consisting of 5 images (“baby”, “bird”, “butterfly”, “head”, “woman”) commonly used for testing performance of Image Super-Resolution models. Image Source: http://people.rennes.inria.fr/Aline.Roumy/results/SR_BMVC12.html
322 PAPERS • 8 BENCHMARKS
The CASIA-WebFace dataset is used for face verification and face identification tasks. The dataset contains 494,414 face images of 10,575 real identities collected from the web.
321 PAPERS • 2 BENCHMARKS
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 video clips from 20 categories, and each video clip is annotated with 20 English sentences by Amazon Mechanical Turks. There are about 29,000 unique words in all captions. The standard splits uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.
319 PAPERS • 10 BENCHMARKS
The Arcade Learning Environment (ALE) is an object-oriented framework that allows researchers to develop AI agents for Atari 2600 games. It is built on top of the Atari 2600 emulator Stella and separates the details of emulation from agent design.
316 PAPERS • 57 BENCHMARKS
FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, many triples are inverses that cause leakage from the training to testing and validation splits. FB15k-237 was created by Toutanova and Chen (2015) to ensure that the testing and evaluation datasets do not have inverse relation test leakage. In summary, FB15k-237 dataset contains 310,079 triples with 14,505 entities and 237 relation types.
313 PAPERS • 3 BENCHMARKS
DomainNet is a dataset of common objects in six different domain. All domains include 345 categories (classes) of objects such as Bracelet, plane, bird and cello. The domains include clipart: collection of clipart images; real: photos and real world images; sketch: sketches of specific objects; infograph: infographic images with specific object; painting artistic depictions of objects in the form of paintings and quickdraw: drawings of the worldwide players of game “Quick Draw!”.
312 PAPERS • 7 BENCHMARKS
NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected from 40 subjects. The actions can be generally divided into three categories: 40 daily actions (e.g., drinking, eating, reading), nine health-related actions (e.g., sneezing, staggering, falling down), and 11 mutual actions (e.g., punching, kicking, hugging). These actions take place under 17 different scene conditions corresponding to 17 video sequences (i.e., S001–S017). The actions were captured using three cameras with different horizontal imaging viewpoints, namely, −45∘,0∘, and +45∘. Multi-modality information is provided for action characterization, including depth maps, 3D skeleton joint position, RGB frames, and infrared sequences. The performance evaluation is performed by a cross-subject test that split the 40 subjects into training and test groups, and by a cross-view test that employed one camera (+45∘) for testing, and the other two cameras for
308 PAPERS • 18 BENCHMARKS
The SUN RGBD dataset contains 10335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and segmentation map. As many as 700 object categories are labeled. The training and testing sets contain 5285 and 5050 images, respectively.
302 PAPERS • 12 BENCHMARKS
The Charades dataset is composed of 9,848 videos of daily indoors activities with an average length of 30 seconds, involving interactions with 46 objects classes in 15 types of indoor scenes and containing a vocabulary of 30 verbs leading to 157 action classes. Each video in this dataset is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacting objects. 267 different users were presented with a sentence, which includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence. In total, the dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. In the standard split there are7,986 training video and 1,863 validation video.
301 PAPERS • 5 BENCHMARKS
The GTA5 dataset contains 24966 synthetic images with pixel level semantic annotation. The images have been rendered using the open-world video game Grand Theft Auto 5 and are all from the car perspective in the streets of American-style virtual cities. There are 19 semantic classes which are compatible with the ones of Cityscapes dataset.
301 PAPERS • 7 BENCHMARKS
The Food-101 dataset consists of 101 food categories with 750 training and 250 test images per category, making a total of 101k images. The labels for the test images have been manually cleaned, while the training set contains some noise.
299 PAPERS • 10 BENCHMARKS
ImageNet-C is an open source data set that consists of algorithmically generated corruptions (blur, noise) applied to the ImageNet test-set.
299 PAPERS • 3 BENCHMARKS
The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is to do automated chest x-ray interpretation, featuring uncertainty labels and radiologist-labeled reference standard evaluation sets.
293 PAPERS • 1 BENCHMARK
The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles producd by Reuters in 1996-1997. It contains 804,414 manually labeled newswire documents, and categorized with respect to three controlled vocabularies: industries, topics and regions.
293 PAPERS • 5 BENCHMARKS
The Stanford 3D Indoor Scene Dataset (S3DIS) dataset contains 6 large-scale indoor areas with 271 rooms. Each point in the scene point cloud is annotated with one of the 13 semantic categories.
293 PAPERS • 7 BENCHMARKS
The DukeMTMC-reID (Duke Multi-Tracking Multi-Camera ReIDentification) dataset is a subset of the DukeMTMC for image-based person re-ID. The dataset is created from high-resolution videos from 8 different cameras. It is one of the largest pedestrian image datasets wherein images are cropped by hand-drawn bounding boxes. The dataset consists 16,522 training images of 702 identities, 2,228 query images of the other 702 identities and 17,661 gallery images.
292 PAPERS • 6 BENCHMARKS
The Multi-PIE (Multi Pose, Illumination, Expressions) dataset consists of face images of 337 subjects taken under different pose, illumination and expressions. The pose range contains 15 discrete views, capturing a face profile-to-profile. Illumination changes were modeled using 19 flashlights located in different places of the room.
284 PAPERS • 1 BENCHMARK
SemanticKITTI is a large-scale outdoor-scene dataset for point cloud semantic segmentation. It is derived from the KITTI Vision Odometry Benchmark which it extends with dense point-wise annotations for the complete 360 field-of-view of the employed automotive LiDAR. The dataset consists of 22 sequences. Overall, the dataset provides 23201 point clouds for training and 20351 for testing.
284 PAPERS • 9 BENCHMARKS
Yet Another Great Ontology (YAGO) is a Knowledge Graph that augments WordNet with common knowledge facts extracted from Wikipedia, converting WordNet from a primarily linguistic resource to a common knowledge base. YAGO originally consisted of more than 1 million entities and 5 million facts describing relationships between these entities. YAGO2 grounded entities, facts, and events in time and space, contained 446 million facts about 9.8 million entities, while YAGO3 added about 1 million more entities from non-English Wikipedia articles. YAGO3-10 a subset of YAGO3, containing entities which have a minimum of 10 relations each.
284 PAPERS • 7 BENCHMARKS
DeepFashion is a dataset containing around 800K diverse fashion images with their rich annotations (46 categories, 1,000 descriptive attributes, bounding boxes and landmark information) ranging from well-posed product images to real-world-like consumer photos.
280 PAPERS • 4 BENCHMARKS
The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. Those categories belong to 13 super-categories including Plantae (Plant), Insecta (Insect), Aves (Bird), Mammalia (Mammal), and so on. The iNat dataset is highly imbalanced with dramatically different number of images per category. For example, the largest super-category “Plantae (Plant)” has 196,613 images from 2,101 categories; whereas the smallest super-category “Protozoa” only has 381 images from 4 categories.
279 PAPERS • 6 BENCHMARKS
The NUS-WIDE dataset contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes.
278 PAPERS • 3 BENCHMARKS
FEVER is a publicly available dataset for fact extraction and verification against textual sources.
275 PAPERS • 3 BENCHMARKS
The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. Each image in this dataset has pixel-level segmentation annotations, bounding box annotations, and object class annotations. This dataset has been widely used as a benchmark for object detection, semantic segmentation, and classification tasks. The PASCAL VOC dataset is split into three subsets: 1,464 images for training, 1,449 images for validation and a private testing set.
271 PAPERS • 51 BENCHMARKS
DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 dialogues and validation and test sets with 1000 dialogues each. On average there are around 8 speaker turns per dialogue with around 15 tokens per turn.
270 PAPERS • 2 BENCHMARKS
The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside 90 real building-scale scenes, constructed from 194,400 RGB-D images. Each scene is a residential building consisting of multiple rooms and floor levels, and is annotated with surface construction, camera poses, and semantic segmentation.
268 PAPERS • 4 BENCHMARKS
The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
266 PAPERS • 3 BENCHMARKS
WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many text triples are obtained by inverting triples from the training set. Thus the WN18RR dataset is created to ensure that the evaluation dataset does not have inverse relation test leakage. In summary, WN18RR dataset contains 93,003 triples with 40,943 entities and 11 relation types.
265 PAPERS • 3 BENCHMARKS
The ReAding Comprehension dataset from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 questions from English exams, targeting Chinese students aged 12-18. RACE consists of two subsets, RACE-M and RACE-H, from middle school and high school exams, respectively. RACE-M has 28,293 questions and RACE-H has 69,574. Each question is associated with 4 candidate answers, one of which is correct. The data generation process of RACE differs from most machine reading comprehension datasets - instead of generating questions and answers by heuristics or crowd-sourcing, questions in RACE are specifically designed for testing human reading skills, and are created by domain experts.
263 PAPERS • 4 BENCHMARKS
The Multi-domain Wizard-of-Oz (MultiWOZ) dataset is a large-scale human-human conversational corpus spanning over seven domains, containing 8438 multi-turn dialogues, with each dialogue averaging 14 turns. Different from existing standard datasets like WOZ and DSTC2, which contain less than 10 slots and only a few hundred values, MultiWOZ has 30 (domain, slot) pairs and over 4,500 possible values. The dialogues span seven domains: restaurant, hotel, attraction, taxi, train, hospital and police.
262 PAPERS • 15 BENCHMARKS
The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of the Multi-Genre NLI (MultiNLI) corpus to 15 languages. The dataset was created by manually translating the validation and test sets of MultiNLI into each of those 15 languages. The English training set was machine translated for all languages. The dataset is composed of 122k train, 2490 validation and 5010 test examples.
262 PAPERS • 11 BENCHMARKS
Visual Question Answering (VQA) v2.0 is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It is the second version of the VQA dataset.
261 PAPERS • 3 BENCHMARKS
The tieredImageNet dataset is a larger subset of ILSVRC-12 with 608 classes (779,165 images) grouped into 34 higher-level nodes in the ImageNet human-curated hierarchy. This set of nodes is partitioned into 20, 6, and 8 disjoint sets of training, validation, and testing nodes, and the corresponding classes form the respective meta-sets. As argued in Ren et al. (2018), this split near the root of the ImageNet hierarchy results in a more challenging, yet realistic regime with test classes that are less similar to training classes.
259 PAPERS • 7 BENCHMARKS
This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive. The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of the newspaper texts selected based a greedy algorithm that increases the contextual and phonetic coverage. The details of the text selection algorithms are described in the following paper: C. Veaux, J. Yamagishi and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," https://doi.org/10.1109/ICSDA.2013.6709856. The rainbow passage and elicitation paragraph are the same for all speakers. The rainbow passage can be found at International Dialects of English Archive: (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph
255 PAPERS • 4 BENCHMARKS
PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart.
252 PAPERS • 1 BENCHMARK
protein roles—in terms of their cellular functions from gene ontology—in various protein-protein interaction (PPI) graphs, with each graph corresponding to a different human tissue . positional gene sets are used, motif gene sets and immunological signatures as features and gene ontology sets as labels (121 in total), collected from the Molecular Signatures Database . The average graph contains 2373 nodes, with an average degree of 28.8.
249 PAPERS • 2 BENCHMARKS
The THUMOS14 dataset is a large-scale video dataset that includes 1,010 videos for validation and 1,574 videos for testing from 20 classes. Among all the videos, there are 220 and 212 videos with temporal annotations in validation and testing set, respectively.
243 PAPERS • 17 BENCHMARKS
C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. It was used to train the T5 text-to-text Transformer models.
242 PAPERS • 2 BENCHMARKS
BookCorpus is a large collection of free novel books written by unpublished authors, which contains 11,038 books (around 74M sentences and 1G words) of 16 different sub-genres (e.g., Romance, Historical, Adventure, etc.).
240 PAPERS • NO BENCHMARKS YET
FGVC-Aircraft contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. The (main) aircraft in each image is annotated with a tight bounding box and a hierarchical airplane model label. Aircraft models are organized in a four-levels hierarchy. The four levels, from finer to coarser, are:
238 PAPERS • 7 BENCHMARKS
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
237 PAPERS • 4 BENCHMARKS
The German Traffic Sign Recognition Benchmark (GTSRB) contains 43 classes of traffic signs, split into 39,209 training images and 12,630 test images. The images have varying light conditions and rich backgrounds.
235 PAPERS • 3 BENCHMARKS
Animals with Attributes (AwA) was a dataset for benchmarking transfer-learning algorithms, in particular attribute base classification. It consisted of 30475 images of 50 animals classes with six pre-extracted feature representations for each image. The animals classes are aligned with Osherson's classical class/attribute matrix, thereby providing 85 numeric attribute values for each class. Using the shared attributes, it is possible to transfer information between different classes. The Animals with Attributes dataset was suspended. Its images are not available anymore because of copyright restrictions. A drop-in replacement, Animals with Attributes 2, is available instead.
233 PAPERS • 6 BENCHMARKS
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
228 PAPERS • 13 BENCHMARKS
PASCAL-S is a dataset for salient object detection consisting of a set of 850 images from PASCAL VOC 2010 validation set with multiple salient objects on the scenes.
227 PAPERS • 3 BENCHMARKS
The efforts to create a non-trivial and publicly available dataset for action recognition was initiated at the KTH Royal Institute of Technology in 2004. The KTH dataset is one of the most standard datasets, which contains six actions: walk, jog, run, box, hand-wave, and hand clap. To account for performance nuance, each action is performed by 25 different individuals, and the setting is systematically altered for each action per actor. Setting variations include: outdoor (s1), outdoor with scale variation (s2), outdoor with different clothes (s3), and indoor (s4). These variations test the ability of each algorithm to identify actions independent of the background, appearance of the actors, and the scale of the actors.
226 PAPERS • 2 BENCHMARKS
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Up to this point, the resource most used for this task was the MS-COCO dataset, containing around 120,000 images and 5-way image-caption annotations (produced by paid annotators).
224 PAPERS • 1 BENCHMARK