NABirds V1 is a collection of 48,000 annotated photographs of the 400 species of birds that are commonly observed in North America. More than 100 photographs are available for each species, including separate annotations for males, females and juveniles that comprise 700 visual categories. This dataset is to be used for fine-grained visual categorization experiments.
113 PAPERS • 1 BENCHMARK
RESISC45 is a dataset for Remote Sensing Image Scene Classification (RESISC). It contains 31,500 RGB images of size 256×256 divided into 45 scene classes, each containing 700 images. Among its notable features, the spatial resolution varies from 20 cm to more than 30 m per pixel.
RobustBench is a benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models within a reasonable computational budget. To this end, we start by considering the image classification task and introduce restrictions (possibly loosened in the future) on the allowed models.
113 PAPERS • NO BENCHMARKS YET
SimpleQuestions is a large-scale factoid question answering dataset. It consists of 108,442 natural language questions, each paired with a corresponding fact from Freebase knowledge base. Each fact is a triple (subject, relation, object) and the answer to the question is always the object. The dataset is divided into training, validation, and test sets with 75,910, 10,845 and 21,687 questions respectively.
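The structure described above (a question paired with a single triple whose object is the answer) can be sketched as a small data model. This is only an illustration: the class names, field names, and the sample question and fact are hypothetical placeholders, not the dataset's actual file format or contents. The split sizes are the ones quoted in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A knowledge-base triple (subject, relation, object)."""
    subject: str
    relation: str
    obj: str  # called "object" in the description; renamed to avoid shadowing


@dataclass(frozen=True)
class Example:
    """A natural language question paired with its supporting fact."""
    question: str
    fact: Fact

    @property
    def answer(self) -> str:
        # The answer to the question is always the object of the triple.
        return self.fact.obj


# Split sizes as stated in the description.
SPLIT_SIZES = {"train": 75_910, "valid": 10_845, "test": 21_687}
TOTAL_QUESTIONS = 108_442


def splits_are_consistent() -> bool:
    """Check that the three splits add up to the stated total."""
    return sum(SPLIT_SIZES.values()) == TOTAL_QUESTIONS


# Hypothetical example; the identifiers below are made up for illustration.
example = Example(
    question="Who wrote the example novel?",
    fact=Fact("example_novel", "written_by", "example_author"),
)
```

Reading `example.answer` returns the object of the stored triple, and `splits_are_consistent()` confirms that the three split sizes sum to the stated total.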
AFLW2000-3D is a dataset of 2000 images that have been annotated with image-level 68-point 3D facial landmarks. This dataset is used for evaluation of 3D facial landmark detection models. The head poses are very diverse and often hard for a CNN-based face detector to detect.
112 PAPERS • 8 BENCHMARKS
The LUNA challenges provide datasets for automatic nodule detection algorithms using the largest publicly available reference database of chest CT scans, the LIDC-IDRI data set. In LUNA16, participants develop their algorithm and upload their predictions on 888 CT scans in one of the two tracks: 1) the complete nodule detection track where a complete CAD system should be developed, or 2) the false positive reduction track where a provided set of nodule candidates should be classified.
112 PAPERS • 2 BENCHMARKS
MCTest is a freely available set of stories and associated questions intended for research on the machine comprehension of text.
A large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion.
112 PAPERS • 1 BENCHMARK
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
112 PAPERS • NO BENCHMARKS YET
Aff-Wild is a dataset for emotion recognition from facial images in a variety of head poses, illumination conditions and occlusions.
111 PAPERS • NO BENCHMARKS YET
DocVQA consists of 50,000 questions defined on 12,000+ document images.
111 PAPERS • 2 BENCHMARKS
The MRQA (Machine Reading for Question Answering) dataset is a dataset for evaluating the generalization capabilities of reading comprehension systems.
111 PAPERS • 1 BENCHMARK
CMU Panoptic is a large-scale dataset providing 1.5 million 3D pose annotations for multiple people engaging in social activities. It contains 65 videos (5.5 hours) with multi-view annotations, but only 17 of them are multi-person scenarios and have camera parameters.
111 PAPERS • 4 BENCHMARKS
Permuted MNIST is an MNIST variant consisting of 70,000 images of handwritten digits from 0 to 9, of which 60,000 are used for training and 10,000 for testing. It differs from the original MNIST in that each of the ten tasks is a multi-class classification of the images under a different random permutation of the input pixels.
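The construction of the ten tasks can be sketched as follows: draw one fixed random permutation of pixel indices per task and apply it to every flattened image in that task. This is a minimal sketch using only the standard library. The function names and the seed are assumptions, images are assumed to be flattened 28×28 arrays (784 values), and some published variants keep the first task unpermuted, which this sketch does not do.

```python
import random

NUM_PIXELS = 28 * 28  # flattened MNIST image size
NUM_TASKS = 10        # number of tasks stated in the description


def make_permutations(num_tasks=NUM_TASKS, num_pixels=NUM_PIXELS, seed=0):
    """Draw one fixed random pixel permutation per task."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permute_image(flat_image, perm):
    """Reorder the pixels of one flattened image according to perm."""
    return [flat_image[i] for i in perm]


# Usage: the same permutation is applied to every image of a given task.
permutations = make_permutations()
dummy_image = list(range(NUM_PIXELS))  # stand-in for real pixel values
task3_image = permute_image(dummy_image, permutations[3])
```

Because each permutation is fixed for its task, the label of an image is unchanged and only the spatial layout of the pixels differs between tasks.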
SHAPES is a dataset of synthetic images designed to benchmark systems for understanding of spatial and logical relations among multiple objects. The dataset consists of complex questions about arrangements of colored shapes. The questions are built around compositions of concepts and relations, e.g. "Is there a red shape above a circle?" or "Is a red shape blue?". Questions contain between two and four attributes, object types, or relationships. There are 244 questions and 15,616 images in total, with all questions having a yes and no answer (and corresponding supporting image). This eliminates the risk of learning biases.
VOT2016 is a video dataset for visual object tracking. It contains 60 video clips and 21,646 corresponding ground truth maps with pixel-wise annotation of salient objects.
This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarisation), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?” - even human experts find entity kktny hard to detect and resolve. This task will evaluate the ability to detect and classify novel, emerging, singleton named entities in noisy text.
The Caltech Occluded Faces in the Wild (COFW) dataset is designed to present faces in real-world conditions. Faces show large variations in shape and occlusions due to differences in pose, expression, use of accessories such as sunglasses and hats, and interactions with objects (e.g. food, hands, microphones, etc.). All images were hand annotated using the same 29 landmarks as in LFPW. Both the landmark positions and their occluded/unoccluded state were annotated. The faces are occluded to different degrees, with large variations in the type of occlusions encountered. COFW has an average occlusion of over 23%.
110 PAPERS • 5 BENCHMARKS
CORe50 is a dataset designed for assessing Continual Learning techniques in an Object Recognition context.
110 PAPERS • NO BENCHMARKS YET
OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed to the 100 sequences in the current version of the benchmark.
110 PAPERS • 2 BENCHMARKS
Sensory ecologists have found that this background matching camouflage strategy works by deceiving the visual perceptual system of the observer. Naturally, addressing concealed object detection (COD) requires a significant amount of visual perception knowledge. Understanding COD not only has scientific value in itself, but is also important for applications in many fundamental fields, such as computer vision (e.g., for search-and-rescue work, or rare species discovery), medicine (e.g., polyp segmentation, lung infection segmentation), agriculture (e.g., locust detection to prevent invasion), and art (e.g., recreational art). The high intrinsic similarities between the targets and non-targets make COD far more challenging than traditional object segmentation/detection. Although it has gained increased attention recently, studies on COD still remain scarce, mainly due to the lack of a sufficiently large dataset and a standard benchmark like Pascal-VOC, ImageNet, MS-COCO, ADE20K, and DA
109 PAPERS • 2 BENCHMARKS
The Chairs dataset contains rendered images of around 1000 different three-dimensional chair models.
109 PAPERS • 1 BENCHMARK
SCC Data Set
109 PAPERS • 3 BENCHMARKS
Visual Entailment (VE) consists of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. SNLI-VE is a dataset for VE which is based on the Stanford Natural Language Inference corpus and Flickr30k dataset.
LIAR is a publicly available dataset for fake news detection. It comprises 12.8K manually labeled short statements, collected over a decade in various contexts from POLITIFACT.COM, which provides a detailed analysis report and links to source documents for each case. This dataset can be used for fact-checking research as well. Notably, it is an order of magnitude larger than the previously largest public fake news datasets of similar type. The statements were obtained through POLITIFACT.COM's API, and each statement is evaluated by a POLITIFACT.COM editor for its truthfulness.
108 PAPERS • 1 BENCHMARK
The NLPR dataset for salient object detection consists of 1,000 image pairs captured by a standard Microsoft Kinect with a resolution of 640×480. The images include indoor and outdoor scenes (e.g., offices, campuses, streets and supermarkets).
The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies including Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. These anomalies are selected because they have a significant impact on public safety.
WinoBias contains 3,160 sentences, split equally for development and test, created by researchers familiar with the project. Sentences were created to follow two prototypical templates but annotators were encouraged to come up with scenarios where entities could be interacting in plausible ways. Templates were selected to be challenging and designed to cover cases requiring semantics and syntax separately.
108 PAPERS • NO BENCHMARKS YET
Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras and two stereo cameras, in addition to lidar point clouds and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion
107 PAPERS • 3 BENCHMARKS
The evaluation framework of Raganato et al. (2017) includes two training sets, SemCor (Miller et al., 1993) and OMSTI (Taghipour and Ng, 2015), and five test sets from the Senseval/SemEval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015), standardized to the same format and sense inventory (i.e. WordNet 3.0).
We observe that satellite imagery is a powerful source of information as it contains more structured and uniform data, compared to traditional images. Although the computer vision community has been accomplishing hard tasks on everyday image datasets using deep learning, satellite images have only recently been gaining attention for maps and population analysis. This workshop aims at bringing together a diverse set of researchers to advance the state of the art in satellite image analysis.
106 PAPERS • 2 BENCHMARKS
The FER+ dataset is an extension of the original FER dataset, where the images have been re-labelled into one of 8 emotion types: neutral, happiness, surprise, sadness, anger, disgust, fear, and contempt.
A new dataset with abstractive dialogue summaries.
106 PAPERS • 8 BENCHMARKS
A dataset of large scale alignments between Wikipedia abstracts and Wikidata triples. T-REx consists of 11 million triples aligned with 3.09 million Wikipedia abstracts (6.2 million sentences).
WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and represent a high diversity of styles.
106 PAPERS • 3 BENCHMARKS
Functional Map of the World (fMoW) is a dataset that aims to inspire the development of machine learning models capable of predicting the functional purpose of buildings and land use from temporal sequences of satellite images and a rich set of metadata features.
106 PAPERS • NO BENCHMARKS YET
The smallNORB dataset is a dataset for 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 degrees every 20 degrees). The training set is composed of 5 instances of each category (instances 4, 6, 7, 8 and 9), and the test set of the remaining 5 instances (instances 0, 1, 2, 3, and 5).
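The imaging grid above determines how many distinct viewing conditions each split contains, and a short calculation makes that explicit. This is a sketch built only from the numbers in the description, and the constant and function names are illustrative. Each viewing condition is captured by both cameras, so the number of individual images is twice the count computed here.

```python
# Imaging grid as given in the description.
NUM_LIGHTINGS = 6
ELEVATIONS = list(range(30, 71, 5))   # 30 to 70 degrees, every 5 degrees
AZIMUTHS = list(range(0, 341, 20))    # 0 to 340 degrees, every 20 degrees
NUM_CATEGORIES = 5

TRAIN_INSTANCES = [4, 6, 7, 8, 9]
TEST_INSTANCES = [0, 1, 2, 3, 5]


def views_per_object() -> int:
    """Number of (lighting, elevation, azimuth) combinations per toy."""
    return NUM_LIGHTINGS * len(ELEVATIONS) * len(AZIMUTHS)


def split_size(instances) -> int:
    """Number of viewing conditions in a split, over all categories."""
    return NUM_CATEGORIES * len(instances) * views_per_object()


train_views = split_size(TRAIN_INSTANCES)
test_views = split_size(TEST_INSTANCES)
```

With 9 elevations and 18 azimuths, each toy is seen under 972 conditions, so the 25 training toys and the 25 test toys each contribute 24,300 viewing conditions.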
106 PAPERS • 1 BENCHMARK
The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities, such as making a sandwich, tea, or coffee. Each activity is performed by four different people, giving 28 videos in total. Each video is approximately one minute long and contains about 20 fine-grained action instances, such as take bread or pour ketchup.
105 PAPERS • 2 BENCHMARKS
NELL-995 KG Completion Dataset
A-OKVQA is a crowdsourced visual question answering dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer.
104 PAPERS • 1 BENCHMARK
Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training.
104 PAPERS • NO BENCHMARKS YET
CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. In CrowS-Pairs a model is presented with two sentences: one that is more stereotyping and another that is less stereotyping. The data focuses on stereotypes about historically disadvantaged groups and contrasts them with advantaged groups.
The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos.
FreiHAND is a 3D hand pose dataset which records different hand actions performed by 32 people. For each hand image, MANO-based 3D hand pose annotations are provided. It currently contains 32,560 unique training samples and 3960 unique samples for evaluation. The training samples are recorded with a green screen background allowing for background removal. In addition, it applies three different post processing strategies to training samples for data augmentation. However, these post processing strategies are not applied to evaluation samples.
The Newsela dataset was introduced by Xu et al. in their research on text simplification. It is a corpus that includes thousands of news articles professionally leveled to different reading complexities. The dataset is used for academic research in fields such as text difficulty and text simplification. It is made available to academic partners upon request. The dataset is often used as a benchmark in the field of text simplification. Please note that the Newsela dataset is different from the NELA datasets, which are collections of news articles for the study of media bias and other applications.
GoEmotions is a corpus of 58k carefully curated comments extracted from Reddit, with human annotations to 27 emotion categories or Neutral.
102 PAPERS • 3 BENCHMARKS
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
102 PAPERS • 6 BENCHMARKS
The Penn Action Dataset contains 2326 video sequences of 15 different actions and human joint annotations for each sequence.
102 PAPERS • 4 BENCHMARKS