The Newsela dataset was introduced by Xu et al. in their research on text simplification. It is a corpus that includes thousands of news articles professionally leveled to different reading complexities. The dataset is used for academic research in fields such as text difficulty and text simplification. It is made available to academic partners upon request. The dataset is often used as a benchmark in the field of text simplification. Please note that the Newsela dataset is different from the NELA datasets, which are collections of news articles for the study of media bias and other applications.
104 PAPERS • 1 BENCHMARK
Multi-News, consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
102 PAPERS • 6 BENCHMARKS
The APPS dataset consists of problems collected from different open-access coding websites such as Codeforces, Kattis, and more. The APPS benchmark attempts to mirror how humans programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions. The problems range in difficulty from introductory to collegiate competition level and measure coding ability as well as problem-solving.
100 PAPERS • 1 BENCHMARK
A hand-object interaction dataset with 3D pose annotations of hand and object. The dataset contains 66,034 training images and 11,524 test images from a total of 68 sequences. The sequences are captured in multi-camera and single-camera setups and contain 10 different subjects manipulating 10 different objects from YCB dataset. The annotations are automatically obtained using an optimization algorithm. The hand pose annotations for the test set are withheld and the accuracy of the algorithms on the test set can be evaluated with standard metrics using the CodaLab challenge submission(see project page). The object pose annotations for the test and train set are provided along with the dataset.
99 PAPERS • 2 BENCHMARKS
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news. The goal of ReCoRD is to evaluate a machine's ability of commonsense reasoning in reading comprehension. ReCoRD is pronounced as [ˈrɛkərd].
99 PAPERS • 1 BENCHMARK
Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of current approaches, however, attempt to detect the overall polarity of a sentence, paragraph, or text span, regardless of the entities mentioned (e.g., laptops, restaurants) and their aspects (e.g., battery, screen; food, service). By contrast, this task is concerned with aspect based sentiment analysis (ABSA), where the goal is to identify the aspects of given target entities and the sentiment expressed towards each aspect. Datasets consisting of customer reviews with human-authored annotations identifying the mentioned aspects of the target entities and the sentiment polarity of each aspect will be provided.
99 PAPERS • 5 BENCHMARKS
The Visual Storytelling Dataset (VIST) consists of 210,819 unique photos and 50,000 stories. The images were collected from albums on Flickr. The albums included 10 to 50 images and all the images in an album are taken in a 48-hour span. The stories were created by workers on Amazon Mechanical Turk, where the workers were instructed to choose five images from the album and write a story about them. Every story has five sentences, and every sentence is paired with its appropriate image. The dataset is split into 3 subsets, a training set (80%), a validation set (10%) and a test set (10%). All the words and interpunction signs in the stories are separated by a space character and all the location names are replaced with the word location. All the names of people are replaced with the words male or female depending on the gender of the person.
98 PAPERS • 2 BENCHMARKS
We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.
97 PAPERS • 1 BENCHMARK
Dataset composed of online banking queries annotated with their corresponding intents.
97 PAPERS • 5 BENCHMARKS
This corpus comprises of monolingual data for 100+ languages and also includes data for romanized languages. This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises of documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository.
96 PAPERS • NO BENCHMARKS YET
Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The videos are captured at 25 fps and downsampled to 1 fps for processing. The whole dataset is labeled with the phase and tool presence annotations. The phases have been defined by a senior surgeon in Strasbourg hospital, France. Since the tools are sometimes hardly visible in the images and thus difficult to be recognized visually, a tool is defined as present in an image if at least half of the tool tip is visible.
96 PAPERS • 2 BENCHMARKS
The FIGER dataset is an entity recognition dataset where entities are labelled using fine-grained system 112 tags, such as person/doctor, art/written_work and building/hotel. The tags are derivied from Freebase types. The training set consists of Wikipedia articles automatically annotated with distant supervision approach that utilizes the information encoded in anchor links. The test set was annotated manually.
The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, memo, etc. The dataset has 320,000 training, 40,000 validation and 40,000 test images. The images are characterized by low quality, noise, and low resolution, typically 100 dpi.
96 PAPERS • 4 BENCHMARKS
CARER is an emotion dataset collected through noisy labels, annotated via distant supervision as in (Go et al., 2009).
95 PAPERS • 7 BENCHMARKS
CLUE is a Chinese Language Understanding Evaluation benchmark. It consists of different NLU datasets. It is a community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.
95 PAPERS • 8 BENCHMARKS
HumanML3D is a 3D human motion-language dataset that originates from a combination of HumanAct12 and Amass dataset. It covers a broad range of human actions such as daily activities (e.g., 'walking', 'jumping'), sports (e.g., 'swimming', 'playing golf'), acrobatics (e.g., 'cartwheel') and artistry (e.g., 'dancing'). Overall, HumanML3D dataset consists of 14,616 motions and 44,970 descriptions composed by 5,371 distinct words. The total length of motions amounts to 28.59 hours. The average motion length is 7.1 seconds, while average description length is 12 words.
95 PAPERS • 2 BENCHMARKS
VATEX is multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has two tasks for video-and-language research: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.
95 PAPERS • 3 BENCHMARKS
Camouflaged Object (CAMO) dataset specifically designed for the task of camouflaged object segmentation. We focus on two categories, i.e., naturally camouflaged objects and artificially camouflaged objects, which usually correspond to animals and humans in the real world, respectively. Camouflaged object images consists of 1250 images (1000 images for the training set and 250 images for the testing set). Non-camouflaged object images are collected from the MS-COCO dataset (1000 images for the training set and 250 images for the testing set). CAMO has objectness mask ground-truth.
92 PAPERS • 2 BENCHMARKS
GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.
92 PAPERS • 1 BENCHMARK
Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg.
90 PAPERS • 2 BENCHMARKS
The ConvAI2 NeurIPS competition aimed at finding approaches to creating high-quality dialogue agents capable of meaningful open domain conversation. The ConvAI2 dataset for training models is based on the PERSONA-CHAT dataset. The speaker pairs each have assigned profiles coming from a set of 1155 possible personas (at training time), each consisting of at least 5 profile sentences, setting aside 100 never seen before personas for validation. As the original PERSONA-CHAT test set was released, a new hidden test set consisted of 100 new personas and over 1,015 dialogs was created by crowdsourced workers.
89 PAPERS • 1 BENCHMARK
CosmosQA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people’s everyday narratives, asking questions concerning on the likely causes or effects of events that require reasoning beyond the exact text spans in the context.
88 PAPERS • NO BENCHMARKS YET
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
88 PAPERS • 1 BENCHMARK
ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of over 10,000 conversations centered around the theme of providing movie recommendations.
88 PAPERS • 2 BENCHMARKS
The ActivityNet-QA dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning.
85 PAPERS • 2 BENCHMARKS
ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.
84 PAPERS • NO BENCHMARKS YET
WikiMatrix is a dataset of parallel sentences in the textual content of Wikipedia for all possible language pairs. The mined data consists of:
The AI2’s Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset, containing questions from science exams from grade 3 to grade 9. The dataset is split in two partitions: Easy and Challenge, where the latter partition contains the more difficult questions that require reasoning. Most of the questions have 4 answer choices, with <1% of all the questions having either 3 or 5 answer choices. ARC includes a supporting KB of 14.3M unstructured text passages.
83 PAPERS • 3 BENCHMARKS
MAWPS is an online repository of Math Word Problems, to provide a unified testbed to evaluate different algorithms. MAWPS allows for the automatic construction of datasets with particular characteristics, providing tools for tuning the lexical and template overlap of a dataset as well as for filtering ungrammatical problems from web-sourced corpora. The online nature of this repository facilitates easy community contribution. Amassed 3,320 problems, including the full datasets used in several previous works.
83 PAPERS • 2 BENCHMARKS
UAVDT is a large scale challenging UAV Detection and Tracking benchmark (i.e., about 80, 000 representative frames from 10 hours raw videos) for 3 important fundamental tasks, i.e., object DETection (DET), Single Object Tracking (SOT) and Multiple Object Tracking (MOT).
80 PAPERS • 1 BENCHMARK
The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which contains images of complex everyday scenes. The COCO-Text dataset contains non-text images, legible text images and illegible text images. In total there are 22184 training images and 7026 validation images with at least one instance of legible text.
79 PAPERS • 2 BENCHMARKS
KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for testing.
79 PAPERS • 3 BENCHMARKS
The MovieQA dataset is a dataset for movie question answering. to evaluate automatic story comprehension from both video and text. The data set consists of almost 15,000 multiple choice question answers obtained from over 400 movies and features high semantic diversity. Each question comes with a set of five highly plausible answers; only one of which is correct. The questions can be answered using multiple sources of information: movie clips, plots, subtitles, and for a subset scripts and DVS.
79 PAPERS • 1 BENCHMARK
A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets.
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background scene. It has an extension called WHAMR! that adds artificial reverberation to the speech signals in addition to the background noise.
79 PAPERS • 5 BENCHMARKS
VQG is a collection of datasets for visual question generation. VQG questions were collected by crowdsourcing the task on Amazon Mechanical Turk (AMT). The authors provided details on the prompt and the specific instructions for all the crowdsourcing tasks in this paper in the supplementary material. The prompt was successful at capturing nonliteral questions. Images were taken from the MSCOCO dataset.
77 PAPERS • 1 BENCHMARK
NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset can be used for semantic parsing.
76 PAPERS • 3 BENCHMARKS
This dataset gathers 728,321 biographies from English Wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized).
76 PAPERS • 1 BENCHMARK
CoNLL-2014 will continue the CoNLL tradition of having a high profile shared task in natural language processing. This year's shared task will be grammatical error correction, a continuation of the CoNLL shared task in 2013. A participating system in this shared task is given short English texts written by non-native speakers of English. The system detects the grammatical errors present in the input texts, and returns the corrected essays. The shared task in 2014 will require a participating system to correct all errors present in an essay (i.e., not restricted to just five error types in 2013). Also, the evaluation metric will be changed to F0.5, weighting precision twice as much as recall.
75 PAPERS • 2 BENCHMARKS
Our task is to localize and provide a pixel-level mask of an object on all video frames given a language referring expression obtained either by looking at the first frame only or the full video. To validate our approach we employ two popular video object segmentation datasets, DAVIS16 [38] and DAVIS17 [42]. These two datasets introduce various challenges, containing videos with single or multiple salient objects, crowded scenes, similar looking instances, occlusions, camera view changes, fast motion, etc.
75 PAPERS • 5 BENCHMARKS
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal and temporal actions and understand the rich object interactions in daily activities. It supports both multi-choice and open-ended QA tasks. The videos are untrimmed and the questions usually invoke local video contents for answers.
74 PAPERS • 3 BENCHMARKS
GuessWhat?! is a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images.
73 PAPERS • NO BENCHMARKS YET
The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2022 development (dev), and 2361 test utterances. It has subtitles in English and crowdsourced Portuguese translations.
73 PAPERS • 2 BENCHMARKS
The ECB+ corpus is an extension to the EventCorefBank (ECB, Bejan and Harabagiu, 2010). A newly added corpus component consists of 502 documents that belong to the 43 topics of the ECB but that describe different seminal events than those already captured in the ECB. All corpus texts were found through Google Search and were annotated with mentions of events and their times, locations, human and non-human participants as well as with within- and cross-document event and entity coreference information. The 2012 version of annotation of the ECB corpus (Lee et al., 2012) was used as a starting point for re-annotation of the ECB according to the ECB+ annotation guideline.
72 PAPERS • 2 BENCHMARKS
ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.
72 PAPERS • NO BENCHMARKS YET
Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).
71 PAPERS • 3 BENCHMARKS
TweetEval introduces an evaluation framework consisting of seven heterogeneous Twitter-specific classification tasks.
71 PAPERS • 2 BENCHMARKS
LogiQA consists of 8,678 QA instances, covering multiple types of deductive reasoning. Results show that state-of-the-art neural models perform by far worse than human ceiling. The dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
70 PAPERS • 1 BENCHMARK