DuoRC contains 186,089 unique question-answer pairs created from a collection of 7680 pairs of movie plots where each pair in the collection reflects two versions of the same movie.
42 PAPERS • 1 BENCHMARK
EmoContext consists of three-turn English Tweets. The emotion labels include happiness, sadness, anger and other.
EmotionLines contains a total of 29245 labeled utterances from 2000 dialogues. Each utterance in dialogues is labeled with one of seven emotions, six Ekman’s basic emotions plus the neutral emotion. Each labeling was accomplished by 5 workers, and for each utterance in a label, the emotion category with the highest votes was set as the label of the utterance. Those utterances voted as more than two different emotions were put into the non-neutral category. Therefore the dataset has a total of 8 types of emotion labels, anger, disgust, fear, happiness, sadness, surprise, neutral, and non-neutral.
English Web Treebank is a dataset containing 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers were collected and annotated.
42 PAPERS • NO BENCHMARKS YET
EntailmentBank is a dataset that contains multistep entailment trees. At each node in the tree (typically) two or more facts compose together to produce a new conclusion. Given a hypothesis (question + answer), three increasingly difficult explanation tasks are defined: generate a valid entailment tree given (a) all relevant sentences (the leaves of the gold entailment tree) (b) all relevant and some irrelevant sentences (c) a corpus.
Generation, Evaluation, and Metrics (GEM) is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.
C3 is a free-form multiple-Choice Chinese machine reading Comprehension dataset.
41 PAPERS • 3 BENCHMARKS
DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set.
The Implicit Hate corpus is a dataset for hate speech detection with fine-grained labels for each message and its implication. This dataset contains 22,056 tweets from the most prominent extremist groups in the United States; 6,346 of these tweets contain implicit hate speech.
41 PAPERS • NO BENCHMARKS YET
MultiCoNER is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions.
QA-SRL was proposed as an open schema for semantic roles, in which the relation between an argument and a predicate is expressed as a natural-language question containing the predicate (“Where was someone educated?”) whose answer is the argument (“Princeton”). The authors collected about 19,000 question-answer pairs from 3,200 sentences.
The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text instance images, including 3,530 with curved text. The images are manually harvested from the Internet, image libraries such as Google Open-Image, or phone cameras. The dataset contains a lot of horizontal and multi-oriented text.
ShARC is a Conversational Question Answering dataset focussing on question answering from texts containing rules.
The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation of two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption is correctly describing the image (True) or not (False).
41 PAPERS • 1 BENCHMARK
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
A large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community.
40 PAPERS • 5 BENCHMARKS
OCNLI stands for Original Chinese Natural Language Inference. It is corpus for Chinese Natural Language Inference, collected following closely the procedures of MNLI, but with enhanced strategies aiming for more challenging inference pairs. No human/machine translation is used in creating the dataset, and thus the Chinese texts are original and not translated.
40 PAPERS • 3 BENCHMARKS
SciDocs evaluation framework consists of a suite of evaluation tasks designed for document-level tasks.
Algebra Question Answering with Rationales (AQUA-RAT) is a dataset that contains algebraic word problems with rationales. The dataset consists of about 100,000 algebraic word problems with natural language rationales. Each problem is a json object consisting of four parts: * question - A natural language definition of the problem to solve * options - 5 possible options (A, B, C, D and E), among which one is correct * rationale - A natural language description of the solution to the problem * correct - The correct option
39 PAPERS • NO BENCHMARKS YET
WikiMovies is a dataset for question answering for movies content. It contains ~100k questions in the movie domain, and was designed to be answerable by using either a perfect KB (based on OMDb),
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.
38 PAPERS • 2 BENCHMARKS
FigureQA is a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.
38 PAPERS • 1 BENCHMARK
InsuranceQA is a question answering dataset for the insurance domain, the data stemming from the website Insurance Library. There are 12,889 questions and 21,325 answers in the training set. There are 2,000 questions and 3,354 answers in the validation set. There are 2,000 questions and 3,308 answers in the test set.
38 PAPERS • NO BENCHMARKS YET
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. The same 250 problems from GSM8K are each translated via human annotators in 10 languages. GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
ToolBench is an instruction-tuning dataset for tool use, which is created automatically using ChatGPT. Specifically, the authors collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatgPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios.
BEAT has i) 76 hours, high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, ii) 32 millions frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with \textit{facial expressions}, \textit{emotions}, and \textit{semantics}, in addition to the known correlation with \textit{audio}, \textit{text}, and \textit{speaker identity}. Based on this observation, we propose a baseline model, \textbf{Ca}scaded \textbf{M}otion \textbf{N}etwork \textbf{(CaMN)}, which consists of above six modalities modeled in a cascaded architecture for gesture synthesis. To evaluate the semantic relevancy, we introduce a metric, Semantic Relevance Gesture Recall (\textbf{SRGR}). Qualitative and quantitative experiments demonstrate metrics' validness, ground truth data quality, and baseline's state-of-the-art performance. To the best of our knowledge,
37 PAPERS • 1 BENCHMARK
The BUCC mining task is a shared task on parallel sentence extraction from two monolingual corpora with a subset of them assumed to be parallel, and that has been available since 2016. For each language pair, the shared task provides a monolingual corpus for each language and a gold mapping list containing true translation pairs. These pairs are the ground truth. The task is to construct a list of translation pairs from the monolingual corpora. The constructed list is compared to the ground truth, and evaluated in terms of the F1 measure.
37 PAPERS • 4 BENCHMARKS
DeepFix consists of a program repair dataset (fix compiler errors in C programs). It enables research around automatically fixing programming errors using deep learning.
We release E-commerce Dialogue Corpus, comprising a training data set, a development set and a test set for retrieval based chatbot. The statistics of E-commerical Conversation Corpus are shown in the following table.
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
37 PAPERS • NO BENCHMARKS YET
The Memetracker corpus contains articles from mainstream media and blogs from August 1 to October 31, 2008 with about 1 million documents per day. It has 10,967 hyperlink cascades among 600 media sites.
RadGraph is a dataset of entities and relations in radiology reports based on our novel information extraction schema, consisting of 600 reports with 30K radiologist annotations and 221K reports with 10.5M automatically generated annotations.
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase “tatoeba” (例えば), meaning “for example”. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans.
37 PAPERS • 26 BENCHMARKS
Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation. Break contains human composed questions, sampled from 10 leading question-answering benchmarks over text, images and databases. This dataset was created by a team of NLP researchers at Tel Aviv University and Allen Institute for AI.
36 PAPERS • NO BENCHMARKS YET
BillSum is the first dataset for summarization of US Congressional and California state bills.
36 PAPERS • 2 BENCHMARKS
CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous SLT dataset. It provides both spoken language translations and gloss-level annotations. The topic revolves around people's daily lives (e.g., travel, shopping, medical care), the most likely SLT application scenario.
36 PAPERS • 4 BENCHMARKS
MusicCaps is a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts. For each 10-second music clip, MusicCaps provides:
36 PAPERS • 1 BENCHMARK
Samanantar is the largest publicly available parallel corpora collection for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The corpus has 49.6M sentence pairs between English to Indian Languages.
The SemEval-2018 hypernym discovery evaluation benchmark (Camacho-Collados et al. 2018) contains three domains (general, medical and music) and is also available in Italian and Spanish (not in this repository). For each domain a target corpus and vocabulary (i.e. hypernym search space) are provided. The dataset contains both concepts (e.g. dog) and entities (e.g. Manchester United) up to trigrams.
36 PAPERS • 3 BENCHMARKS
WMT 2018 is a collection of datasets used in shared tasks of the Third Conference on Machine Translation. The conference builds on a series of twelve previous annual workshops and conferences on Statistical Machine Translation.
36 PAPERS • 6 BENCHMARKS
The BIOSSES data set comprises total 100 sentence pairs all of which were selected from the "TAC2 Biomedical Summarization Track Training Data Set" .
35 PAPERS • 3 BENCHMARKS
Contains around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in the dialogs require a larger subgraph of the KG.
35 PAPERS • NO BENCHMARKS YET
ECHR is an English legal judgment prediction dataset of cases from the European Court of Human Rights (ECHR). The dataset contains ~11.5k cases, including the raw text.
35 PAPERS • 1 BENCHMARK
FEVEROUS (Fact Extraction and VERification Over Unstructured and Structured information) is a fact verification dataset which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict.
Legal General Language Understanding Evaluation (LexGLUE) benchmark is a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way.
Dataset Summary
DialogRE is the first human-annotated dialogue-based relation extraction dataset, containing 1,788 dialogues originating from the complete transcripts of a famous American television situation comedy Friends. The are annotations for all occurrences of 36 possible relation types that exist between an argument pair in a dialogue. DialogRE is available in English and Chinese.
34 PAPERS • 1 BENCHMARK
For goal-oriented document-grounded dialogs, it often involves complex contexts for identifying the most relevant information, which requires better understanding of the inter-relations between conversations and documents. Meanwhile, many online user-oriented documents use both semi-structured and unstructured contents for guiding users to access information of different contexts. Thus, we create a new goal-oriented document-grounded dialogue dataset that captures more diverse scenarios derived from various document contents from multiple domains such ssa.gov and studentaid.gov. For data collection, we propose a novel pipeline approach for dialogue data construction, which has been adapted and evaluated for several domains.
34 PAPERS • NO BENCHMARKS YET