This dataset consists of challenging reasoning questions in multiple-choice format.
1 PAPER • NO BENCHMARKS YET
The OCW dataset evaluates creative problem solving by curating problems and human performance results from the popular British quiz show Only Connect.
10 PAPERS • 1 BENCHMARK
Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
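For illustration, a single Belebele-style record and its exact-match scoring can be sketched as follows; the field names are assumptions made for the example, not the dataset's official schema.

```python
# Minimal sketch of one Belebele-style record and exact-match accuracy.
# Field names here are illustrative assumptions, not the official schema.
record = {
    "passage": "A short passage drawn from FLORES-200 ...",
    "question": "According to the passage, what happened first?",
    "options": ["option A", "option B", "option C", "option D"],
    "answer_index": 1,  # index of the correct option (0-3)
}

def accuracy(predictions, records):
    """Fraction of records whose predicted option index matches the gold index."""
    hits = sum(p == r["answer_index"] for p, r in zip(predictions, records))
    return hits / len(records)

print(accuracy([1], [record]))  # 1.0
```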
17 PAPERS • NO BENCHMARKS YET
GeoGLUE is a GeoGraphic Language Understanding Evaluation benchmark consisting of six geographic text-related tasks, including geographic textual similarity on recall, geotagged geographic elements tagging, geographic composition analysis, geographic where what cut, and geographic entity alignment. All task datasets are collected from openly released resources.
2 PAPERS • NO BENCHMARKS YET
ChatLog is a coarse-to-fine temporal dataset consisting of two parts, one updated monthly and one updated daily.
We propose a test to measure the multitask accuracy of large Chinese language models. We constructed a large-scale, multi-task test consisting of single and multiple-choice questions from various branches of knowledge. The test encompasses the fields of medicine, law, psychology, and education, with medicine divided into 15 sub-tasks and education into 8 sub-tasks. The questions in the dataset were manually collected by professionals from freely available online resources, including university medical examinations, national unified legal professional qualification examinations, psychological counselor exams, graduate entrance examinations for psychology majors, and the Chinese National College Entrance Examination. In total, we collected 11,900 questions, which we divided into a few-shot development set and a test set. The few-shot development set contains 5 questions per topic, amounting to 55 questions in total. The test set comprises 11,845 questions.
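As a sketch of how the 5-question per-topic development set could drive few-shot evaluation, the snippet below assembles a 5-shot prompt for one topic; the record fields (question, choices, answer) and the prompt format are hypothetical and only illustrate the idea.

```python
# Hedged sketch: build a 5-shot prompt from a topic's development questions.
# The record fields ("question", "choices", "answer") are assumptions for illustration.
def build_five_shot_prompt(dev_examples, test_question, test_choices):
    def fmt(question, choices, answer=None):
        opts = " ".join(f"({k}) {v}" for k, v in zip("ABCDE", choices))
        tail = f" {answer}" if answer is not None else ""
        return f"Question: {question}\nOptions: {opts}\nAnswer:{tail}"

    # Five in-context examples (the development set holds 5 questions per topic),
    # followed by the unanswered test question.
    shots = [fmt(ex["question"], ex["choices"], ex["answer"]) for ex in dev_examples[:5]]
    return "\n\n".join(shots + [fmt(test_question, test_choices)])
```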
9 PAPERS • NO BENCHMARKS YET
NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, the authors have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments.
IDK-MRC is an Indonesian Machine Reading Comprehension (MRC) dataset consisting of more than 10K questions in total, including over 5K unanswerable questions with diverse question types.
ExPUNations is a humor dataset with extensive, fine-grained annotations specifically for puns. The dataset is designed for two new tasks: explanation generation to aid with pun classification, and keyword-conditioned pun generation.
Contains 1,507 domain-expert-annotated consumer health questions and corresponding summaries. The dataset is derived from a community question-answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.
3 PAPERS • NO BENCHMARKS YET
Phrase in Context is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR), and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1,000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC BY-NC 4.0.
RuMedBench is a benchmark dataset for Russian medical language understanding.
nlu++ is a dataset for natural language understanding (NLU) in task-oriented dialogue (ToD) systems, aiming to provide a much more challenging evaluation environment for dialogue NLU models, up to date with current application and industry requirements. nlu++ is divided into two domains (banking and hotels) and brings several crucial improvements over commonly used NLU datasets. 1) nlu++ provides fine-grained domain ontologies with a large set of challenging multi-intent sentences, introducing and validating the idea of intent modules that can be combined into complex intents conveying complex user goals, together with finer-grained and thus more challenging slot sets. 2) The ontology is divided into domain-specific and generic (i.e., domain-universal) intent modules that overlap across domains, promoting cross-domain reusability of annotated examples. 3) The dataset design has been inspired by the problems observed in industrial ToD systems, and 4) it has been collected, filtered, and carefully annotated by dialogue NLU experts.
6 PAPERS • NO BENCHMARKS YET
The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million-word, 850-hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speakers' post-conversation reflections.
MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types including translation, summarization, question answering, and classification. Additionally, there is a range of output lengths, from a single-word classification label all the way up to an output longer than the input text.
3 PAPERS • 6 BENCHMARKS
We release various types of word embeddings for multiple Indian languages. Note that for the majority of our work, the corpora were transliterated to the Devanagari script. We provide word embedding models trained using FastText and ELMo, as well as cross-lingual models based on an orthogonal alignment of the monolingual models for all pairs of these languages.
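The orthogonal alignment of two monolingual embedding spaces can be illustrated with the classical Procrustes solution. The sketch below assumes a seed dictionary of matched word vectors (X in the source language, Y in the target language); it illustrates the technique rather than reproducing the exact recipe used for these embeddings.

```python
import numpy as np

def orthogonal_align(X, Y):
    """Find the orthogonal matrix W minimizing ||X @ W - Y||_F (Procrustes solution).

    X, Y: (n, d) arrays of embeddings for n translation pairs
    (source-language vectors in X, target-language vectors in Y).
    """
    # SVD of the cross-covariance gives the closed-form orthogonal solution.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage with random vectors standing in for seed-dictionary embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
W_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # a random "true" rotation
Y = X @ W_true
W = orthogonal_align(X, Y)
print(np.allclose(X @ W, Y))  # True: alignment recovers the rotation
```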
The dataset contains 3,304 cases from the Supreme Court of the United States from 1955 to 2021. Each case has the case's identifiers as well as the facts of the case and the decision outcome. Other related datasets rarely include the facts of the case, which can be helpful in natural language processing applications. One potential use case of this dataset is determining the outcome of a case from its facts.
Moral Stories is a crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations, and their respective consequences.
19 PAPERS • NO BENCHMARKS YET
CLUES (Constrained Language Understanding Evaluation Standard) is a benchmark for evaluating the few-shot learning capabilities of NLU models.
Legal General Language Understanding Evaluation (LexGLUE) benchmark is a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way.
35 PAPERS • 1 BENCHMARK
The FewGLUE_64_labeled dataset is a new version of the FewGLUE dataset. It contains a 64-sample training set, a development set (the original SuperGLUE development set), a test set, and an unlabeled set. It is constructed to facilitate research on few-shot learning for natural language understanding tasks.
XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.
3 PAPERS • 1 BENCHMARK
Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
7 PAPERS • 3 BENCHMARKS
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.
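A common way to run such zero-/few-shot multiple-choice evaluation is to pick the answer letter the model assigns the highest probability to. The sketch below assumes a hypothetical `option_logprob` scorer standing in for whatever model is being evaluated; it is illustrative, not the benchmark's official harness.

```python
# Minimal sketch of zero-/few-shot multiple-choice scoring in the MMLU style.
# `option_logprob(prompt, continuation)` is a hypothetical stand-in for a model's
# log-probability of a continuation; any evaluated model would supply its own version.
def predict(question, choices, option_logprob):
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    ) + "\nAnswer:"
    letters = "ABCD"[: len(choices)]
    scores = {letter: option_logprob(prompt, " " + letter) for letter in letters}
    return max(scores, key=scores.get)

# Toy model that always prefers "B", just to show the call pattern.
toy = lambda prompt, cont: {" A": -2.0, " B": -0.5, " C": -3.0, " D": -4.0}[cont]
print(predict("2 + 2 = ?", ["3", "4", "5", "22"], toy))  # "B"
```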
694 PAPERS • 25 BENCHMARKS
ImagiFilter focuses on photographic and/or natural images, a very common use case in computer vision research. Annotations are provided for a coarse prediction task (photographic vs. non-photographic) and for smaller fine-grained prediction tasks in which the non-photographic class is broken down into five classes: maps, drawings, graphs, icons, and sketches.
EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera, and Swine Flu, and another subset of 4.7 million tweets covering six global epidemic outbreaks, including the 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera, and 2018 Kivu Ebola.
Emotional Dialogue Acts data contains dialogue act labels for existing multimodal emotion conversational datasets. We chose two popular multimodal emotion datasets: the Multimodal EmotionLines Dataset (MELD) and the Interactive Emotional dyadic MOtion CAPture database (IEMOCAP). EDAs reveal associations between dialogue acts and emotional states in natural conversational language; for example, Accept/Agree dialogue acts often occur with the Joy emotion, Apology with Sadness, and Thanking with Joy.
The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for Polish language understanding.
11 PAPERS • NO BENCHMARKS YET
XGLUE is an evaluation benchmark composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have strong zero-shot cross-lingual transfer capability: it must learn from the English data of a specific task and transfer what it learned to other languages. Compared to its concurrent work XTREME, XGLUE has two characteristics: first, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time; second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversity of languages, tasks, and task origins provides a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural language understanding and generation.
20 PAPERS • 2 BENCHMARKS
CrossWOZ is the first large-scale Chinese cross-domain Wizard-of-Oz task-oriented dataset. It contains 6K dialogue sessions and 102K utterances across 5 domains: hotel, restaurant, attraction, metro, and taxi. Moreover, the corpus contains rich annotations of dialogue states and dialogue acts on both the user and system sides.
25 PAPERS • NO BENCHMARKS YET
JEC-QA is a legal question answering (LQA) dataset collected from the National Judicial Examination of China. It contains 26,365 multiple-choice and multiple-answer questions in total. The task is to predict the answer using the questions and relevant articles. To do well on JEC-QA, models must be good at both retrieval and answering.
Taskmaster-1 is a dialog dataset consisting of 13,215 task-based dialogs in English, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.
18 PAPERS • NO BENCHMARKS YET
The VisPro dataset contains coreference annotations for 29,722 pronouns from 5,000 dialogues.
An unsupervised dataset for coreference resolution. Presented in the publication: Kocijan et al., WikiCREM: A Large Unsupervised Corpus for Coreference Resolution, EMNLP 2019.
CosmosQA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people’s everyday narratives, asking questions about the likely causes or effects of events that require reasoning beyond the exact text spans in the context.
88 PAPERS • NO BENCHMARKS YET
General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI, RTE and WNLI.
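For illustration, individual GLUE tasks are commonly accessed through the Hugging Face `datasets` library under the "glue" dataset name with a per-task configuration; the snippet below is a minimal sketch assuming that hub copy of the benchmark.

```python
# One way to pull single GLUE tasks, assuming the Hugging Face hub copy
# of the benchmark ("glue" with a per-task configuration name).
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # single-sentence sentiment task
mnli = load_dataset("glue", "mnli")   # natural language inference task

print(sst2["train"][0])               # one labeled training example
print(mnli["train"].features)         # field names and label schema
```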
2,708 PAPERS • 25 BENCHMARKS
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.
23 PAPERS • 1 BENCHMARK
MCScript is used as the official dataset of SemEval-2018 Task 11. The dataset comprises a collection of text passages about daily life activities and a series of questions referring to each passage, with each question equipped with two answer choices. MCScript contains 9,731, 1,411, and 2,797 questions in the training, development, and test sets, respectively.
24 PAPERS • NO BENCHMARKS YET
The Multi-domain Wizard-of-Oz (MultiWOZ) dataset is a large-scale human-human conversational corpus spanning seven domains, containing 8,438 multi-turn dialogues, with each dialogue averaging 14 turns. Unlike existing standard datasets such as WOZ and DSTC2, which contain fewer than 10 slots and only a few hundred values, MultiWOZ has 30 (domain, slot) pairs and over 4,500 possible values. The dialogues span seven domains: restaurant, hotel, attraction, taxi, train, hospital, and police.
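For illustration, a turn-level belief state in this setting can be viewed as a mapping from (domain, slot) pairs to values; the particular slots and values below are invented for the example, and the strict-match check is just one common way such states are compared.

```python
# Illustrative (domain, slot) -> value belief state for a single MultiWOZ-style turn.
# The specific slots and values are made up for the example.
belief_state = {
    ("restaurant", "food"): "italian",
    ("restaurant", "pricerange"): "moderate",
    ("restaurant", "area"): "centre",
    ("hotel", "parking"): "yes",
}

def joint_goal_match(predicted, gold):
    """Joint goal match for one turn: every (domain, slot) value must be identical."""
    return predicted == gold

print(joint_goal_match(belief_state, dict(belief_state)))  # True
```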
316 PAPERS • 11 BENCHMARKS
The Dialog-based Language Learning dataset is designed to measure how well models can learn as a student given a teacher’s textual responses to the student’s answers (as well as potentially receiving an external real-valued reward signal).
STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web reviews portion of the English Web Treebank [9]. STREUSLE incorporates comprehensive annotations of multiword expressions (MWEs) [1] and semantic supersenses for lexical expressions. The supersense labels apply to single- and multiword noun and verb expressions, as described in [2], and to prepositional/possessive expressions, as described in [3, 4, 5, 6, 7, 8]. Lexical expressions also carry a lexical category label indicating the expression's holistic grammatical status; for verbal multiword expressions, these labels incorporate categories from the PARSEME 1.1 guidelines [15]. For each token, these pieces of information are concatenated into a lextag: a sentence's words and their lextags are sufficient to recover the lexical categories, supersenses, and multiword expressions [8].
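A hedged sketch of pulling a lextag apart is shown below; it assumes hyphen-separated fields in the order (MWE position, lexical category, supersense), and the example tags are illustrative rather than drawn from the corpus, so the STREUSLE documentation remains the authoritative reference for the format.

```python
# Hedged sketch: split a lextag into its assumed parts.
# Assumes hyphen-separated fields (MWE position, lexical category, supersense);
# the example strings are illustrative, not official corpus tags.
def split_lextag(lextag):
    parts = lextag.split("-", 2)
    mwe_position = parts[0]
    lexcat = parts[1] if len(parts) > 1 else None
    supersense = parts[2] if len(parts) > 2 else None
    return mwe_position, lexcat, supersense

print(split_lextag("O-V-v.cognition"))  # ('O', 'V', 'v.cognition')
print(split_lextag("O-PRON"))           # ('O', 'PRON', None)
```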
21 PAPERS • 1 BENCHMARK
Algebra Question Answering with Rationales (AQUA-RAT) is a dataset of about 100,000 algebraic word problems with natural language rationales. Each problem is a JSON object consisting of four parts:
* question - a natural language definition of the problem to solve
* options - 5 possible options (A, B, C, D and E), among which one is correct
* rationale - a natural language description of the solution to the problem
* correct - the correct option
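A record in this format might look like the following; the fields match the list above, while the question text and numbers are invented for illustration.

```python
import json

# Example AQUA-RAT-style record with the four fields described above;
# the actual question, numbers, and rationale are invented for illustration.
problem = json.loads("""{
  "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
  "options": ["A) 30 km/h", "B) 40 km/h", "C) 45 km/h", "D) 50 km/h", "E) 60 km/h"],
  "rationale": "Average speed = distance / time = 60 / 1.5 = 40 km/h. Answer: B",
  "correct": "B"
}""")

assert problem["correct"] in "ABCDE"
print(problem["options"][ord(problem["correct"]) - ord("A")])  # "B) 40 km/h"
```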
39 PAPERS • NO BENCHMARKS YET
A comprehensive multi-task benchmark for Polish language understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks adopted from existing datasets for named entity recognition, question answering, textual entailment, and others.
A benchmark Arabic dataset for commonsense understanding and validation, together with baseline research and models trained on the same dataset.
A new publicly available dataset for verification of climate change-related claims.
30 PAPERS • 1 BENCHMARK
CLUE is a Chinese Language Understanding Evaluation benchmark. It consists of different NLU datasets. It is a community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.
95 PAPERS • 8 BENCHMARKS
Comprises sentence pairs (i.e., twin sentences).
7 PAPERS • NO BENCHMARKS YET
DialoGLUE is a natural language understanding benchmark for task-oriented dialogue, designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning. It consists of 7 task-oriented dialogue datasets covering 4 distinct natural language understanding tasks.
16 PAPERS • 2 BENCHMARKS