SuperGLUE is a benchmark dataset designed to pose a more rigorous test of language understanding than GLUE. SuperGLUE has the same high-level motivation as GLUE: to provide a simple, hard-to-game measure of progress toward general-purpose language understanding technologies for English. SuperGLUE follows the basic design of GLUE: it consists of a public leaderboard built around eight language understanding tasks, drawing on existing data, accompanied by a single-number performance metric and an analysis toolkit. However, it improves upon GLUE in several ways, including more challenging tasks, more diverse task formats, and comprehensive human baselines.
449 PAPERS • 16 BENCHMARKS
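As a rough illustration of how a leaderboard like SuperGLUE collapses per-task results into a single-number metric, the Python sketch below averages hypothetical scores over the eight tasks; the numbers are placeholders, not real leaderboard results, and the official aggregation additionally averages multiple metrics within some tasks.

    # Minimal sketch: collapse per-task scores into one benchmark number.
    # The scores below are made-up placeholders, not actual results.
    task_scores = {
        "BoolQ": 0.80,
        "CB": 0.85,
        "COPA": 0.78,
        "MultiRC": 0.72,
        "ReCoRD": 0.75,
        "RTE": 0.79,
        "WiC": 0.70,
        "WSC": 0.68,
    }

    # Unweighted mean across tasks, reported on a 0-100 scale.
    overall = 100 * sum(task_scores.values()) / len(task_scores)
    print(f"Overall score: {overall:.1f}")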
The Winograd Schema Challenge was introduced both as an alternative to the Turing Test and as a test of a system’s ability to do commonsense reasoning. A Winograd schema is a pair of sentences differing in one or two words with a highly ambiguous pronoun, resolved differently in the two sentences, that appears to require commonsense knowledge to be resolved correctly. The examples were designed to be easily solvable by humans but difficult for machines, in principle requiring a deep understanding of the content of the text and the situation it describes.
217 PAPERS • 4 BENCHMARKS
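A Winograd schema item maps naturally onto a small data structure. The sketch below is an illustrative container (not an official data format) holding the sentence pair, the ambiguous pronoun, the candidate referents, and the correct answer for each variant, using Terry Winograd's classic example.

    from dataclasses import dataclass

    @dataclass
    class WinogradSchema:
        """One schema: two near-identical sentences whose pronoun resolves differently."""
        sentence_a: str
        sentence_b: str
        pronoun: str
        candidates: tuple[str, str]
        answer_a: str  # correct referent of the pronoun in sentence_a
        answer_b: str  # correct referent of the pronoun in sentence_b

    schema = WinogradSchema(
        sentence_a="The city councilmen refused the demonstrators a permit because they feared violence.",
        sentence_b="The city councilmen refused the demonstrators a permit because they advocated violence.",
        pronoun="they",
        candidates=("the city councilmen", "the demonstrators"),
        answer_a="the city councilmen",
        answer_b="the demonstrators",
    )
    assert schema.answer_a != schema.answer_b  # changing one word flips the resolution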
CommonsenseQA is a dataset for the commonsense question answering task. It consists of 12,247 questions with 5 choices each. The questions were authored by Amazon Mechanical Turk workers using concepts and relations drawn from ConceptNet, so that each question mentions a source concept and discriminates among related target concepts.
190 PAPERS • 2 BENCHMARKS
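If CommonsenseQA is accessed through the Hugging Face datasets library, a minimal loading sketch looks like the following; the hub identifier "commonsense_qa" and the field names ("question", "choices", "answerKey") are assumptions based on the commonly distributed version rather than guarantees from the dataset authors.

    # Sketch: load CommonsenseQA and print one 5-way multiple-choice question.
    from datasets import load_dataset

    dataset = load_dataset("commonsense_qa")   # assumed hub identifier
    example = dataset["train"][0]

    print(example["question"])
    for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
        print(f"  {label}. {text}")
    print("Gold answer:", example["answerKey"])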
Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). SWAG (Situations With Adversarial Generations) is a large-scale dataset for this task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning.
116 PAPERS • 3 BENCHMARKS
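In practice the SWAG task reduces to scoring each candidate ending given the context and choosing the highest-scoring one. The sketch below shows that inference pattern with a deliberately trivial word-overlap heuristic in place of a trained model; both the scoring function and the example endings are illustrative.

    # Multiple-choice inference pattern for grounded ending prediction.
    # score_ending is a placeholder heuristic (word overlap), not a real model.
    def score_ending(context: str, ending: str) -> float:
        context_words = set(context.lower().split())
        ending_words = set(ending.lower().split())
        return len(context_words & ending_words) / max(len(ending_words), 1)

    context = "She opened the hood of the car."
    endings = [
        "Then, she examined the engine.",
        "A mechanic walked over to help.",
        "The dog barked at the mailman.",
        "It started raining heavily.",
    ]

    best = max(range(len(endings)), key=lambda i: score_ending(context, endings[i]))
    print("Predicted ending:", endings[best])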
The Visual Dialog (VisDial) dataset contains human-annotated questions based on images from the MS COCO dataset. It was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the role of ‘questioner’ and the other acted as ‘answerer’. The questioner sees only a text description of the image (an image caption from MS COCO); the original image remains hidden. The questioner's task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and the caption and answers the questions asked by the questioner. The two continue the conversation for at most 10 rounds of questions and answers.
114 PAPERS • 5 BENCHMARKS
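The VisDial dialog structure described above (a hidden image, its caption, and up to ten question-answer rounds) can be represented with a small container; the classes and field names below are illustrative rather than the official annotation format.

    from dataclasses import dataclass, field

    @dataclass
    class DialogRound:
        question: str   # asked by the questioner, who cannot see the image
        answer: str     # given by the answerer, who can see the image

    @dataclass
    class VisDialExample:
        image_id: int                       # MS COCO image identifier
        caption: str                        # the only visual information the questioner sees
        rounds: list[DialogRound] = field(default_factory=list)

        def add_round(self, question: str, answer: str) -> None:
            if len(self.rounds) >= 10:      # conversations run for at most 10 rounds
                raise ValueError("dialog already has 10 rounds")
            self.rounds.append(DialogRound(question, answer))

    dialog = VisDialExample(image_id=123456, caption="A man riding a wave on a surfboard.")
    dialog.add_round("What color is the surfboard?", "It looks white with a red stripe.")
    print(len(dialog.rounds), "round(s) recorded")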
The AI2’s Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset containing questions from science exams from grade 3 to grade 9. The dataset is split into two partitions, Easy and Challenge, where the latter contains the more difficult questions that require reasoning. Most of the questions have 4 answer choices, with <1% of all the questions having either 3 or 5 answer choices. ARC includes a supporting KB of 14.3M unstructured text passages.
107 PAPERS • 2 BENCHMARKS
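Because a small fraction of ARC questions have 3 or 5 answer choices rather than 4, the chance-level accuracy is close to, but not exactly, 25%. The short sketch below computes the expected random-guessing accuracy for a hypothetical mix of choice counts; the proportions are illustrative, not official dataset statistics.

    # Expected accuracy of uniform random guessing when questions have
    # different numbers of answer choices. Proportions are illustrative.
    choice_distribution = {
        3: 0.005,   # assumed share of 3-choice questions
        4: 0.990,   # the vast majority have 4 choices
        5: 0.005,   # assumed share of 5-choice questions
    }

    expected_accuracy = sum(share / k for k, share in choice_distribution.items())
    print(f"Chance-level accuracy: {expected_accuracy:.3%}")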
OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small “book” of 1,326 core science facts and the application of these facts to novel situations. For training, the dataset includes a mapping from each question to the core science fact it was designed to probe. Answering OpenBookQA questions requires additional broad common knowledge not contained in the book. The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. Additionally, the dataset includes a collection of 5,167 crowd-sourced common knowledge facts, and an expanded version of the train/dev/test questions where each question is associated with its originating core fact, a human accuracy score, a clarity score, and an anonymized crowd-worker ID.
98 PAPERS • 3 BENCHMARKS
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news article. The goal of ReCoRD is to evaluate a machine's ability to perform commonsense reasoning in reading comprehension. ReCoRD is pronounced as [ˈrɛkərd].
76 PAPERS • 2 BENCHMARKS
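A ReCoRD-style query is a cloze statement whose missing entity must be filled with a span from the passage. The sketch below shows the general pattern with a simplified exact-match check; the "@placeholder" marker, the candidate list, and the scoring are a pared-down illustration, not the official evaluation script.

    # Simplified cloze-style filling and exact-match check for a ReCoRD-like query.
    def fill(query: str, entity: str) -> str:
        return query.replace("@placeholder", entity)

    def exact_match(prediction: str, gold_answers: list[str]) -> bool:
        return prediction.strip().lower() in {g.strip().lower() for g in gold_answers}

    query = "@placeholder announced record profits for the third quarter."   # toy query
    candidates = ["Acme Corp", "the city council", "the weather service"]    # spans from the passage
    gold = ["Acme Corp"]

    prediction = candidates[0]   # a real system would rank the candidates with a model
    print(fill(query, prediction))
    print("Exact match:", exact_match(prediction, gold))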
CoS-E (Common Sense Explanations) consists of human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations.
31 PAPERS • NO BENCHMARKS YET
MCScript is the official dataset of SemEval-2018 Task 11. It is a collection of text passages about daily life activities, each accompanied by a series of questions, and each question is equipped with two answer choices. MCScript comprises 9,731, 1,411, and 2,797 questions in the training, development, and test sets, respectively.
22 PAPERS • NO BENCHMARKS YET
Useful for two applications: automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total).
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.
21 PAPERS • 1 BENCHMARK
A benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
20 PAPERS • 2 BENCHMARKS
Event2Mind is a corpus of 25,000 event phrases covering a diverse range of everyday events and situations.
19 PAPERS • 2 BENCHMARKS
The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. It contains 2,801 questions in total and uses 5-fold cross-validation for evaluation.
18 PAPERS • 2 BENCHMARKS
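Since CODAH is evaluated with 5-fold cross-validation rather than a fixed train/test split, evaluation typically cycles a model over five partitions of the 2,801 questions. The sketch below sets up the folds with scikit-learn's KFold; evaluate_fold is a placeholder standing in for actual training and scoring.

    # 5-fold cross-validation skeleton over a CODAH-sized question set.
    import numpy as np
    from sklearn.model_selection import KFold

    question_ids = np.arange(2801)          # stand-ins for the 2,801 questions
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)

    def evaluate_fold(train_idx, test_idx) -> float:
        # Placeholder: pretend the model guesses uniformly among 4 completions.
        return 0.25

    accuracies = [evaluate_fold(tr, te) for tr, te in kfold.split(question_ids)]
    print(f"Mean accuracy over 5 folds: {np.mean(accuracies):.3f}")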
Moral Stories is a crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations, and their respective consequences.
11 PAPERS • NO BENCHMARKS YET
Example question: “I have five fingers but I am not alive. What am I?” Answer: a glove.
9 PAPERS • 1 BENCHMARK
Rainbow is a multi-task benchmark for commonsense reasoning that brings together six existing QA datasets: aNLI, Cosmos QA, HellaSWAG, Physical IQa, Social IQa, and WinoGrande.
7 PAPERS • NO BENCHMARKS YET
CITE is a crowd-sourced resource for multimodal discourse: this resource characterises inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations.
6 PAPERS • 1 BENCHMARK
OMICS is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨ “feel thirsty”, “by offering drink” ⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨ “serve a drink”, 0. “get a glass”, 1. “get a bottle”, 2. “fill glass from bottle”, 3. “give glass to person” ⟩). Given this, OMICS offers useful knowledge about the hierarchical structure of naturalistic instructions, where a high-level user request (e.g., “serve a drink”) can be reduced to lower-level tasks (e.g., “get a glass”, ⋯). Another feature of OMICS is that the elements of any tuple in an OMICS table are semantically related according to a predefined template, which facilitates the semantic interpretation of OMICS tuples.
6 PAPERS • NO BENCHMARKS YET
A testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities with commonsense inferences.
5 PAPERS • NO BENCHMARKS YET
Contains 13.6k masked-word-prediction probes, 10.5k for fine-tuning and 3.1k for testing.
The Choice of Plausible Alternatives for Russian (PARus) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. Each question in PARus is composed of a premise and two alternatives; the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized so that the expected performance of random guessing is 50%.
5 PAPERS • 1 BENCHMARK
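Because PARus randomizes the position of the correct alternative, a model that guesses blindly between the two alternatives scores 50% in expectation. The short simulation below verifies that baseline; it uses coin flips only and no actual PARus data.

    # Sanity check: random guessing over two alternatives converges to ~50% accuracy.
    import random

    random.seed(0)
    trials = 100_000
    correct = sum(random.randint(0, 1) == random.randint(0, 1) for _ in range(trials))
    print(f"Random-guess accuracy: {correct / trials:.3f}")   # close to 0.5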
ProtoQA is a question answering dataset for training and evaluating the commonsense reasoning capabilities of artificial intelligence systems in prototypical situations. The training set is gathered from an existing set of questions played on the long-running international game show Family Feud. The hidden evaluation set is created by gathering answers to each question from 100 crowd-workers.
A Winograd schema is a pair of sentences that differ in only one or two words and contain an ambiguity that is resolved in opposite ways in the two sentences, requiring the use of world knowledge and reasoning for its resolution. The schema takes its name from a well-known example by Terry Winograd.
CC-Stories (or STORIES) is a dataset for commonsense reasoning and language modeling. It was constructed by aggregating documents from the CommonCrawl dataset that have the most overlapping n-grams with the questions in commonsense reasoning tasks. The top 1.0% of the highest-ranked documents are chosen as the new training corpus.
4 PAPERS • NO BENCHMARKS YET
Current visual question answering (VQA) tasks mainly consider answering human-annotated questions about natural images in a daily-life context. Icon question answering (IconQA) is a benchmark that aims to highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world diagram word problems. For this benchmark, a large-scale IconQA dataset was built, consisting of three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. Compared to existing VQA benchmarks, IconQA requires not only perception skills such as object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning.
4 PAPERS • 1 BENCHMARK
Russian reading comprehension with Commonsense reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news article. The goal of RuCoS is to evaluate a machine's ability to perform commonsense reasoning in reading comprehension.
The Sarcasm Corpus contains sarcastic and non-sarcastic utterances of three different types, balanced so that half of the samples are sarcastic and half non-sarcastic.
The BCOPA-CE test set has a balanced token distribution across the correct and wrong alternatives, which increases the difficulty of distinguishing cause from effect.
3 PAPERS • NO BENCHMARKS YET
Complementary Commonsense (Com2Sense) is a dataset for benchmarking the commonsense reasoning ability of NLP models. It contains about 4k complementary true/false statement pairs. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, the dataset is designed along the dimensions of knowledge domains, reasoning scenarios, and numeracy.
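For a dataset built from complementary statement pairs, a natural complement to standard accuracy is pairwise accuracy, which credits a model only when both statements of a pair are labelled correctly. The sketch below computes both metrics over toy predictions; the data layout is illustrative, not the official Com2Sense format.

    # Standard vs. pairwise accuracy over complementary true/false statement pairs.
    # Each pair holds two (prediction, gold) tuples of boolean labels.
    pairs = [
        [(True, True), (False, False)],   # both statements correct -> pair counts
        [(True, True), (True, False)],    # one statement wrong -> pair does not count
        [(False, True), (False, False)],  # one statement wrong -> pair does not count
    ]

    flat = [pred == gold for pair in pairs for pred, gold in pair]
    standard_accuracy = sum(flat) / len(flat)

    pairwise_hits = [all(pred == gold for pred, gold in pair) for pair in pairs]
    pairwise_accuracy = sum(pairwise_hits) / len(pairs)

    print(f"Standard accuracy: {standard_accuracy:.2f}")   # 4 of 6 statements -> 0.67
    print(f"Pairwise accuracy: {pairwise_accuracy:.2f}")   # 1 of 3 pairs -> 0.33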
TimeDial is a crowdsourced English challenge set for temporal commonsense reasoning, formulated as a multiple-choice cloze task with around 1.5k carefully curated dialogs. The dataset is derived from DailyDialog, a multi-turn dialog corpus.
Fig-QA consists of 10,256 examples of human-written creative metaphors that are paired as Winograd schemas. It can be used to evaluate the commonsense reasoning of models. The metaphors themselves can also be used as training data for other tasks, such as metaphor detection or generation.
2 PAPERS • NO BENCHMARKS YET
Consists of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG). These visual arithmetic problems take the form of geometric figures: each problem has a set of geometric shapes as its context and embedded number symbols.
A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts
2 PAPERS • 1 BENCHMARK
This dataset was collected via the WinoGAViL game, which gathers challenging vision-and-language associations. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.
2 PAPERS • 2 BENCHMARKS
X-CSQA is a multilingual dataset for commonsense reasoning research, based on CSQA.
The Advice-Seeking Questions (ASQ) dataset is a collection of personal narratives with advice-seeking questions. The dataset is split into train, test, and heldout sets with 8,865, 2,500, and 10,000 instances, respectively. It is used to train and evaluate methods that infer the advice-seeking goal behind a personal narrative. The task is formulated as a cloze test, where the goal is to identify which of two advice-seeking questions was removed from a given narrative.
1 PAPER • NO BENCHMARKS YET
DiscoSense is a benchmark for commonsense reasoning via understanding a wide variety of discourse connectives. It is sourced from two peer-reviewed datasets, DISCOVERY and DISCOFUSE, which contain pairs of sentences connected through a discourse connective.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles that are labelled on the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled on the article level.
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, ranging from 3D reconstruction to visual reasoning and manipulation.
Housekeep is a benchmark for evaluating commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects, without explicit instructions specifying which objects need to be rearranged. The dataset captures where humans typically place objects in tidy and untidy houses, covering 1,799 objects, 268 object categories, 585 placements, and 105 rooms.
PACS (Physical Audiovisual CommonSense) is the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains a total of 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. The dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem.
1 PAPER • 1 BENCHMARK
This corpus supports the commonsense inference task for the Russian language. It consists of event phrases covering a wide range of everyday situations, labelled with the intents and reactions of the event's main participant and the emotions of other people involved.
A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years, with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, there remains a major gap between humans and AI systems in terms of the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality, allowing them to efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract shapes.
0 PAPER • NO BENCHMARKS YET