DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles labelled at the publisher level and 776 articles labelled at the article level through crowdsourcing on an internal survey platform.
1 PAPER • NO BENCHMARKS YET
Event2Mind is a corpus of 25,000 event phrases covering a diverse range of everyday events and situations.
19 PAPERS • 2 BENCHMARKS
HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).
487 PAPERS • 1 BENCHMARK
Consists of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG). The problems take the form of geometric figures: each has a set of geometric shapes as its context, with number symbols embedded in them.
2 PAPERS • NO BENCHMARKS YET
Contains 13.6k masked-word-prediction probes, 10.5k for fine-tuning and 3.1k for testing.
5 PAPERS • NO BENCHMARKS YET
Useful for two applications: automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total).
27 PAPERS • NO BENCHMARKS YET
Benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
32 PAPERS • 2 BENCHMARKS
ProtoQA is a question answering dataset for training and evaluating the common sense reasoning capabilities of artificial intelligence systems in prototypical situations. The training set is gathered from an existing set of questions played on the long-running international game show Family Feud. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.
9 PAPERS • NO BENCHMARKS YET
WinoGrande is a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations.
377 PAPERS • 1 BENCHMARK