TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Common Sense Reasoning	CODAH	BERT Large	Accuracy	69.6	# 1
Question Answering	CODAH	BERT Large	Accuracy	69.6	# 2

CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense

8 Apr 2019 · Michael Chen, Mike D'Arcy, Alisa Liu, Jared Fernandez, Doug Downey

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially constructed evaluation dataset for testing common sense. CODAH forms a challenging extension to the recently proposed SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video. To produce a more difficult dataset, we introduce a novel procedure for question acquisition in which workers author questions designed to target weaknesses of state-of-the-art neural question answering systems. Workers are rewarded for submissions that models fail to answer correctly both before and after fine-tuning (in cross-validation). We create 2.8k questions via this procedure and evaluate the performance of multiple state-of-the-art question answering systems on our dataset. We observe a significant gap between human performance, which is 95.3%, and the best baseline accuracy of 67.5%, achieved by the BERT-Large model.
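
The reward criterion described in the abstract can be illustrated with a small sketch. This is not the authors' code: the `Submission` record, its field names, and the all-or-nothing `fools_model` rule are assumptions made purely for illustration, and the sample predictions are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    """One worker-authored multiple-choice question (hypothetical record layout)."""
    correct_index: int  # index of the correct answer choice
    pred_before: int    # model's choice before fine-tuning
    pred_after: int     # model's choice after fine-tuning, from cross-validation


def fools_model(sub: Submission) -> bool:
    """True if the model answers incorrectly both before and after fine-tuning."""
    return (sub.pred_before != sub.correct_index
            and sub.pred_after != sub.correct_index)


def accuracy(subs: List[Submission]) -> float:
    """Fraction of questions the fine-tuned model answers correctly."""
    if not subs:
        raise ValueError("no submissions to score")
    return sum(s.pred_after == s.correct_index for s in subs) / len(subs)


# Invented example data, not taken from CODAH.
subs = [
    Submission(correct_index=0, pred_before=2, pred_after=1),  # wrong both times
    Submission(correct_index=3, pred_before=1, pred_after=3),  # fixed by fine-tuning
    Submission(correct_index=1, pred_before=1, pred_after=1),  # never fooled
]
rewarded = [s for s in subs if fools_model(s)]
```

Under this assumed rule only the first submission would count as fooling the model, and `accuracy(subs)` gives the kind of score reported in the Accuracy column of the results table.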
