Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Sentence Completion	HellaSwag	BERT-Large 340M	Accuracy	47.3	#67
Sentence Completion	HellaSwag	BERT-Base 110M	Accuracy	40.5	#73
Sentence Completion	HellaSwag	GPT-1 117M	Accuracy	41.7	#69
Sentence Completion	HellaSwag	ESIM + ELMo	Accuracy	33.3	#78
Sentence Completion	HellaSwag	LSTM + BERT-Base	Accuracy	36.2	#76
Sentence Completion	HellaSwag	LSTM + ELMo	Accuracy	31.4	#82
Sentence Completion	HellaSwag	LSTM + GloVe	Accuracy	31.7	#80
Sentence Completion	HellaSwag	fastText	Accuracy	31.6	#81
Sentence Completion	HellaSwag	Random chance baseline	Accuracy	25.0	#85

HellaSwag: Can a Machine Really Finish Your Sentence?

ACL 2019 · Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely follow-up: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human-level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
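
The Adversarial Filtering loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the word-count "discriminator", the `pool` of spare machine-generated candidates, the function names, and the invented distractor sentences are all assumptions made for illustration. Each round trains a discriminator on one half of the data, then, for every held-out example, keeps as wrong answers the candidates that the discriminator finds hardest to flag as machine-written.

```python
import copy
import random
from collections import Counter


def train_discriminator(examples):
    """Toy stand-in for a learned discriminator (an assumption, not the
    paper's model): count words in human-written (gold) endings and in
    machine-generated (distractor) endings."""
    human, machine = Counter(), Counter()
    for ex in examples:
        human.update(ex["gold"].split())
        for ending in ex["distractors"]:
            machine.update(ending.split())

    def machine_score(ending):
        # Higher score = easier to flag as machine-generated.
        return sum(machine[w] - human[w] for w in ending.split())

    return machine_score


def adversarial_filtering(dataset, rounds=3, seed=0):
    """Simplified AF loop: re-rank each held-out example's candidates so
    the ones the current discriminator finds hardest become distractors."""
    data = copy.deepcopy(dataset)  # leave the caller's data untouched
    rng = random.Random(seed)
    for _ in range(rounds):
        # Fresh random train/test split in every round.
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        machine_score = train_discriminator(train)
        for ex in test:
            k = len(ex["distractors"])
            candidates = ex["distractors"] + ex["pool"]
            # Lowest machine_score first: the most adversarial candidates.
            candidates.sort(key=machine_score)
            ex["distractors"], ex["pool"] = candidates[:k], candidates[k:]
    return data


# Hypothetical toy data; only the first context and gold ending come from
# the abstract, everything else is invented for illustration.
toy = [
    {"context": "A woman sits at a piano.",
     "gold": "She sets her fingers on the keys.",
     "distractors": ["She piano piano the the."],
     "pool": ["She starts to play a song."]},
    {"context": "A man opens a door.",
     "gold": "He walks into the room.",
     "distractors": ["He door door door the."],
     "pool": ["He steps outside slowly."]},
]
filtered = adversarial_filtering(toy)
```

In this simplification the set of candidates per example never changes; only the split between chosen distractors and the spare pool does. A realistic setup would use a trained neural discriminator and a language-model generator to supply the candidates.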

Code

facebookresearch/text_characterizat…

PlusLabNLP/Plot-guided-Coherence-Ev…

Tasks

Natural Language Inference

Sentence

Sentence Completion

Datasets

Introduced in the Paper:

HellaSwag

Used in the Paper:

ActivityNet Captions

SWAG

Results from the Paper

Ranked #67 on Sentence Completion on HellaSwag

Methods

Adam • Attention Dropout • BERT • Dense Connections • Dropout • GELU • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay • WordPiece
