TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Reading Comprehension	AdversarialQA	RoBERTa-Large	D(BiDAF): F1	74.1	# 1
Reading Comprehension	AdversarialQA	RoBERTa-Large	D(BERT): F1	65.5	# 1
Reading Comprehension	AdversarialQA	RoBERTa-Large	D(RoBERTa): F1	53.4	# 2
Reading Comprehension	AdversarialQA	RoBERTa-Large	Overall: F1	64.4	# 1
Reading Comprehension	AdversarialQA	BERT-Large	D(BiDAF): F1	71.3	# 2
Reading Comprehension	AdversarialQA	BERT-Large	D(BERT): F1	62.4	# 2
Reading Comprehension	AdversarialQA	BERT-Large	D(RoBERTa): F1	54.4	# 1
Reading Comprehension	AdversarialQA	BERT-Large	Overall: F1	62.7	# 2
Reading Comprehension	AdversarialQA	BiDAF	D(BiDAF): F1	28.6	# 3
Reading Comprehension	AdversarialQA	BiDAF	D(BERT): F1	30.2	# 3
Reading Comprehension	AdversarialQA	BiDAF	D(RoBERTa): F1	26.7	# 3
Reading Comprehension	AdversarialQA	BiDAF	Overall: F1	28.5	# 3
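
As a quick sanity check on the table above, the sketch below compares each model's Overall F1 with the unweighted mean of its three subset scores. The page does not state how Overall is computed, so treating it as a simple mean is an assumption made here for illustration. The names `scores`, `subset_mean` and `gaps` are hypothetical helpers, not part of any released code.

```python
# F1 scores copied from the results table above.
scores = {
    "RoBERTa-Large": {"D(BiDAF)": 74.1, "D(BERT)": 65.5, "D(RoBERTa)": 53.4, "Overall": 64.4},
    "BERT-Large": {"D(BiDAF)": 71.3, "D(BERT)": 62.4, "D(RoBERTa)": 54.4, "Overall": 62.7},
    "BiDAF": {"D(BiDAF)": 28.6, "D(BERT)": 30.2, "D(RoBERTa)": 26.7, "Overall": 28.5},
}

SUBSETS = ("D(BiDAF)", "D(BERT)", "D(RoBERTa)")


def subset_mean(row):
    """Unweighted mean of the three per-subset F1 scores (assumed aggregation)."""
    return sum(row[name] for name in SUBSETS) / len(SUBSETS)


def gaps(table):
    """Reported Overall F1 minus the unweighted subset mean, per model."""
    return {model: round(row["Overall"] - subset_mean(row), 2) for model, row in table.items()}


if __name__ == "__main__":
    for model, gap in gaps(scores).items():
        print(f"{model}: reported Overall differs from subset mean by {gap:+.2f} F1")
```

For all three models the reported Overall value lies within 0.1 F1 of the unweighted mean, which is consistent with (but does not prove) a simple average over equally sized subsets.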

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/beat-the-ai-investigating-adversarial-human/reading-comprehension-on-adversarialqa)](https://paperswithcode.com/sota/reading-comprehension-on-adversarialqa?p=beat-the-ai-investigating-adversarial-human)`

Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

2 Feb 2020 · Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, Pontus Stenetorp

Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalisation to data collected without a model. We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9 F1 on questions that it cannot answer when trained on SQuAD, only marginally lower than when trained on data collected using RoBERTa itself (41.0 F1).
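
The F1 figures quoted in the abstract and in the results table are answer-span scores. The page does not include the evaluation script, so the following is only a minimal sketch that assumes the SQuAD-style token-overlap F1 commonly used for extractive reading comprehension. The helper names `normalize` and `token_f1` are hypothetical, and the returned value is on a 0 to 1 scale (multiply by 100 to compare with the numbers reported here).

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer span and one reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several reference answers, SQuAD-style evaluation takes the maximum score over the references and then averages over questions; that aggregation step is omitted here for brevity.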

Code

maxbartolo/adversarialQA official

Tasks

Reading Comprehension

Datasets

Introduced in the Paper:

AdversarialQA

Used in the Paper:

SQuAD

Natural Questions

DROP

Results from the Paper

Ranked #1 on Reading Comprehension on AdversarialQA (using extra training data)


Methods

Adam • Attention Dropout • BERT • Dense Connections • Dropout • GELU • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • RoBERTa • Scaled Dot-Product Attention • Softmax • Weight Decay • WordPiece
