TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Question Answering	BoolQ	BERT-MultiNLI 340M (fine-tuned)	Accuracy	80.4	# 26
Question Answering	BoolQ	GPT-1 117M (fine-tuned)	Accuracy	72.87	# 35
Question Answering	BoolQ	BiDAF + ELMo (fine-tuned)	Accuracy	71.41	# 36
Question Answering	BoolQ	BiDAF-MultiNLI (fine-tuned)	Accuracy	75.57	# 33
Question Answering	BoolQ	Majority baseline	Accuracy	62.17	# 46

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

NAACL 2019 · Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova

In this paper we study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy, compared to 90% accuracy for human annotators and 62% for the majority baseline, leaving a significant gap for future work.
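To illustrate how a majority-baseline accuracy such as the 62% figure above is typically computed, here is a minimal sketch. It assumes the baseline always predicts the most common training label; the function name `majority_baseline_accuracy` and the toy labels are hypothetical and are not taken from the paper or from BoolQ.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Return the most common training label and the accuracy obtained
    by predicting that label for every evaluation example."""
    if not train_labels or not eval_labels:
        raise ValueError("both label lists must be non-empty")
    # Pick the label that occurs most often in the training split.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    # Count evaluation examples whose gold label matches that constant guess.
    correct = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, correct / len(eval_labels)


# Toy yes/no labels invented for illustration; this is NOT BoolQ data.
toy_train = [True, True, True, False, False]
toy_eval = [True, False, True, True, False, True, False, True]

label, acc = majority_baseline_accuracy(toy_train, toy_eval)
print(f"majority label: {label}, accuracy: {acc:.3f}")
```

Any trained model should be compared against this number, since a classifier that ignores the passage entirely can already reach it.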

Code

google-research-datasets/boolean-qu…

131

Tasks

Question Answering

Reading Comprehension

Transfer Learning

Datasets

Introduced in the Paper:

BoolQ

Used in the Paper:

GLUE

SQuAD

MultiNLI

SNLI

QNLI

Natural Questions

MS MARCO

HotpotQA

RACE

CoQA

QuAC

ShARC

Results from the Paper

Ranked #26 on Question Answering on BoolQ

Methods

Adam • Attention Dropout • BERT • Dense Connections • Dropout • GELU • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay • WordPiece
