Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Question Answering	HotpotQA	SAFSR model	ANS-EM	0.589	#30
Question Answering	HotpotQA	SAFSR model	ANS-F1	0.716	#31
Question Answering	HotpotQA	SAFSR model	SUP-EM	0.480	#31
Question Answering	HotpotQA	SAFSR model	SUP-F1	0.757	#30
Question Answering	HotpotQA	SAFSR model	JOINT-EM	0.345	#29
Question Answering	HotpotQA	SAFSR model	JOINT-F1	0.598	#34
Question Answering	HotpotQA	Baseline Model	ANS-EM	0.240	#67
Question Answering	HotpotQA	Baseline Model	ANS-F1	0.329	#68
Question Answering	HotpotQA	Baseline Model	SUP-EM	0.039	#61
Question Answering	HotpotQA	Baseline Model	SUP-F1	0.377	#63
Question Answering	HotpotQA	Baseline Model	JOINT-EM	0.019	#61
Question Answering	HotpotQA	Baseline Model	JOINT-F1	0.162	#64


HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

EMNLP 2018 · Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning

Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
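The leaderboard for this paper reports three metric families: ANS (answer), SUP (supporting facts), and JOINT, each as exact match (EM) and F1. The following is a minimal sketch of how such scores are commonly computed for a single example, assuming SQuAD-style answer normalization and supporting facts represented as (title, sentence index) pairs. All function names here are hypothetical, and the official evaluation script in hotpotqa/hotpot may differ in details (for instance, in how yes/no answers are handled).

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def prf(num_same, num_pred, num_gold):
    """Precision, recall, and F1 from overlap counts; all zero if no overlap."""
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / num_pred
    recall = num_same / num_gold
    return precision, recall, 2 * precision * recall / (precision + recall)


def answer_scores(pred, gold):
    """Return (EM, precision, recall, F1) over normalized answer tokens."""
    pred_norm, gold_norm = normalize(pred), normalize(gold)
    pred_tokens, gold_tokens = pred_norm.split(), gold_norm.split()
    # Multiset intersection counts tokens shared by prediction and gold.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    em = float(pred_norm == gold_norm)
    return (em, *prf(num_same, len(pred_tokens), len(gold_tokens)))


def support_scores(pred_facts, gold_facts):
    """Return (EM, precision, recall, F1) over sets of (title, sent_idx)."""
    pred = {tuple(f) for f in pred_facts}
    gold = {tuple(f) for f in gold_facts}
    em = float(pred == gold)
    return (em, *prf(len(pred & gold), len(pred), len(gold)))


def joint_scores(ans, sup):
    """Assumed joint definition: multiply EM, precision, and recall."""
    ans_em, ans_p, ans_r, _ = ans
    sup_em, sup_p, sup_r, _ = sup
    p, r = ans_p * sup_p, ans_r * sup_r
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return ans_em * sup_em, f1


# Illustrative (made-up) prediction: correct answer, one extra supporting fact.
ans = answer_scores("the Eiffel Tower", "Eiffel Tower")
sup = support_scores([("A", 0), ("B", 1), ("C", 2)], [("A", 0), ("B", 1)])
joint_em, joint_f1 = joint_scores(ans, sup)
print(ans[0], round(sup[3], 3), joint_em, round(joint_f1, 3))
```

Under this definition, a prediction scores on JOINT-EM only if both the answer and the full set of supporting facts are exactly right, which is consistent with the JOINT numbers in the table being lower than either ANS or SUP alone.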

Code

hotpotqa/hotpot (official, 401 stars)

Tasks

Multi-hop Question Answering

Question Answering

Sentence

Datasets

Introduced in the Paper:

HotpotQA

Used in the Paper:

SQuAD

TriviaQA

SearchQA

Results from the Paper

Ranked #34 on Question Answering on HotpotQA
