HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

EMNLP 2018

Datasets


Introduced in the Paper:

HotpotQA

Used in the Paper:

SQuAD, TriviaQA, SearchQA

Results from the Paper


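Task                Dataset   Model           Metric Name  Metric Value  Global Rank
Question Answering  HotpotQA  SAFSR model     ANS-EM       0.589         # 30
Question Answering  HotpotQA  SAFSR model     ANS-F1       0.716         # 32
Question Answering  HotpotQA  SAFSR model     SUP-EM       0.480         # 31
Question Answering  HotpotQA  SAFSR model     SUP-F1       0.757         # 31
Question Answering  HotpotQA  SAFSR model     JOINT-EM     0.345         # 29
Question Answering  HotpotQA  SAFSR model     JOINT-F1     0.598         # 35
Question Answering  HotpotQA  Baseline Model  ANS-EM       0.240         # 66
Question Answering  HotpotQA  Baseline Model  ANS-F1       0.329         # 68
Question Answering  HotpotQA  Baseline Model  SUP-EM       0.039         # 62
Question Answering  HotpotQA  Baseline Model  SUP-F1       0.377         # 64
Question Answering  HotpotQA  Baseline Model  JOINT-EM     0.019         # 62
Question Answering  HotpotQA  Baseline Model  JOINT-F1     0.162         # 66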
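
The metric names above come in three groups: ANS scores the predicted answer string, SUP scores the predicted set of supporting-fact sentences, and JOINT combines the two. The following is a minimal sketch of how such per-example scores could be computed. It assumes a simplified normalization (lowercasing and whitespace tokenization only) and assumes that joint precision and recall are the products of the answer and supporting-fact values. The function names are hypothetical and this is not the official evaluation script, so numbers produced by it may differ from leaderboard values.

```python
from collections import Counter


def exact_match(pred, gold):
    """ANS-EM for one example: 1.0 if the normalized strings are identical."""
    # Assumption: normalization here is only strip + lowercase.
    return float(pred.strip().lower() == gold.strip().lower())


def token_prf(pred, gold):
    """Token-overlap precision, recall, and F1 between two answer strings."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def support_scores(pred_facts, gold_facts):
    """SUP-EM, precision, recall, and F1 over sets of (title, sentence_index)."""
    pred, gold = set(pred_facts), set(gold_facts)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return float(pred == gold), precision, recall, f1


def joint_scores(ans_em, ans_p, ans_r, sup_em, sup_p, sup_r):
    """JOINT-EM and JOINT-F1, assuming product-style combination."""
    joint_p = ans_p * sup_p
    joint_r = ans_r * sup_r
    denom = joint_p + joint_r
    joint_f1 = 2 * joint_p * joint_r / denom if denom > 0 else 0.0
    return ans_em * sup_em, joint_f1


# Toy example with made-up predictions (not taken from the dataset).
ans_em = exact_match("the cat sat", "cat sat down")
ans_p, ans_r, ans_f1 = token_prf("the cat sat", "cat sat down")
sup_em, sup_p, sup_r, sup_f1 = support_scores(
    [("Doc A", 0), ("Doc B", 1)],
    [("Doc A", 0), ("Doc B", 2)],
)
joint_em, joint_f1 = joint_scores(ans_em, ans_p, ans_r, sup_em, sup_p, sup_r)
```

Under this sketch a joint score is high only when both the answer and its supporting facts are right, which is consistent with the JOINT rows in the table being lower than the corresponding ANS and SUP rows. Dataset-level values would be the average of these per-example scores.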
