TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions

A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer questions such as "what happened before/after [some event]?" We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance.

PDF Abstract EMNLP 2020 PDF EMNLP 2020 Abstract

Datasets


Introduced in the Paper:

Torque

Used in the Paper:

SQuAD

Results from the Paper


Task Dataset Model Metric Name Metric Value Global Rank Result Benchmark
Question Answering Torque RoBERTa-large F1 75.2 # 2
EM 51.1 # 2
C 34.5 # 2

Methods


No methods listed for this paper. Add relevant methods here