We evaluate a range of reading comprehension (RC) models on our evaluation sets, revealing large performance gaps on the generated examples compared to the original data.
In this work, we examine the behaviour of non-target heads, that is, the output of heads when given input that belongs to a different task than the one they were trained for.
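A minimal sketch of this setup, under assumed architecture details (the task names, dimensions, and head layout below are illustrative, not the paper's exact model): a shared encoder feeds one output head per task, and non-target behaviour is probed by routing an input through a head trained for a different task.

```python
import torch
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """Shared encoder with one classification head per task."""

    def __init__(self, input_dim: int, hidden_dim: int, num_labels: dict):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One head per task, e.g. {"nli": 3, "sentiment": 2} (hypothetical tasks).
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, n) for task, n in num_labels.items()}
        )

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.heads[task](self.encoder(x))

model = MultiHeadModel(input_dim=16, hidden_dim=32,
                       num_labels={"nli": 3, "sentiment": 2})
x_nli = torch.randn(4, 16)  # a batch of inputs belonging to the "nli" task

target_logits = model(x_nli, task="nli")            # the head matching the input's task
non_target_logits = model(x_nli, task="sentiment")  # a non-target head's output
print(target_logits.shape, non_target_logits.shape)
```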
A key limitation of current datasets for multi-hop reasoning is that the steps required to answer the question are explicitly mentioned in the question itself.
In this work, we show that numerical reasoning is amenable to automatic data generation, and thus one can inject this skill into pre-trained LMs by generating large amounts of data and training in a multi-task setup.
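A toy sketch of such automatic data generation; the templates and field names below are my own illustration, not the paper's actual generation scheme:

```python
import random

# Each template pairs a surface form with the arithmetic that yields the answer.
TEMPLATES = [
    ("The team scored {a} points in the first half and {b} in the second. "
     "How many points did they score in total?", lambda a, b: a + b),
    ("A store had {a} items and sold {b}. How many items are left?",
     lambda a, b: a - b),
]

def generate_example(rng: random.Random) -> dict:
    template, solve = rng.choice(TEMPLATES)
    a, b = rng.randint(50, 500), rng.randint(1, 49)
    return {"question": template.format(a=a, b=b), "answer": str(solve(a, b))}

rng = random.Random(0)
synthetic_data = [generate_example(rng) for _ in range(100_000)]
print(synthetic_data[0])
# These generated examples would then be mixed with the original training
# objective in a multi-task setup to fine-tune the pre-trained LM.
```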
Understanding natural language questions entails the ability to break down a question into the requisite steps for computing its answer.
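One plausible shape for such a step-by-step breakdown, hand-written here for exposition (the question and the `#k` back-reference convention below are illustrative assumptions): each step is an atomic operation that may refer to the results of earlier steps.

```python
question = "What is the longest river in the state with the largest population?"

# Hand-written decomposition; "#k" refers to the output of step k.
decomposition = [
    "return states",
    "return population of #1",
    "return #1 where #2 is highest",
    "return rivers of #3",
    "return #4 that is longest",
]

for i, step in enumerate(decomposition, start=1):
    print(f"{i}. {step}")
```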
Crowdsourcing has been the prevalent paradigm for creating natural language understanding datasets in recent years.
We author a set of rules for identifying a diverse set of discourse phenomena in raw text, and decomposing the text into two independent sentences.
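A toy sketch of one such rule, far simpler than the paper's actual rule set (the connective list and splitting logic are my own illustration): detect a contrastive discourse connective and split the sentence at it into two independent sentences.

```python
import re

# Hypothetical mini-inventory of discourse connectives for this one rule.
CONNECTIVES = r"\b(?:but|however|although|because)\b"

def decompose(sentence: str):
    """Split a sentence at a discourse connective into two sentences."""
    match = re.search(rf",?\s*{CONNECTIVES}\s+", sentence, flags=re.IGNORECASE)
    if match is None:
        return None  # no discourse phenomenon detected by this rule
    first = sentence[:match.start()].strip().rstrip(",") + "."
    second = sentence[match.end():].strip()
    return first, second[0].upper() + second[1:]

print(decompose("The experiment succeeded, but the results were hard to reproduce."))
# -> ('The experiment succeeded.', 'The results were hard to reproduce.')
```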
Training agents to communicate with one another given only task-based supervision has attracted considerable attention recently, due to the growing interest in developing models for human-agent interaction.
Semantic parsing shines at analyzing complex natural language that involves composition and computation over multiple pieces of evidence.
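To make the claim concrete, here is a hand-written illustration of the kind of composition involved (the question, logical form, and tiny knowledge base are all invented for exposition, not output of any parser discussed here):

```python
question = "How many cities in Texas have a population above one million?"

# An illustrative logical form composing a filter and a count over evidence.
logical_form = "count(filter(cities_in('Texas'), lambda c: population(c) > 1_000_000))"

# Executing the same composition over a toy knowledge base:
KB = {"Texas": {"Houston": 2_300_000, "San Antonio": 1_500_000, "Austin": 960_000}}
answer = sum(1 for pop in KB["Texas"].values() if pop > 1_000_000)
print(answer)  # 2
```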