We investigate the extent to which the errors of current coreference resolution models are associated with differences in how coreference is operationalized across datasets (OntoNotes, PreCo, and WinoGrande).
In this work, we propose a test suite of coreference resolution subtasks that require reasoning over multiple facts.
There are many ways to express similar things in text, which makes evaluating natural language generation (NLG) systems difficult.
On average, a conversation in our dataset spans 13 question-answer turns and involves four topics (documents).
A false contract is more likely to be rejected than a contract is, yet a false key is less likely than a key to open doors.
Understanding natural language requires common sense, one aspect of which is the ability to discern the plausibility of events.
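A simple way to make "plausibility of events" operational is to score event descriptions with a language model. The sketch below is a minimal illustration using GPT-2 total log-likelihood; the model choice and example events are assumptions for illustration, not any cited paper's method.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def total_log_likelihood(text: str) -> float:
    """Sum of token log-probabilities under GPT-2 (higher = judged more plausible)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over the predicted tokens
    return -loss.item() * (ids.size(1) - 1)

print(total_log_likelihood("The man swallowed the paintball."))  # expected: higher score
print(total_log_likelihood("The paintball swallowed the man."))
```

Distributional scores of this kind are exactly what later work on physical plausibility has found insufficient when evaluated in a supervised setting.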
In particular, we demonstrate through a simple consistency probe that the ability to correctly retrieve hypernyms in cloze tasks, as used in prior work, does not correspond to systematic knowledge in BERT.
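As a concrete illustration of such a probe, a cloze-style hypernym query and a paraphrased consistency check can be run with the standard transformers fill-mask pipeline; the prompts below sketch the general idea and are not the paper's exact probe.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Cloze-style hypernym query, in the style of prior probing work.
for pred in fill("A robin is a [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))

# Consistency check: a paraphrase should recover the same hypernym
# if the knowledge is systematic rather than prompt-specific.
for pred in fill("Robins are a kind of [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```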
The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR).
Previous work has focused specifically on modeling physical plausibility and shown that distributional methods fail when tested in a supervised setting.
In this paper, we propose a method for incorporating world knowledge (linked entities and fine-grained entity types) into a neural question generation model.
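One common way to feed such world knowledge into a neural encoder is to concatenate entity-type embeddings with word embeddings at the input layer. The sketch below is an illustration of that general pattern, with an assumed vocabulary, type inventory, and sizes; it is not the proposed model.

```python
import torch
import torch.nn as nn

class KnowledgeAwareEmbedder(nn.Module):
    """Concatenate word embeddings with fine-grained entity-type embeddings."""

    def __init__(self, vocab: int = 1000, n_types: int = 50, dim: int = 64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.type_emb = nn.Embedding(n_types, dim)  # type id 0 = "not an entity"

    def forward(self, words: torch.Tensor, types: torch.Tensor) -> torch.Tensor:
        # words, types: (batch, seq_len) token and entity-type ids
        return torch.cat([self.word_emb(words), self.type_emb(types)], dim=-1)

emb = KnowledgeAwareEmbedder()
x = emb(torch.randint(0, 1000, (2, 10)), torch.randint(0, 50, (2, 10)))
print(x.shape)  # torch.Size([2, 10, 128])
```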
We propose a two-agent game in which a questioner must pose questions that discriminate between sentences, incorporate responses from an answerer, and maintain a hypothesis state.
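A toy instantiation of this game, with word-membership questions standing in for a learned questioner (the sentence pool, question form, and agent interfaces are all hypothetical), shows how the hypothesis state shrinks as answers arrive:

```python
import random

SENTENCES = [
    "the cat sat on the mat",
    "the cat sat on the chair",
    "a dog slept on the mat",
]

def answerer(target: str, word: str) -> bool:
    """Truthfully answer whether the secret target sentence contains the word."""
    return word in target.split()

def questioner(candidates: list[str], target: str) -> str:
    """Ask discriminating questions until one hypothesis remains."""
    hypothesis = list(candidates)
    while len(hypothesis) > 1:
        vocab = {w for s in hypothesis for w in s.split()}
        # Pick the word that splits the hypothesis set most evenly.
        word = min(
            vocab,
            key=lambda w: abs(sum(w in s.split() for s in hypothesis) - len(hypothesis) / 2),
        )
        ans = answerer(target, word)
        # Keep only candidates consistent with the answer.
        hypothesis = [s for s in hypothesis if (word in s.split()) == ans]
    return hypothesis[0]

target = random.choice(SENTENCES)
print(questioner(SENTENCES, target), "| target:", target)
```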
Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG.
To explain this performance gap, we show empirically that state-of-the-art models often fail to capture context, instead relying on the gender or number of candidate antecedents to make a decision.
We introduce an automatic system that achieves state-of-the-art results on the Winograd Schema Challenge (WSC), a common sense reasoning task that requires diverse, complex forms of inference and knowledge.
We introduce an automatic system that performs well on two common-sense reasoning tasks, the Winograd Schema Challenge (WSC) and the Choice of Plausible Alternatives (COPA).
We developed this dataset to study the role of memory in goal-oriented dialogue systems.
We present NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs.
The model takes as input a sequence of dialogue contexts and outputs a sequence of dialogue acts corresponding to user intentions.
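Read as an architecture, one plausible sketch is a hierarchical encoder (word level, then turn level) with a per-turn act classifier; the dimensions and act inventory below are assumptions, not the described model.

```python
import torch
import torch.nn as nn

class ActTagger(nn.Module):
    """Map a sequence of dialogue contexts to a sequence of dialogue acts."""

    def __init__(self, vocab: int, acts: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.turn_enc = nn.GRU(dim, dim, batch_first=True)    # words -> turn vector
        self.dialog_enc = nn.GRU(dim, dim, batch_first=True)  # turns -> contextual states
        self.out = nn.Linear(dim, acts)

    def forward(self, turns: torch.Tensor) -> torch.Tensor:
        # turns: (batch, n_turns, n_words) of token ids
        b, t, w = turns.shape
        words = self.emb(turns.view(b * t, w))
        _, h = self.turn_enc(words)                 # h: (1, b*t, dim)
        turn_vecs = h.squeeze(0).view(b, t, -1)
        states, _ = self.dialog_enc(turn_vecs)      # (b, t, dim)
        return self.out(states)                     # one act distribution per turn

logits = ActTagger(vocab=1000, acts=20)(torch.randint(0, 1000, (2, 5, 12)))
print(logits.shape)  # torch.Size([2, 5, 20])
```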
Natural language generation plays a critical role in spoken dialogue systems.
Indeed, with only a few hundred dialogues collected under a handcrafted policy, the actor-critic deep learner can be effectively bootstrapped with a combination of supervised learning and batch reinforcement learning.
We present the EpiReader, a novel model for machine comprehension of text.
The parallel hierarchy enables our model to compare the passage, question, and answer from a variety of trainable perspectives, as opposed to using a manually designed, rigid feature set.
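The "trainable perspectives" idea can be sketched as a bank of learned vectors, each reweighting the dimensions along which two encodings are compared; the minimal version below (sizes and the elementwise form are assumptions) illustrates the contrast with a fixed, hand-designed feature set.

```python
import torch
import torch.nn as nn

class MultiPerspectiveMatch(nn.Module):
    """Compare two encodings under several learned reweightings ("perspectives")."""

    def __init__(self, hidden: int = 128, perspectives: int = 8):
        super().__init__()
        self.w = nn.Parameter(torch.randn(perspectives, hidden))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, hidden) encodings of e.g. passage and candidate answer
        a_p = a.unsqueeze(1) * self.w   # (batch, perspectives, hidden)
        b_p = b.unsqueeze(1) * self.w
        return nn.functional.cosine_similarity(a_p, b_p, dim=-1)  # (batch, perspectives)

match = MultiPerspectiveMatch()
scores = match(torch.randn(4, 128), torch.randn(4, 128))
print(scores.shape)  # torch.Size([4, 8])
```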