To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just “good enough” in the context of imperfect QA datasets.
Generating a summary from findings has recently been explored (Zhang et al., 2018, 2020) for note types such as radiology reports, which are typically short.
We present a series of programming assignments, adaptable to a range of experience levels from advanced undergraduate to PhD, to teach students design and implementation of modern NLP systems.
Very large language models such as GPT-3 have shown impressive performance across a wide variety of tasks, including text summarization.
By perturbing explanations on three controlled tasks, we show that both factors contribute to the effectiveness of explanations, indicating that LLMs do faithfully follow the explanations to some extent.
A growing body of work studies how to answer a question or verify a claim by generating a natural language "proof": a chain of deductive inferences yielding the answer based on a set of premises.
We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion: given a few target-domain examples and a set of models with similar training performance, can we understand how these models will perform on OOD test data?
Despite recent progress in abstractive summarization, models often generate summaries with factual errors.
We develop a first-of-its-kind QUD parser that derives a dependency structure of questions over full documents, trained on DCQA, a large question-answering dataset annotated in a manner consistent with the QUD framework.
In this paper, we study the impact of such large language models on text summarization, focusing on the classic benchmark domain of news summarization.
The propensity of abstractive summarization systems to make factual errors has been the subject of significant study, including work on models to detect factual errors and annotation of errors in current systems' outputs.
In this work, we introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
Verifying complex political claims is a challenging task, especially when politicians use various tactics to subtly misrepresent the facts.
Given its wide coverage of entity knowledge and temporal indexing, our dataset can be used to evaluate LMs and techniques designed to modify or extend their knowledge.
In settings from fact-checking to question answering, we frequently want to know whether a collection of evidence (premises) entails a hypothesis.
Conditional neural text generation models generate high-quality outputs, but often concentrate around a mode when what we really want is a diverse set of options.
While there has been substantial progress in text comprehension through simple factoid question answering, more holistic comprehension of a discourse still presents a major challenge (Dunietz et al., 2020).
Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, such as abstractiveness and hallucination, we study what the model learns at different stages of its fine-tuning process.
In this paper, we present a unified cross-lingual fine-grained entity typing model capable of handling over 100 languages and analyze this model's ability to generalize to languages and entities unseen during training.
We show that regularization with small amounts of evidence supervision during training can substantially improve the quality of extracted evidence.
Our approach first extracts a set of features combining human intuition about the task with model attributions generated by black box interpretation techniques, then uses a simple calibrator, in the form of a classifier, to predict whether the base model was correct or not.
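A minimal sketch of such a calibrator, assuming per-example feature vectors (e.g., model confidence and attribution-derived statistics; the specific features here are hypothetical) and gold correctness labels are available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per QA example: model confidence, attribution
# mass on the predicted span, question/context overlap.
X_train = np.array([
    [0.92, 0.61, 0.30],
    [0.55, 0.12, 0.05],
    [0.88, 0.70, 0.45],
    [0.40, 0.08, 0.10],
])
# 1 = the base model answered correctly, 0 = incorrectly.
y_train = np.array([1, 0, 1, 0])

# The calibrator itself is deliberately simple: a linear classifier
# over the hand-built + attribution-derived features.
calibrator = LogisticRegression().fit(X_train, y_train)

X_new = np.array([[0.75, 0.50, 0.20]])
p_correct = calibrator.predict_proba(X_new)[0, 1]
print(f"estimated probability the base model is correct: {p_correct:.2f}")
```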
We introduce CREAK, a testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities (Harry Potter is a wizard and is skilled at riding a broomstick) with commonsense inferences (if you're good at a skill you can teach others how to do it).
Despite the prominence of neural abstractive summarization models, we know little about how they actually form summaries and how to understand where their decisions come from.
Recent pre-trained abstractive summarization systems have started to achieve credible performance, but a major barrier to their use in practice is their propensity to output summaries that are not faithful to the input and that contain factual errors.
When a model attribution technique highlights a particular part of the input, a user might understand this highlight as making a statement about counterfactuals (Miller, 2019): if that part of the input were to change, the model's prediction might change as well.
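A toy illustration of this counterfactual reading, using a small bag-of-words classifier as a stand-in for the model being explained (the data and attribution scheme are illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentiment classifier (stand-in for the model being explained).
texts = ["great movie", "terrible movie", "great acting", "terrible plot"]
labels = [1, 0, 1, 0]
vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Attribution for a linear model: weight * feature value per token.
sentence = "great plot"
x = vec.transform([sentence]).toarray()[0]
attributions = clf.coef_[0] * x
top = np.argmax(np.abs(attributions))
print("highlighted token:", vec.get_feature_names_out()[top])

# Counterfactual reading of the highlight: remove that token and
# check whether the prediction actually changes.
x_cf = x.copy()
x_cf[top] = 0
before = clf.predict_proba([x])[0, 1]
after = clf.predict_proba([x_cf])[0, 1]
print(f"p(positive) before={before:.2f}, after removing token={after:.2f}")
```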
While numerous methods have been proposed as defenses against adversarial examples in question answering (QA), these techniques are often model specific, require retraining of the model, and give only marginal improvements in performance over vanilla models.
Neural entity typing models typically represent fine-grained entity types as vectors in a high-dimensional space, but such spaces are not well-suited to modeling these types' complex interdependencies.
We propose a single model that addresses both temporal ordering, sorting given events into the order they occurred, and event infilling, predicting new events which fit into an existing temporally-ordered sequence.
A principal barrier to training temporal relation extraction models in new domains is the lack of varied, high-quality examples and the challenge of collecting more.
Compressive summarization systems typically rely on a crafted set of syntactic rules to determine what spans of possible summary sentences can be deleted, then learn a model of what to actually delete by optimizing for content selection (ROUGE).
An advantage of seq2seq abstractive summarization models is that they generate text in a free-form manner, but this flexibility makes it difficult to interpret model behavior.
Experiments show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods or those based on question generation, while additionally localizing the erroneous parts of the generation.
Multimodal program synthesis, which leverages different types of user input to synthesize a desired program, is an attractive way to scale program synthesis to challenging settings; however, it requires integrating noisy signals from the user, like natural language, with hard constraints on the program's behavior.
Inquisitive probing questions come naturally to humans in a variety of settings, but generating them is a challenging task for automatic systems.
Current methods in open-domain question answering (QA) usually employ a pipeline of first retrieving relevant documents, then applying strong reading comprehension (RC) models to that retrieved text.
We propose a method for controlled narrative/story generation where we are able to guide the model to produce coherent narratives with user-specified target endings by interpolation: for example, we are told that Jim went hiking and at the end Jim needed to be rescued, and we want the model to incrementally generate steps along the way.
Paraphrasing natural language sentences is a multifaceted process: it might involve replacing individual words or short phrases, local rearrangement of content, or high-level restructuring like topicalization or passivization.
Existing datasets for regular expression (regex) generation from natural language are limited in complexity; compared to regex tasks that users post on StackOverflow, the regexes in these datasets are simple, and the language used to describe them is not diverse.
Current textual question answering models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns in the data, so they fail to generalize to out-of-distribution settings.
On entity probing tasks involving recognizing entity identity, our embeddings, used in parameter-free downstream models, achieve performance competitive with ELMo- and BERT-based embeddings used in trained models.
Given this program abstraction, we then use a graph neural network to propagate information between related type variables and eventually make type predictions.
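A minimal message-passing sketch of this idea, where type variables are nodes and edges link related variables; the graph construction, dimensions, and update rule here are made up for illustration:

```python
import torch
import torch.nn as nn

class TypePropagator(nn.Module):
    """Each round mixes a node's state with the mean of its
    neighbors' states, then predicts a type per variable."""

    def __init__(self, dim, num_types, rounds=3):
        super().__init__()
        self.update = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, num_types)
        self.rounds = rounds

    def forward(self, node_states, adj):
        # adj: (N, N) 0/1 adjacency over type variables.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = node_states
        for _ in range(self.rounds):
            messages = adj @ h / deg      # mean over neighbors
            h = self.update(messages, h)  # gated state update
        return self.out(h)                # per-variable type logits

# Hypothetical toy graph: 4 type variables, two related pairs.
h0 = torch.randn(4, 32)
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 0, 0],
                    [0, 0, 0, 1], [0, 0, 1, 0]])
logits = TypePropagator(dim=32, num_types=10)(h0, adj)
print(logits.shape)  # torch.Size([4, 10])
```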
We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE's greedy construction procedure.
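A toy contrast between the two schemes: BPE applies learned merges greedily in priority order, while a unigram LM picks the globally highest-scoring segmentation via dynamic programming. The merge list and subword probabilities below are made up so that the two segmentations diverge:

```python
import math

def bpe_segment(word, merges):
    """Apply BPE merges greedily in priority order (toy version)."""
    symbols = list(word)
    for a, b in merges:  # merges learned earlier, in priority order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

def unigram_segment(word, logprobs):
    """Viterbi segmentation under a unigram LM over subwords."""
    n = len(word)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            piece = word[i:j]
            if piece in logprobs and best[i] + logprobs[piece] > best[j]:
                best[j] = best[i] + logprobs[piece]
                back[j] = i
    pieces, j = [], n
    while j > 0:
        pieces.append(word[back[j]:j])
        j = back[j]
    return pieces[::-1]

merges = [("h", "u"), ("hu", "g")]
logprobs = {"hu": -1.0, "gs": -1.2, "hug": -2.5, "s": -2.5,
            "h": -6.0, "u": -6.0, "g": -6.0}
print(bpe_segment("hugs", merges))        # ['hug', 's']: merge order decides
print(unigram_segment("hugs", logprobs))  # ['hu', 'gs']: globally best split
```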
Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated.
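Calibration in this sense is commonly measured with expected calibration error (ECE); a minimal sketch with hypothetical model confidences and correctness labels:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and compare each bin's
    average confidence to its empirical accuracy."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical outputs: an overconfident model has high confidence
# but lower accuracy, yielding a large ECE.
conf = [0.95, 0.92, 0.90, 0.88, 0.60, 0.55]
corr = [1, 0, 1, 0, 1, 0]
print(f"ECE = {expected_calibration_error(conf, corr):.3f}")
```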
Our analysis shows the properties of chains that are crucial for high performance: in particular, modeling extraction sequentially is important, as is dealing with each candidate sentence in a context-aware way.
For this problem, a domain is characterized not just by genre of text but even by factors as specific as the particular distribution of entities, as neural models tend to overfit by memorizing properties of frequent entities in a dataset.
Tracking entities in procedural language requires understanding the transformations arising from actions on entities as well as those entities' interactions.
Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, which existing neural systems completely fail on.
Sequence-to-sequence models for open-domain dialogue generation tend to favor generic, uninformative responses.
In this work, we propose a two-stage procedure for handling this type of data: denoise it with a learned model, then train our final model on clean and denoised distant data with standard supervised training.
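A schematic sketch of the two-stage procedure, with toy numeric features standing in for real task features (the actual denoiser and task model would be learned neural models, not the linear classifiers used here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Small clean set and a larger distantly supervised set whose labels
# are noisy (toy numeric features stand in for real NLP features).
X_clean = rng.normal(size=(50, 5))
y_clean = (X_clean[:, 0] > 0).astype(int)
X_distant = rng.normal(size=(500, 5))
y_distant = (X_distant[:, 0] > 0).astype(int)
flip = rng.random(500) < 0.3          # 30% label noise
y_distant[flip] = 1 - y_distant[flip]

# Stage 1: learn a denoiser from the clean data and relabel the
# distant examples with its predictions.
denoiser = LogisticRegression().fit(X_clean, y_clean)
y_denoised = denoiser.predict(X_distant)

# Stage 2: standard supervised training on clean + denoised data.
X_all = np.vstack([X_clean, X_distant])
y_all = np.concatenate([y_clean, y_denoised])
final_model = LogisticRegression().fit(X_all, y_all)
```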
First, we explore sentence-factored models for these tasks; by design, these models cannot do multi-hop reasoning, but they are still able to solve a large number of examples in both datasets.
The global discrete state structure is explicitly modeled with a neural CRF over the changing hidden representation of the entity.
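A minimal sketch of linear-chain CRF (Viterbi) decoding over discrete entity states; the state inventory, emission scores, and transition constraints below are hypothetical, with emissions standing in for scores computed from the entity's hidden representation:

```python
import torch

def crf_viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF: emissions (T, S) score
    each discrete entity state per step; transitions (S, S) score
    moves from one state to the next."""
    T, S = emissions.shape
    score = emissions[0]
    back = []
    for t in range(1, T):
        total = score.unsqueeze(1) + transitions + emissions[t]
        score, idx = total.max(dim=0)  # best previous state per next state
        back.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(back):
        best.append(int(idx[best[-1]]))
    return best[::-1]

# Hypothetical entity states: 0=nonexistent, 1=exists, 2=destroyed.
emissions = torch.tensor([[2.0, 0.1, 0.0],
                          [0.2, 2.0, 0.1],
                          [0.0, 0.5, 2.0]])
transitions = torch.tensor([[1.0, 1.0, -9.0],    # can't destroy the nonexistent
                            [-9.0, 1.0, 1.0],    # no return to nonexistent
                            [-9.0, -9.0, 1.0]])  # destroyed is absorbing
print(crf_viterbi(emissions, transitions))  # [0, 1, 2]
```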
In this work, we present a neural model for single-document summarization based on joint extraction and syntactic compression.
Sentence specificity quantifies the level of detail in a sentence, characterizing the organization of information in discourse.
A hallmark of variational autoencoders (VAEs) for text processing is their combination of powerful encoder-decoder models, such as LSTMs, with simple latent distributions, typically multivariate Gaussians.
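A minimal sketch of this Gaussian-latent setup with an LSTM encoder and the reparameterization trick; dimensions and the vocabulary size are placeholders, and the decoder is omitted:

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """The encoder predicts the mean and log-variance of a diagonal
    Gaussian latent; sampling uses the reparameterization trick."""

    def __init__(self, vocab_size=1000, hidden=128, latent=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, tokens):
        _, (h, _) = self.encoder(self.embed(tokens))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparam
        # KL between q(z|x) and the standard normal prior, per example.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(dim=-1)
        return z, kl

tokens = torch.randint(0, 1000, (2, 7))  # toy batch of token ids
z, kl = GaussianVAE()(tokens)
print(z.shape, kl.shape)  # torch.Size([2, 32]) torch.Size([2])
```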
One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data.
A key challenge in entity linking is making effective use of contextual information to disambiguate mentions that might refer to different entities in different contexts.
We present a discriminative model for single-document summarization that integrally combines compression and anaphoricity constraints.
We present a joint model of three core tasks in the entity analysis stack: coreference resolution (within-document clustering), named entity recognition (coarse semantic typing), and entity linking (matching to Wikipedia entities).