Large pretrained models have seen enormous success in extractive summarization tasks.
We examine the effect of domain-specific external knowledge variations on the performance of large-scale language models.
Many recalibration methods have been proposed in the literature for quantifying predictive uncertainty and calibrating model outputs, with varying degrees of complexity.
Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story).
In this paper, we explore the Navigation from Dialogue History (NDH) task, which is based on the Cooperative Vision-and-Dialog Navigation (CVDN) dataset, and present a state-of-the-art model built upon vision-language transformers.
In this work, we propose a continual few-shot learning (CFL) task, in which a system is challenged with a difficult phenomenon and asked to learn to correct mistakes with only a few (10 to 15) training examples.
In this work, we train a policy that learns the generation order for a pre-trained, undirected translation model via reinforcement learning.
In this paper, we discuss approaches to detecting when models have beliefs about the world, and we improve on methods for updating model beliefs to be more truthful, with a focus on methods based on learned optimizers or hypernetworks.
We propose an objective function, Expected Minimum Cost (EMC), based on two key ideas: (1) when presenting a set of options to a user, it is vital that there is at least one low-cost solution the user could adopt; (2) when we do not know the user's true cost function, we can approximately optimize for user satisfaction by first sampling plausible cost functions, then finding a set that achieves a good cost for the user in expectation.
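To make the two ideas concrete, here is a minimal, self-contained sketch of EMC-style set selection; the toy candidate options, the sampled cost functions, and the exhaustive subset search are illustrative stand-ins, not the paper's implementation:

```python
import itertools
import random

def expected_minimum_cost(option_set, cost_fns):
    # Average, over sampled cost functions, of the best (minimum)
    # cost the user could achieve within the presented set.
    return sum(min(c(o) for o in option_set) for c in cost_fns) / len(cost_fns)

def best_option_set(candidates, cost_fns, k):
    # Exhaustive search over size-k subsets; fine for small candidate pools.
    return min(itertools.combinations(candidates, k),
               key=lambda s: expected_minimum_cost(s, cost_fns))

# Toy setup: options are numbers, and each sampled "cost function" penalizes
# distance to a randomly drawn user preference (a stand-in for plausible costs).
random.seed(0)
candidates = [0.1, 0.4, 0.7, 0.9]
cost_fns = [lambda o, w=random.random(): abs(o - w) for _ in range(100)]
print(best_option_set(candidates, cost_fns, k=2))
```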
Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance.
Motivated by the failure of a Transformer model on the SCAN compositionality challenge (Lake and Baroni, 2018), which requires parsing a command into actions, we propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics, as additional training supervision.
We introduce iFacetSum, a web application for exploring topical document sets.
In this work, we propose flexible semi-automatic to fully automatic summary evaluation metrics, following the Pyramid human evaluation method.
Recent years have witnessed various types of generative models for natural language generation (NLG), especially RNN- or Transformer-based sequence-to-sequence models, as well as variational autoencoder (VAE) and generative adversarial network (GAN) based models.
To defend against machine-generated fake news, an effective mechanism is urgently needed.
Furthermore, we find that widely used automatic evaluation metrics (ROUGE, BERTScore) are weakly correlated with human judgments on this email thread summarization task.
The quantitative evaluation demonstrates that our backbone translation models achieve state-of-the-art translation performance and our quality estimation correlates well with both BLEU and human judgment.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query.
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations.
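As a hedged illustration of why the masking scheme matters, the sketch below contrasts i.i.d. token masking with masking contiguous blocks, so that a masked position cannot be trivially reconstructed from a near-identical neighbor; the block size and masking ratio here are arbitrary choices, not the paper's settings:

```python
import random

def uniform_mask(num_tokens, ratio):
    # Mask individual tokens i.i.d.; neighboring tokens often reveal the answer.
    return {i for i in range(num_tokens) if random.random() < ratio}

def block_mask(num_tokens, ratio, block=4):
    # Mask contiguous runs so a masked frame can't be copied from its neighbor.
    masked = set()
    while len(masked) < ratio * num_tokens:
        start = random.randrange(num_tokens)
        masked.update(range(start, min(start + block, num_tokens)))
    return masked

random.seed(0)
print(sorted(uniform_mask(16, 0.25)))
print(sorted(block_mask(16, 0.25)))
```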
NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets.
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
In order to jointly learn from all proof graphs and exploit the correlations between multiple proofs for a question, we pose this task as a set generation problem over structured output spaces where each proof is represented as a directed graph.
On several syntactic and semantic probing tasks, we demonstrate the emergent structural information in the role vectors and improved syntactic interpretability in the TPR layer outputs.
We also show improvements in a transfer-only setup on the DUC-2004 dataset.
In this paper, we study several under-explored dimensions of FI explanations, providing conceptual and empirical improvements for this form of explanation.
Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss, to encourage the model to automatically focus on the correct caption from a pool of candidate ASR captions.
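A minimal sketch of an entropy-style attention constraint (shapes and names are illustrative assumptions, not the paper's exact loss): minimizing the entropy of the attention over candidate ASR captions pushes the model to commit to a single caption.

```python
import torch
import torch.nn.functional as F

def caption_attention_entropy(scores):
    # scores: (batch, num_candidate_captions) unnormalized attention logits.
    # Minimizing the entropy of this distribution encourages the model to
    # concentrate its attention on one candidate caption.
    probs = F.softmax(scores, dim=-1)
    return -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()

scores = torch.randn(2, 5, requires_grad=True)
loss = caption_attention_entropy(scores)
loss.backward()  # gradients push the attention toward a peaked distribution
```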
In this paper, we develop an end-to-end evaluation framework for interactive summarization, focusing on expansion-based interaction, which considers the accumulating information along a user session.
Therefore, we also provide an exploration of evaluation metrics for the model, focused on aspects of the generated frames such as the presence/quality of generated characters, the relevance to captions, and the diversity of the generated images.
For knowledge integration to yield peak performance, it is critical to select a knowledge graph (KG) that is well-aligned with the given task's objective.
Using the observations and experimental results, we provide practical suggestions on how to create more reliable datasets for the unreliable news detection task.
One key challenge in this task is to ground instructions with the current visual information that the agent perceives.
We show that MC Dropout achieves decent performance without any distribution annotations, while Re-Calibration can give further substantial improvements when extra distribution annotations are provided, suggesting the value of multiple annotations per example in modeling the distribution of human judgments.
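For reference, MC Dropout itself is simple to sketch: keep dropout active at inference and average several stochastic forward passes to obtain an approximate predictive distribution. The toy model below is hypothetical; calling `model.train()` is safe here only because it contains no batch-norm layers.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, num_samples=30):
    # Keep dropout active at inference and average stochastic forward passes
    # to approximate a predictive distribution over labels.
    model.train()  # keeps nn.Dropout sampling; no batch-norm in this toy model
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(num_samples)])
    return probs.mean(dim=0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 3))
print(mc_dropout_predict(model, torch.randn(4, 8)))
```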
Recent commonsense-reasoning tasks are typically discriminative in nature, where a model answers a multiple-choice question for a certain context.
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
During the correctional-captioning task, models must generate descriptions of how to move from the current to target pose image, whereas in the retrieval task, models should select the correct target pose given the initial pose and correctional description.
Since these are sequence models, we must choose an ordering of the objects in the graphics programs for likelihood training.
The progress in Query-focused Multi-Document Summarization (QMDS) has been limited by the lack of sufficient large-scale, high-quality training datasets.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
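The core sparse-sampling idea can be sketched in a few lines; the tensor layout and clip counts below are assumptions for illustration, not ClipBERT's actual configuration:

```python
import torch

def sparse_sample_clips(video, num_clips=4, frames_per_clip=2):
    # video: (T, C, H, W). Pick a few short, evenly spaced clips instead of
    # densely extracting features from all T frames offline.
    T = video.shape[0]
    starts = torch.linspace(0, T - frames_per_clip, num_clips).long().tolist()
    return torch.stack([video[s:s + frames_per_clip] for s in starts])

clips = sparse_sample_clips(torch.randn(64, 3, 112, 112))
print(clips.shape)  # (num_clips, frames_per_clip, C, H, W)
```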
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models.
In order to carefully control important properties of the data and explanations, we introduce a synthetic dataset for experiments, and we also make use of three existing datasets with explanations: e-SNLI, TACRED, and SemEval.
Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems.
With the availability of the fast influence functions, we demonstrate their usefulness in four applications.
To quantify how well natural language understanding models can capture consistency in a general conversation, we introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues.
Given the increasingly prominent role NLP models (will) play in our lives, it is important for human expectations of model behavior to align with actual model behavior.
During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment.
Further, it is important to consider using a dynamic combination and curriculum of metric rewards that flexibly changes over time.
We introduce HoVer (HOppy VERification), a dataset for many-hop evidence extraction and fact verification.
Architecture search is the automatic process of designing the model or cell structure that is optimal for the given dataset or task.
Reasoning about conjuncts in conjunctive sentences is important for a deeper understanding of conjunctions in English and also how their usages and semantics differ from conjunctive and disjunctive boolean logic.
We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora.
To help save this endangered language, we introduce ChrEn, a Cherokee-English parallel dataset, to facilitate machine translation research between Cherokee and English.
We provide code for the experiments in this paper at https://github.com/peterbhase/LAS-NL-Explanations
Analysis reveals that: (1) high human disagreement exists in a noticeable amount of examples in these datasets; (2) the state-of-the-art models lack the ability to recover the distribution over human labels; (3) models achieve near-perfect accuracy on the subset of data with a high level of human agreement, whereas they can barely beat a random guess on the data with low levels of human agreement, which account for most of the common errors made by state-of-the-art models on the evaluation sets.
First, PROVER generates proofs with an accuracy of 87%, while retaining or improving performance on the QA task, compared to RuleTakers (up to 6% improvement on zero-shot evaluation).
Allowing users to interact with multi-document summarizers is a promising direction towards improving and customizing summary results.
Aligning sentences in a reference summary with their counterparts in source documents was shown as a useful auxiliary summarization task, notably for generating training data for salience detection.
Automatic fact checking is an important task motivated by the need for detecting and preventing the spread of misinformation across the web.
Moreover, our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass more relevant information to the classifier.
Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph.
While deep learning models are making fast progress on the task of Natural Language Inference, recent studies have also shown that these models achieve high accuracy by exploiting several dataset biases, and without deep understanding of the language semantics.
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
Through two kinds of simulation tests involving text and tabular data, we evaluate five explanation methods: (1) LIME, (2) Anchor, (3) Decision Boundary, (4) a Prototype model, and (5) a Composite approach that combines explanations from each method.
For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve.
In this work, we present several effective adversaries and automated data augmentation policy search methods with the goal of making reading comprehension models more robust to adversarial evaluation, but also improving generalization to the source domain as well as new domains and languages.
The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth analysis of the dataset and the methods built on top of it.
In our work, we build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
Next, we develop a DistanceNet model which uses these distance measures, or a mixture of these distance measures, as an additional loss function to be minimized jointly with the task's loss function, so as to achieve better unsupervised domain adaptation.
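As one concrete possibility, the sketch below adds a simple linear-kernel MMD between source and target batch features to the task loss; the paper studies several distance measures and mixtures thereof, so treat this as a minimal stand-in:

```python
import torch

def mmd_linear(source_feats, target_feats):
    # Linear-kernel MMD: squared distance between batch feature means,
    # one of the simpler differentiable domain-distance measures.
    delta = source_feats.mean(dim=0) - target_feats.mean(dim=0)
    return (delta * delta).sum()

def joint_loss(task_loss, source_feats, target_feats, lam=0.1):
    # Task loss plus a weighted domain-distance penalty, minimized jointly.
    return task_loss + lam * mmd_linear(source_feats, target_feats)

src, tgt = torch.randn(32, 64), torch.randn(32, 64)
print(joint_loss(torch.tensor(1.5), src, tgt))
```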
We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure.
Automatic data augmentation (AutoAugment) (Cubuk et al., 2019) searches for optimal perturbation policies via a controller trained using performance rewards of a sampled policy on the target task, hence reducing data-level model bias.
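A heavily simplified sketch of the search loop follows: random search with a fake proxy reward, rather than the trained RNN controller with REINFORCE used in AutoAugment; the perturbation names are illustrative.

```python
import random

PERTURBATIONS = ["synonym_swap", "word_dropout", "back_translate"]  # illustrative ops

def sample_policy():
    # A policy here is just one (perturbation, application probability) pair.
    return random.choice(PERTURBATIONS), round(random.uniform(0.1, 0.5), 2)

def proxy_reward(policy):
    # Stand-in for the real reward: dev-set performance of a model trained
    # on data augmented under this policy.
    op, p = policy
    return random.random() * p

random.seed(0)
candidates = [sample_policy() for _ in range(50)]
print("best policy:", max(candidates, key=proxy_reward))
```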
In this work, we give general guidelines on system design for MRS by proposing a simple yet effective pipeline system with special consideration on hierarchical semantic retrieval at both paragraph and sentence level, and their potential effects on the downstream task.
Second, since the traditional evaluation metrics (e.g., BLEU) often fall short in evaluating the quality of generated questions, we propose a QA-based evaluation method which measures the QG model's ability to mimic human annotators in generating QA training data.
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
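A compact, hypothetical rendering of that three-encoder layout is sketched below; the dimensions, depths, and cross-attention details are illustrative only, and the released LXMERT code is the authoritative architecture:

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.lang_attends_vis = nn.MultiheadAttention(d, heads, batch_first=True)
        self.vis_attends_lang = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, lang, vis):
        lang2, _ = self.lang_attends_vis(lang, vis, vis)  # language queries vision
        vis2, _ = self.vis_attends_lang(vis, lang, lang)  # vision queries language
        return lang + lang2, vis + vis2

class ThreeEncoderSketch(nn.Module):
    # Language encoder + object-relationship (visual) encoder + cross-modality encoder.
    def __init__(self, d=256):
        super().__init__()
        make_enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.lang_enc, self.vis_enc = make_enc(), make_enc()
        self.cross = CrossModalityLayer(d)

    def forward(self, lang_tokens, vis_feats):
        return self.cross(self.lang_enc(lang_tokens), self.vis_enc(vis_feats))

model = ThreeEncoderSketch()
lang_out, vis_out = model(torch.randn(2, 10, 256), torch.randn(2, 36, 256))
```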
To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.
After adversarial training, the baseline's performance improves but is still limited on the adversarial evaluation.
These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering.
Multi-hop reading comprehension requires the model to explore and connect relevant information from multiple sentences/documents in order to answer the question about the context.
We present PaperRobot, which performs as an automatic research assistant by (1) conducting deep understanding of a large collection of human-written papers in a target domain and constructing comprehensive background knowledge graphs (KGs); (2) creating new ideas by predicting links from the background KGs, by combining graph attention and contextual text attention; (3) incrementally writing some key elements of a new paper based on memory-attention networks: from the input title along with predicted related entities to generate a paper abstract, from the abstract to generate conclusion and future work, and finally from future work to generate a title for a follow-on paper.
Our results show the feasibility of a robot learning commonsense knowledge automatically from web-based textual corpora, and the power of learned commonsense reasoning models in enabling a robot to autonomously perform tasks based on incomplete natural language instructions.
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
Conducting a manual evaluation is considered an essential part of summary evaluation methodology.
To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module.
Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.
To address these issues, we present AutoSeM, a two-stage MTL pipeline, where the first stage automatically selects the most useful auxiliary tasks via a Beta-Bernoulli multi-armed bandit with Thompson Sampling, and the second stage learns the training mixing ratio of these selected auxiliary tasks via a Gaussian Process based Bayesian optimization framework.
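The first stage's bandit is straightforward to sketch; the binary "did dev performance improve" reward and everything else below are simplifications of the paper's setup:

```python
import random

class BetaBernoulliBandit:
    # Thompson Sampling over candidate auxiliary tasks: sample a utility from
    # each task's Beta posterior, train briefly with the argmax task, then
    # update that posterior with a binary improvement signal.

    def __init__(self, num_tasks):
        self.alpha = [1.0] * num_tasks  # pseudo-counts of successes
        self.beta = [1.0] * num_tasks   # pseudo-counts of failures

    def select(self):
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, task, improved):
        if improved:
            self.alpha[task] += 1
        else:
            self.beta[task] += 1

random.seed(0)
bandit = BetaBernoulliBandit(num_tasks=4)
for _ in range(20):
    t = bandit.select()
    # Hypothetical per-task improvement rates standing in for real dev results.
    bandit.update(t, improved=random.random() < [0.2, 0.7, 0.4, 0.5][t])
print(bandit.alpha, bandit.beta)
```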
Therefore, we propose a compositionality-sensitivity testing setup that analyzes models on natural examples from existing datasets that cannot be solved via lexical features alone (i.e., on which a bag-of-words model gives a high probability to one wrong label), hence revealing the models' actual compositionality awareness.
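A hedged sketch of such a filter with scikit-learn: keep only the examples where a bag-of-words classifier puts high probability on a wrong label. In practice one would score held-out predictions (e.g., via cross-validation) rather than fitting and scoring on the same data, and the threshold here is arbitrary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def compositionality_sensitive_subset(texts, labels, threshold=0.6):
    # Keep examples a bag-of-words model gets confidently wrong: these
    # cannot be solved by lexical features alone.
    X = CountVectorizer().fit_transform(texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    probs = clf.predict_proba(X)
    classes = list(clf.classes_)
    keep = []
    for i, y in enumerate(labels):
        wrong_probs = np.delete(probs[i], classes.index(y))
        if wrong_probs.max() > threshold:
            keep.append(i)
    return keep

# usage: hard_idx = compositionality_sensitive_subset(train_texts, train_labels)
```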
The increasing concern with misinformation has stimulated research efforts on automatic fact checking.
We develop an approach that uses video meta-data to retrieve topically related news documents for a video and extracts the events and named entities from these documents.
We instead focus on a more challenging multi-hop generative task (NarrativeQA), which requires the model to reason, gather, and synthesize disjoint pieces of information within the context to generate an answer.
A good neural sequence-to-sequence summarization model should have a strong encoder that can distill and memorize the important information from long input texts so that the decoder can generate salient summaries based on the encoder's memory.
We present two categories of model-agnostic adversarial strategies that reveal the weaknesses of several generative, task-oriented dialogue models: Should-Not-Change strategies that evaluate over-sensitivity to small and semantics-preserving edits, as well as Should-Change strategies that test if a model is over-stable against subtle yet semantics-changing modifications.
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.
In this work, we first present a strong pointer-copy mechanism based sequence-to-sequence sentence simplification model, and then improve its entailment and paraphrasing capabilities via multi-task learning with related auxiliary tasks of entailment and paraphrase generation.
Inspired by how humans summarize long documents, we propose an accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively (i.e., compresses and paraphrases) to generate a concise overall summary.
An accurate abstractive summary of a document should contain all its salient information and should be logically entailed by the input document.
Visual reasoning with compositional natural language instructions, e.g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model needs to have the ability to create an accurate mapping between the diverse phrases and the several objects placed in complex arrangements in the image.
It is shown that many published models for the Stanford Question Answering Dataset (Rajpurkar et al., 2016) lack robustness, suffering an over 50% decrease in F1 score during adversarial evaluation based on the AddSent (Jia and Liang, 2017) algorithm.
More importantly, we next interpret what these neural models have learned about the linguistic characteristics of AD patients, via analysis based on activation clustering and first-derivative saliency techniques.
Abstractive text summarization is the task of compressing and rewriting a long document into a short summary while maintaining saliency, directed logical entailment, and non-redundancy.
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Abstractive summarization, the task of rewriting and compressing a document into a short summary, has achieved considerable success with neural sequence-to-sequence models.
For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story.
Sequence-to-sequence models have shown promising improvements on the temporal task of video captioning, but they optimize word-level cross-entropy loss during training.
We present a simple sequential sentence encoder for multi-domain natural language inference.
Models that can execute natural language instructions for situated robotic tasks such as assembly and navigation have several useful applications in homes, offices, and remote scenarios.
We generate a factor graph from natural language instructions called the Dynamic Grounding Graph (DGG), which takes latent parameters into account.
In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses.
Video captioning, the task of describing the content of a video, has seen some promising improvements in recent years with sequence-to-sequence models, but accurately learning the temporal and logical dynamics involved in the task still remains a challenge, especially given the lack of sufficient annotated data.
The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.
We first decide which information to share with the user according to their preferences, using a policy trained from human demonstrations via inverse reinforcement learning.
Understanding, predicting, and generating object motions and transformations is a core problem in artificial intelligence.
We have constructed a new "Who-did-What" dataset of over 200,000 fill-in-the-gap (cloze) multiple-choice reading comprehension problems drawn from the LDC English Gigaword newswire corpus.
We present Charagram embeddings, a simple approach for learning character-based compositional models to embed textual sequences.
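A self-contained sketch of the idea follows; the real model learns embeddings for a selected n-gram vocabulary, so hashing n-grams into a fixed random table here is just to keep the example runnable:

```python
import numpy as np

def char_ngrams(text, n_values=(2, 3, 4)):
    # Character n-grams of the padded string; '#' marks sequence boundaries.
    padded = "#" + text + "#"
    return [padded[i:i + n] for n in n_values for i in range(len(padded) - n + 1)]

def charagram_embed(text, table):
    # Sum the embeddings of all character n-grams, then apply a nonlinearity.
    vec = np.zeros(table.shape[1])
    for gram in char_ngrams(text):
        vec += table[hash(gram) % table.shape[0]]
    return np.tanh(vec)

rng = np.random.default_rng(0)
table = rng.normal(scale=0.1, size=(10000, 50))  # stand-in for learned n-gram embeddings
print(charagram_embed("hello world", table)[:5])
```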
Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication.
We introduce the novel problem of determining the relevance of questions to images in VQA.
We provide the first extensive evaluation of how using different types of context to learn skip-gram word embeddings affects performance on a wide range of intrinsic and extrinsic NLP tasks.
We present a novel end-to-end neural model to extract entities and relations between them.
We collect two datasets of abstract scenes that facilitate the study of humor at both the scene-level and the object-level.
We again find that the word averaging models perform well for sentence similarity and entailment, outperforming LSTMs.
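For concreteness, the word-averaging baseline is just the following (random vectors stand in for trained embeddings); note that sentences with the same bag of words get identical representations, which the demo makes visible:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in "a the dog cat chases".split()}

def avg_embed(sentence, dim=50):
    # Sentence representation = mean of its word vectors; unknown words skipped.
    words = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Same bag of words => identical representation (the model ignores word order).
print(cosine(avg_embed("the dog chases a cat"), avg_embed("a cat chases the dog")))
```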
In this paper, we present a multimodal learning framework that incorporates both visual and linguistic information to estimate the structure and parameters that define kinematic models of articulated objects.
We consider the supervised training setting in which we learn task-specific word embeddings.
We propose an end-to-end, domain-independent neural encoder-aligner-decoder model for selective generation, i.e., the joint task of content selection and surface realization.
We propose a neural sequence-to-sequence model for direction following, a task that is essential to realizing effective autonomous agents.
The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive semantic resource, consisting of a list of phrase pairs with (heuristic) confidence estimates.
We develop novel first- and second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts of parsed sentences from scanned books.
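A minimal sketch of turning corpus subtree counts into parser features; the log-binning scheme and the feature string format are illustrative assumptions, not the paper's feature templates:

```python
import math

def count_bin(count):
    # Log-bin raw corpus counts so the parser sees coarse frequency levels.
    return 0 if count == 0 else min(int(math.log2(count)) + 1, 10)

def arc_feature(head, modifier, counts):
    # First-order feature: binned corpus count of the head->modifier arc.
    return f"sn_count_bin={count_bin(counts.get((head, modifier), 0))}"

# Hypothetical subtree counts extracted from the Syntactic Ngrams corpus.
counts = {("saw", "dog"): 1520, ("saw", "telescope"): 88}
print(arc_feature("saw", "dog", counts), arc_feature("saw", "cat", counts))
```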
Word sense induction (WSI) seeks to automatically discover the senses of a word in a corpus via unsupervised methods.
In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing.