We show that both transfer learning methods combined achieve the highest ROUGE scores.
We construct and present a new multimodal dataset of software instructional livestreams, manually annotated for both detailed and abstract procedural intent, enabling the training and evaluation of joint video and text understanding models.
In this paper, we propose an evaluation metric for image captioning systems using both image and text information.
Then, the attention weights of each modality are applied directly to the other modality in a crossed manner, so that the CAN aggregates audio and text information from the same time steps based on each modality.
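A minimal sketch of the crossed-attention idea described above, assuming time-aligned audio and text feature sequences; the scoring function, dimensions, and names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossed_attention(audio, text):
    """Apply each modality's attention weights to the *other* modality.

    audio, text: (T, d) feature sequences aligned on the same time steps.
    Returns cross-attended audio and text representations.
    """
    d = audio.shape[-1]
    # Attention weights computed within each modality (scaled dot-product, illustrative)
    audio_weights = softmax(audio @ audio.T / np.sqrt(d))
    text_weights = softmax(text @ text.T / np.sqrt(d))
    # Applied in a crossed way: audio weights attend over text, and vice versa
    attended_text = audio_weights @ text
    attended_audio = text_weights @ audio
    return attended_audio, attended_text

T, d = 5, 8
rng = np.random.default_rng(0)
a, t = rng.normal(size=(T, d)), rng.normal(size=(T, d))
att_a, att_t = crossed_attention(a, t)
print(att_a.shape, att_t.shape)  # (5, 8) (5, 8)
```

Because both sequences share the same time axis, each modality's weights select the corresponding time steps in the other modality.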
However, the progress of learning contextualized phrase embeddings is hindered by the lack of a human-annotated, phrase-in-context benchmark.
Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to compute multimodal similarity and use it as a reward function.
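The reward described above reduces to an image-caption similarity score in CLIP's joint embedding space. A minimal sketch, with random vectors standing in for real CLIP embeddings (in practice these would come from a CLIP image and text encoder):

```python
import numpy as np

def clip_style_reward(image_emb, caption_emb):
    """Cosine similarity between an image embedding and a caption
    embedding, used as a scalar reward signal for caption generation.
    A sketch: real embeddings would come from CLIP's encoders."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    return float(image_emb @ caption_emb)

rng = np.random.default_rng(1)
img, cap = rng.normal(size=512), rng.normal(size=512)  # CLIP ViT-B/32 uses 512-d embeddings
r = clip_style_reward(img, cap)
print(-1.0 <= r <= 1.0)  # True
```

Captions that better match the image content receive higher similarity, so maximizing this reward pushes the generator toward more image-grounded, distinctive captions.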
In this paper, we propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
This study investigates how fake news uses thumbnails, focusing on whether a news article's thumbnail correctly represents the article's content.
To our knowledge, this is the first dataset that provides conversational image search and editing annotations, where the agent holds a grounded conversation with users and helps them to search and edit images according to their requests.
Acronym extraction is the task of identifying acronyms and their expanded forms in text, which is necessary for various NLP applications.
Recent named entity recognition (NER) models often rely on human-annotated datasets, which require extensive professional knowledge of the target domain and entities.
In this work, we focus on a more challenging few-shot intent detection scenario where many intents are fine-grained and semantically similar.
Users of medical question answering systems often submit long and detailed questions, making it hard to achieve high recall in answer retrieval.
Also, we observe critical problems with the previous benchmark dataset (i.e., human annotations) for image captioning metrics, and introduce a new collection of human annotations on the generated captions.
Applying generative adversarial networks (GANs) to text-related tasks is challenging due to the discrete nature of language.
To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets.
Even though BERT achieves successful performance improvements on various supervised learning tasks, applying BERT to unsupervised tasks remains limited by the repetitive inference required to compute contextual language representations.
In this study, we develop a novel graph-based framework for adverse drug reaction (ADR) signal detection using healthcare claims data.
Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response for a question with a given scene, video, audio, and the history of previous turns in the dialog.
In digital environments where substantial amounts of information are shared online, news headlines play essential roles in the selection and diffusion of news articles.
In this work, we explore the impact of visual modality in addition to speech and text for improving the accuracy of the emotion detection system.
In this study, we propose a novel graph neural network called propagate-selector (PS), which propagates information over sentences to understand information that cannot be inferred when considering sentences in isolation.
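The propagation idea above can be sketched as simple message passing over a sentence graph, so each sentence representation absorbs information from connected sentences; this is an illustrative sketch of propagation in general, not the PS architecture:

```python
import numpy as np

def propagate(sentence_embs, adj, steps=2):
    """Propagate information across connected sentences so that each
    representation reflects context that cannot be inferred from the
    sentence in isolation.

    sentence_embs: (N, d) sentence embeddings.
    adj: (N, N) adjacency matrix over sentences.
    """
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1)  # row-normalize neighbor weights
    h = sentence_embs
    for _ in range(steps):
        h = 0.5 * h + 0.5 * (norm_adj @ h)  # mix self and neighbor information
    return h

embs = np.eye(3)  # three toy one-hot "sentence" embeddings
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)  # a chain: s0 - s1 - s2
out = propagate(embs, adj)
print(out.shape)  # (3, 3)
```

After two propagation steps, the representation of the middle sentence mixes in features from both of its neighbors.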
While deep learning techniques have shown promising results in many natural language processing (NLP) tasks, they have not been widely applied to the clinical domain.
In this paper, we propose a novel method for the sentence-level answer-selection task, a fundamental problem in natural language processing.
Ranked #2 on Question Answering on TrecQA
As opposed to using knowledge from both the modalities separately, we propose a framework to exploit acoustic information in tandem with lexical data.
Some news headlines mislead readers with overrated or false information, and identifying them in advance will better assist readers in choosing proper news stories to consume.
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers.
In this paper, we propose a novel end-to-end neural architecture for ranking candidate answers that adapts a hierarchical recurrent neural network and a latent topic clustering module.
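The latent topic clustering component can be sketched as softly assigning an encoded answer to a set of learned topic vectors and appending the resulting topic summary to its representation as extra ranking information; all names and dimensions here are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topic_augmented_repr(answer_emb, topic_memory):
    """Latent topic clustering (sketch): softly assign an encoded answer
    to learned topic vectors and concatenate the weighted topic summary
    onto the answer representation before scoring.

    answer_emb: (d,) encoded answer (e.g., from a recurrent encoder).
    topic_memory: (K, d) learned latent topic vectors.
    """
    weights = softmax(topic_memory @ answer_emb)  # (K,) soft topic assignment
    topic_summary = weights @ topic_memory        # (d,) weighted topic mixture
    return np.concatenate([answer_emb, topic_summary])

rng = np.random.default_rng(2)
ans = rng.normal(size=16)
topics = rng.normal(size=(4, 16))  # 4 hypothetical latent topics
rep = topic_augmented_repr(ans, topics)
print(rep.shape)  # (32,)
```

The augmented representation lets the ranker exploit topic-level similarity between a question and its candidate answers in addition to the raw encodings.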
Ranked #1 on Answer Selection on Ubuntu Dialogue (v2, Ranking)
In this paper, we propose an efficient transfer learning method for training a personalized language model using a recurrent neural network with a long short-term memory architecture.