In this paper, we propose an evaluation metric for image captioning systems using both image and text information.
Our method directly optimizes CKA to align video and text embedding representations, which helps the cross-modality attention module combine information across modalities.
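As a rough illustration of the alignment score mentioned above, the following is a minimal sketch of CKA (centered kernel alignment) in its commonly used linear form. The abstract does not say which kernel or training objective the authors use, so the linear variant, the function name `linear_cka`, and the random `video` and `text` matrices are all assumptions made for illustration only.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_samples, dim) embedding matrices.

    Hypothetical helper: the source does not specify the CKA variant.
    """
    # Center every feature column so that the implied Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Placeholder embeddings (random data, not results from the paper).
rng = np.random.default_rng(0)
video = rng.normal(size=(32, 16))  # 32 samples, 16-dim "video" features
text = rng.normal(size=(32, 8))    # 32 samples, 8-dim "text" features
score = linear_cka(video, text)
```

The score lies in [0, 1] and equals 1 when one representation is an orthogonal transform and isotropic rescaling of the other, which is why it can serve as an alignment objective between modalities with different embedding sizes.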
In this paper, we mainly discuss our submission to the MultiDoc2Dial task, which aims to model goal-oriented dialogues grounded in multiple documents.
Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic relations or spatial relations.
Context-aware neural machine translation (NMT) incorporates contextual information from the surrounding text, which can improve the translation quality of document-level machine translation.
We first present CAMBIGNQ, a dataset consisting of 5,654 ambiguous questions, each with relevant passages, possible answers, and a clarification question.
Our method is target-language-agnostic and applicable to already trained multilingual machine translation models through post-fine-tuning.
In task-oriented dialogue (TOD) systems, detecting and inducing new intents are two main challenges in applying the system in the real world.
Vulnerability to lexical perturbation is a critical weakness of automatic evaluation metrics for image captioning.
Also, the objective function of NF makes the model use the variance information and the text in a disentangled manner, resulting in more precise variance control.
In this work, we propose a novel critic decoding method for controlled language generation (CriticControl) that combines the strengths of reinforcement learning and weighted decoding.
Then, the attention weights of each modality are applied directly to the other modality in a crossed way, so that the CAN gathers the audio and text information from the same time steps based on each modality.
However, these language models utilize an unnecessarily large number of model parameters, even when used only for a specific task.
To this end, the latest approach is to train a factual consistency classifier on factually consistent and inconsistent summaries.
In this paper, we propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
Specifically, we employ a two-stage augmentation pipeline to generate new claims and evidence from existing samples.
With the rapid advancement in deep generative models, recent neural text-to-speech models have succeeded in synthesizing human-like speech, even in an end-to-end manner.
We experimented with our method on common context-aware NMT models and two document-level translation tasks.
Source-free domain adaptation is an emerging line of work in deep learning research since it is closely related to the real-world environment.
Also, we observe critical problems with the previous benchmark dataset (i.e., human annotations) for image captioning metrics, and introduce a new collection of human annotations on the generated captions.
Logical reasoning tasks over symbols, such as learning arithmetic operations and computer program evaluations, have become challenges to deep learning.
Although early text-to-speech (TTS) models such as Tacotron 2 have succeeded in generating human-like speech, their autoregressive (AR) architectures are limited in that they require a long time to generate a mel-spectrogram consisting of hundreds of steps.
Applying generative adversarial networks (GANs) to text-related tasks is challenging due to the discrete nature of language.
To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets.
Even though BERT achieves substantial performance improvements in various supervised learning tasks, applying BERT to unsupervised tasks is still limited in that it requires repetitive inference to compute contextual language representations.
Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response for a question with a given scene, video, audio, and the history of previous turns in the dialog.
In this study, we develop a novel graph-based framework for ADR signal detection using healthcare claims data.
In digital environments where substantial amounts of information are shared online, news headlines play essential roles in the selection and diffusion of news articles.
In this work, we explore the impact of the visual modality, in addition to speech and text, on improving the accuracy of the emotion detection system.
In this study, we propose a novel graph neural network called propagate-selector (PS), which propagates information over sentences to understand information that cannot be inferred when considering sentences in isolation.
While deep learning techniques have shown promising results in many natural language processing (NLP) tasks, they have not been widely applied to the clinical domain.
This paper describes our system for SemEval-2019 Task 3: EmoContext, which aims to predict the emotion of the third utterance considering two preceding utterances in a dialogue.
In this paper, we propose a novel method for a sentence-level answer-selection task that is a fundamental problem in natural language processing.
Ranked #7 on Question Answering on TrecQA
Recent studies have tried to use bidirectional LMs (biLMs) instead of conventional unidirectional LMs (uniLMs) for rescoring the $N$-best list decoded from the acoustic model.
As opposed to using knowledge from the two modalities separately, we propose a framework to exploit acoustic information in tandem with lexical data.
Some news headlines mislead readers with overrated or false information, and identifying them in advance will better assist readers in choosing proper news stories to consume.
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers.
Previous NQG models suffer from the problem that a significant proportion of the generated questions include words in the question target, resulting in the generation of unintended questions.
We define the complexity and difficulty of a number sequence prediction task with the structure of the smallest automaton that can generate the sequence.
In this paper, we propose an attention-based classifier that predicts multiple emotions of a given sentence.
In this paper, we propose a novel end-to-end neural architecture for ranking candidate answers that adapts a hierarchical recurrent neural network and a latent topic clustering module.
Ranked #1 on Answer Selection on Ubuntu Dialogue (v1, Ranking)
In this paper, we propose an efficient transfer learning method for training a personalized language model using a recurrent neural network with long short-term memory architecture.
This paper presents a novel meta algorithm, Partition-Merge (PM), which takes existing centralized algorithms for graph computation and makes them distributed and faster.
However, for many computer vision problems, the MAP solution under the model is not the ground truth solution.
We show how this constrained discrete optimization problem can be formulated as a multi-dimensional parametric mincut problem via its Lagrangian dual, and prove that our algorithm isolates all constraint instances for which the problem can be solved exactly.
We present a new local approximation algorithm for computing the MAP and log-partition function for an arbitrary exponential family distribution represented by a finite-valued pair-wise Markov random field (MRF), say G. Our algorithm is based on decomposing G into appropriately chosen small components, computing estimates locally in each of these components, and then producing a good global solution.