Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic or spatial relations.
In this paper, we describe our submission to the MultiDoc2Dial task, which aims to model goal-oriented dialogues grounded in multiple documents.
In this paper, we propose an evaluation metric for image captioning systems that uses both image and text information.
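For context, one common way to combine image and text information into a single caption score is an image-text embedding similarity. The sketch below uses CLIP from Hugging Face Transformers purely as an illustration; the model choice and scoring are assumptions, not the proposed metric.

```python
# Minimal sketch (assumption, not the proposed metric): scoring a caption
# against its image with an image-text embedding similarity, here CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_score(image_path: str, caption: str) -> float:
    """Return a scaled cosine similarity between the image and the caption."""
    image = Image.open(image_path)
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the image-text similarity for each (image, text) pair.
    return outputs.logits_per_image[0, 0].item()
```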
However, these language models inevitably rely on an unnecessarily large number of parameters, even when they are used for only a specific task.
To this end, the latest approach is to train a factual consistency classifier on factually consistent and inconsistent summaries.
In this paper, we propose RFEC, an efficient factual error correction system based on an entity retrieval post-editing process.
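To make the baseline classifier approach mentioned above concrete, here is a minimal sketch that scores a (document, summary) pair with a binary sequence-pair classifier. The backbone name and label convention are assumptions for illustration, and such a model would still need to be fine-tuned on factually consistent and inconsistent summaries.

```python
# Minimal sketch (assumption, not the paper's code): scoring a summary for
# factual consistency with a binary sequence-pair classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-base"  # hypothetical backbone; any encoder works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def consistency_score(document: str, summary: str) -> float:
    """Return P(consistent) for a (document, summary) pair."""
    inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Label convention (0 = inconsistent, 1 = consistent) is an assumption.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```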
Specifically, we employ a two-stage augmentation pipeline to generate new claims and evidence from existing samples.
Also, we identify critical problems with the previous benchmark dataset (i.e., human annotations) for image captioning metrics, and introduce a new collection of human annotations on generated captions.
To evaluate our metric, we collect high-quality human judgments of correctness on two GenQA datasets.
Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response to a question given a scene, video, audio, and the history of previous turns in the dialog.
In this work, we explore the impact of the visual modality, in addition to speech and text, on improving the accuracy of an emotion detection system.
Previous NQG models suffer from the problem that a significant proportion of generated questions include words from the question target, resulting in unintended questions.
In this paper, we propose an attention-based classifier that predicts multiple emotions for a given sentence.
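To illustrate the general idea, here is a minimal sketch of one way an attention-based multi-label emotion classifier could look; the architecture, embedding size, and number of emotions are assumptions, not the paper's model.

```python
# Minimal sketch (assumption, not the paper's architecture): an attention-based
# classifier that predicts multiple emotions for a sentence via sigmoid outputs.
import torch
import torch.nn as nn

class AttentionEmotionClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, num_emotions: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(embed_dim, 1)       # token-level attention scores
        self.out = nn.Linear(embed_dim, num_emotions)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        h = self.embed(token_ids)                     # (batch, seq_len, embed_dim)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over tokens
        sentence = (weights * h).sum(dim=1)           # attention-weighted pooling
        return torch.sigmoid(self.out(sentence))      # independent emotion probabilities

# Usage: probabilities above a threshold indicate the predicted emotions.
model = AttentionEmotionClassifier(vocab_size=10000)
probs = model(torch.randint(0, 10000, (2, 12)))       # shape: (batch=2, num_emotions=6)
```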