1 code implementation • 15 Dec 2023 • Sunjae Yoon, Dahyun Kim, Eunseop Yoon, Hee Suk Yoon, Junyeong Kim, Chang D. Yoo
Video-grounded Dialogue (VGD) aims to answer questions regarding a given multi-modal input comprising video, audio, and dialogue history.
no code implementations • 12 Dec 2022 • Sunjae Yoon, Eunseop Yoon, Hee Suk Yoon, Junyeong Kim, Chang D. Yoo
Despite the recent success of multi-modal reasoning to generate answer sentences, existing dialogue systems still suffer from a text hallucination problem, which denotes indiscriminate text-copying from input texts without an understanding of the question.
1 code implementation • 17 Oct 2022 • Sunjae Yoon, Ji Woo Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Chang D. Yoo
Video moment retrieval (VMR) aims to localize target moments in untrimmed videos pertinent to a given textual query.
1 code implementation • 17 Sep 2022 • Thang Vu, Kookhoi Kim, Tung M. Luu, Thanh Nguyen, Junyeong Kim, Chang D. Yoo
Furthermore, SoftGroup can be extended to perform object detection and panoptic segmentation with nontrivial improvements over existing methods.
no code implementations • 24 Mar 2021 • Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo
A video-grounded dialogue system referred to as the Structured Co-reference Graph Attention (SCGA) is presented for decoding the answer sequence to a question regarding a given video while keeping track of the dialogue context.
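As an illustration of the graph-attention idea the abstract describes, below is a minimal sketch of attention restricted to co-reference edges between dialogue turns. The class name, layer shapes, and adjacency convention are illustrative assumptions, not SCGA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorefGraphAttention(nn.Module):
    """Attention over dialogue-turn nodes, masked by co-reference edges."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (N, dim) per-turn embeddings; adj: (N, N) 0/1 co-reference mask,
        # assumed to include self-loops so every row attends to at least itself
        scores = self.q(nodes) @ self.k(nodes).T / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.v(nodes)  # (N, dim) context-aware nodes
```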
1 code implementation • ECCV 2020 • Minuk Ma, Sunjae Yoon, Junyeong Kim, Young-Joon Lee, Sunghun Kang, Chang D. Yoo
This paper explores methods for performing VMR in a weakly-supervised manner (wVMR): training is performed without temporal moment labels, using only the text query that describes a segment of the video.
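One common way to train without temporal labels is a multiple-instance ranking objective over matched versus mismatched video-query pairs. The sketch below shows that generic signal, not necessarily the loss used in this paper; the function name and margin value are assumptions.

```python
import torch.nn.functional as F

def wvmr_ranking_loss(pos_scores, neg_scores, margin=0.5):
    # pos_scores: (B, M) moment-query similarities for the matching video;
    # neg_scores: (B, M) for a mismatched video sampled from the batch
    pos = pos_scores.max(dim=1).values   # best-scoring moment per matching pair
    neg = neg_scores.max(dim=1).values   # best-scoring moment per mismatched pair
    # push the best matching moment above the best mismatched one by a margin
    return F.relu(margin + neg - pos).mean()
```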
no code implementations • CVPR 2020 • Junyeong Kim, Minuk Ma, Trung Pham, Kyung-Su Kim, Chang D. Yoo
To this end, MSAN is based on (1) a moment proposal network (MPN) that attempts to locate the most appropriate temporal moment from each of the modalities, and (2) a heterogeneous reasoning network (HRN) that predicts the answer using an attention mechanism over both modalities.
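A minimal sketch of that two-stage structure follows, assuming attention pooling in the MPN and a learned gate in the HRN; the module names mirror the abstract, but the internals are illustrative guesses rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentProposal(nn.Module):            # MPN: scores temporal moments
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):               # feats: (T, dim), one modality
        w = F.softmax(self.score(feats), dim=0)  # attention over temporal moments
        return (w * feats).sum(dim=0)            # (dim,) pooled moment feature

class HeterogeneousReasoning(nn.Module):    # HRN: attends over both modalities
    def __init__(self, dim, n_answers):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)
        self.classify = nn.Linear(dim, n_answers)

    def forward(self, vid_moment, sub_moment):   # each: (dim,)
        a = F.softmax(self.gate(torch.cat([vid_moment, sub_moment], -1)), -1)
        fused = a[0] * vid_moment + a[1] * sub_moment  # modality-weighted fusion
        return self.classify(fused)                    # (n_answers,) logits
```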
no code implementations • 28 May 2019 • Junyeong Kim, Minuk Ma, Kyung-Su Kim, Sungjin Kim, Chang D. Yoo
This paper proposes a method to gain extra supervision via multi-task learning for multi-modal video question answering.
no code implementations • CVPR 2019 • Junyeong Kim, Minuk Ma, Kyung-Su Kim, Sungjin Kim, Chang D. Yoo
To overcome these challenges, PAMN involves three main features: (1) progressive attention mechanism that utilizes cues from both question and answer to progressively prune out irrelevant temporal parts in memory, (2) dynamic modality fusion that adaptively determines the contribution of each modality for answering the current question, and (3) belief correction answering scheme that successively corrects the prediction score on each candidate answer.
Ranked #2 on Video Story QA on MovieQA
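The three PAMN ingredients above can be sketched compactly; everything below (layer shapes, mean pooling, the additive belief update) is an illustrative assumption consistent with the abstract, not the released model.

```python
import torch
import torch.nn.functional as F

def progressive_attention(memory, q_cue, a_cue):
    # memory: (T, d); prune irrelevant temporal parts with the question cue,
    # then the answer cue (each cue: (d,))
    for cue in (q_cue, a_cue):
        w = F.softmax(memory @ cue, dim=0)     # (T,) relevance weights
        memory = memory * w.unsqueeze(-1)      # progressively re-weight memory
    return memory

def dynamic_fusion(video_mem, text_mem, q_cue):
    # adaptively weight each modality's pooled evidence by the question
    v, t = video_mem.mean(0), text_mem.mean(0)       # (d,) per modality
    a = F.softmax(torch.stack([v @ q_cue, t @ q_cue]), dim=0)
    return a[0] * v + a[1] * t

def belief_correction(scores, evidence_scores):
    # successively shift the per-answer belief with new evidence
    return F.softmax(scores + evidence_scores, dim=0)
```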
no code implementations • ECCV 2018 • Sunghun Kang, Junyeong Kim, Hyun-Soo Choi, Sungjin Kim, Chang D. Yoo
The architecture is trained to maximize the correlation between both the hidden states and the predictions of the modal-agnostic pivot stream and the modal-specific stream in the network.
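A simple Pearson-style correlation loss is one plausible instantiation of that objective; the sketch below assumes batched stream outputs of matching dimension and is not the paper's exact formulation.

```python
import torch

def correlation_loss(pivot, specific, eps=1e-8):
    # pivot, specific: (B, d) hidden states (or prediction logits) per stream
    p = pivot - pivot.mean(dim=0)
    s = specific - specific.mean(dim=0)
    corr = (p * s).sum(0) / (p.norm(dim=0) * s.norm(dim=0) + eps)  # (d,) per-unit correlation
    return -corr.mean()  # minimizing this maximizes cross-stream correlation
```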