1 code implementation • 27 Feb 2023 • Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng Yan, Tat-Seng Chua
CoVGT's uniqueness and superiority are three-fold: 1) it proposes a dynamic graph transformer module that encodes video by explicitly capturing visual objects, their relations, and their dynamics for complex spatio-temporal reasoning.
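A minimal PyTorch sketch of that idea (all names and shapes are illustrative assumptions, not the paper's implementation): detected object features in each frame form graph nodes, a learned pairwise affinity matrix stands in for the relations, and a temporal transformer models the dynamics.

```python
import torch
import torch.nn as nn

class DynamicGraphLayer(nn.Module):
    """Toy graph-attention layer over per-frame object nodes.

    Each frame contributes N object features; a learned pairwise
    adjacency (the "relations") reweights message passing, and a
    temporal transformer mixes frames (the "dynamics").
    """
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, objs: torch.Tensor) -> torch.Tensor:
        # objs: (B, T, N, D) -- B clips, T frames, N objects, D-dim features
        B, T, N, D = objs.shape
        x = objs.reshape(B * T, N, D)
        # Per-frame relation graph: softmax-normalized pairwise affinities.
        adj = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) / D ** 0.5, dim=-1)
        x = adj @ self.v(x)                 # spatial message passing over objects
        x = x.mean(dim=1).reshape(B, T, D)  # pool objects into a frame token
        return self.temporal(x)             # temporal reasoning across frames
```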
2 code implementations • 26 Jul 2022 • Yicong Li, Xiang Wang, Junbin Xiao, Tat-Seng Chua
Specifically, the equivariant grounding encourages the answer to be sensitive to semantic changes in the causal scene and the question; in contrast, the invariant grounding enforces the answer to be insensitive to changes in the environment scene.
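A minimal sketch of the two objectives (hypothetical function names and loss forms; the paper's actual formulation may differ):

```python
import torch
import torch.nn.functional as F

def grounding_losses(logits_causal, logits_env, logits_full, labels):
    """Toy equivariant/invariant grounding objectives (illustrative only).

    - Equivariant term: the causal scene alone should carry the answer,
      so changing it should change the prediction (approximated here by
      a supervised loss on the causal-scene logits).
    - Invariant term: the environment scene should not move the answer,
      enforced here by matching env-conditioned logits to the full-video
      prediction (pushing them toward a uniform distribution is another
      common choice).
    """
    equi = F.cross_entropy(logits_causal, labels)
    inva = F.kl_div(F.log_softmax(logits_env, dim=-1),
                    F.softmax(logits_full.detach(), dim=-1),
                    reduction="batchmean")
    return equi + inva
```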
2 code implementations • 12 Jul 2022 • Junbin Xiao, Pan Zhou, Tat-Seng Chua, Shuicheng Yan
VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module that encodes video by explicitly capturing visual objects, their relations, and their dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers to compare relevance between the video and text for QA, instead of an entangled cross-modal Transformer for answer classification (a sketch of this relevance comparison follows the leaderboard note below).
Ranked #2 on Video Question Answering on NExT-QA (using extra training data)
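A minimal sketch of relevance-based answering (hypothetical names, not VGT's actual code): the video and each candidate answer are embedded by separate encoders, and QA reduces to picking the candidate most similar to the video.

```python
import torch
import torch.nn.functional as F

def answer_by_relevance(video_emb, cand_embs):
    """Pick the answer whose text embedding is most similar to the video.

    video_emb: (B, D) pooled output of a video encoder.
    cand_embs: (B, K, D) embeddings of K candidate answers from a separate
    text encoder; the encoders are disentangled and only cosine similarity
    couples the two modalities.
    """
    scores = torch.einsum("bd,bkd->bk",
                          F.normalize(video_emb, dim=-1),
                          F.normalize(cand_embs, dim=-1))
    return scores.argmax(dim=-1)  # index of the most relevant candidate
```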
1 code implementation • CVPR 2022 • Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua
At its core is understanding the alignment between visual scenes in the video and linguistic semantics in the question to yield the answer.
1 code implementation • 2 Mar 2022 • Yaoyao Zhong, Junbin Xiao, Wei Ji, Yicong Li, Weihong Deng, Tat-Seng Chua
Video Question Answering (VideoQA) aims to answer natural language questions about given videos.
1 code implementation • 12 Dec 2021 • Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, Tat-Seng Chua
To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy that weaves together visual facts of different granularities in a level-wise manner, guided by corresponding textual cues.
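A toy sketch of one level of such a hierarchy (hypothetical module, not the paper's implementation): fine-grained visual nodes are pooled into coarser nodes through attention whose queries are conditioned on the question embedding, and stacking levels weaves objects into actions and actions into events.

```python
import torch
import torch.nn as nn

class ConditionalLevel(nn.Module):
    """One level of a toy conditional hierarchy: aggregate finer-grained
    visual nodes into coarser ones, using the question embedding to
    condition the attention queries (the textual "guidance")."""
    def __init__(self, dim: int, out_nodes: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(out_nodes, dim))
        self.cond = nn.Linear(dim, dim)  # inject the question into the queries
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, nodes, question):
        # nodes: (B, N, D) fine-grained facts; question: (B, D)
        q = self.queries.unsqueeze(0) + self.cond(question).unsqueeze(1)
        coarse, _ = self.attn(q, nodes, nodes)  # question-conditioned pooling
        return coarse                           # (B, out_nodes, D)

# Level-wise stacking (illustrative): objects -> actions -> events
#   actions = ConditionalLevel(d, 8)(object_nodes, q_emb)
#   events  = ConditionalLevel(d, 2)(actions, q_emb)
```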
1 code implementation • CVPR 2021 • Junbin Xiao, Xindi Shang, Angela Yao, Tat-Seng Chua
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining temporal actions.
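NExT-QA is a multiple-choice benchmark whose questions are grouped into causal, temporal, and descriptive types; a minimal sketch of per-type accuracy over a hypothetical prediction record format:

```python
from collections import defaultdict

def per_type_accuracy(records):
    """Per-question-type accuracy for a NExT-QA-style multi-choice benchmark.

    `records` is an assumed (hypothetical) schema of dicts like
    {"type": "causal" | "temporal" | "descriptive",
     "pred": int, "answer": int}  # indices into the candidate answers.
    """
    hit, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        hit[r["type"]] += int(r["pred"] == r["answer"])
    return {t: hit[t] / total[t] for t in total}
```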
1 code implementation • ECCV 2020 • Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, Tat-Seng Chua
In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV).
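In vRGV, a queried relation triplet (subject, predicate, object) is to be localized as spatio-temporal tubes. A simplified sketch of IoU-based tube matching (the threshold protocol here is illustrative, not the paper's exact metric):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def tube_grounded(pred_tube, gt_tube, thresh=0.5):
    """A tube is a {frame_id: box} dict; call the grounding correct when
    the mean per-frame IoU over ground-truth frames clears `thresh`
    (a simplified stand-in for the paper's evaluation protocol)."""
    ious = [iou(pred_tube.get(f, (0, 0, 0, 0)), box)
            for f, box in gt_tube.items()]
    return sum(ious) / len(ious) >= thresh
```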