1 code implementation • 10 Aug 2023 • Guangyao Li, Wenxuan Hou, Di Hu
Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering questions about the content of interest.
Ranked #2 on Audio-Visual Question Answering (AVQA) on AVQA
Audio-visual Question Answering, Audio-Visual Question Answering (AVQA), +2