1 code implementation • 28 Mar 2024 • Yan-Bo Lin, Gedas Bertasius
Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes.
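The idea of a single backbone serving both modalities can be sketched as follows. This is a minimal illustration, not the paper's code: a shared weight matrix stands in for the shared ViT, and the modality-specific patch embeddings (`audio_embed`, `video_embed`) are assumed names. Only the small embedding layers differ per modality; the bulk of the parameters is shared.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768

# Modality-specific patch embeddings map each input into a common token space.
audio_embed = rng.standard_normal((128, dim)) * 0.02   # spectrogram patch -> token
video_embed = rng.standard_normal((256, dim)) * 0.02   # image patch -> token

# One shared projection standing in for the full shared ViT backbone.
shared_w = rng.standard_normal((dim, dim)) * 0.02

def backbone(tokens):
    # The same weights process tokens from either modality.
    return np.maximum(tokens @ shared_w, 0.0)

audio_patches = rng.standard_normal((20, 128))         # 20 spectrogram patches
video_patches = rng.standard_normal((50, 256))         # 50 image patches

audio_tokens = backbone(audio_patches @ audio_embed)
video_tokens = backbone(video_patches @ video_embed)
print(audio_tokens.shape, video_tokens.shape)
```

Because both token streams pass through the same `backbone`, scaling the model grows one set of weights instead of two, which is the source of the memory savings described above.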
1 code implementation • 13 Mar 2024 • Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius
Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
1 code implementation • CVPR 2023 • Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
Ranked #4 on Audio-visual Question Answering on MUSIC-AVQA
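The adapter mechanism described above, injecting a small number of trainable parameters into each frozen layer, can be sketched with a residual bottleneck adapter. This is an illustrative sketch in the general adapter style, not the exact LAVISH design; the class name, sizes, and zero-initialized up-projection are assumptions.

```python
import numpy as np

class BottleneckAdapter:
    """Small trainable module added residually to a frozen transformer layer:
    down-project, nonlinearity, up-project. Illustrative, not the paper's code."""
    def __init__(self, dim, bottleneck, rng):
        self.down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.up = np.zeros((bottleneck, dim))   # zero-init: adapter starts as identity

    def __call__(self, x):
        h = np.maximum(x @ self.down, 0.0)      # ReLU bottleneck
        return x + h @ self.up                  # residual connection

rng = np.random.default_rng(0)
dim, bottleneck = 768, 32

# Rough size of one frozen ViT layer's weights vs. the adapter's trainable weights.
frozen_layer_params = 4 * dim * dim
adapter = BottleneckAdapter(dim, bottleneck, rng)
adapter_params = adapter.down.size + adapter.up.size

tokens = rng.standard_normal((10, dim))          # 10 audio/visual tokens
out = adapter(tokens)
print(out.shape, adapter_params / frozen_layer_params)
```

With a bottleneck of 32, the adapter adds roughly 2% of one layer's parameters, which is why only these small modules need gradients while the ViT itself stays frozen.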
1 code implementation • 6 Apr 2022 • Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
We introduce an audiovisual method for long-range text-to-video retrieval.
1 code implementation • NeurIPS 2021 • Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories.
no code implementations • 3 May 2021 • Yan-Bo Lin, Yu-Chiang Frank Wang
Humans perceive a rich auditory experience through the distinct sounds heard by each ear.
no code implementations • 1 Apr 2021 • Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang
Sound localization aims to find the source of the audio signal in the visual scene.
no code implementations • ICCV 2019 • Yu-Jhe Li, Ci-Siang Lin, Yan-Bo Lin, Yu-Chiang Frank Wang
Person re-identification (re-ID) aims at recognizing the same person from images taken across different cameras.
Ranked #16 on Unsupervised Domain Adaptation on Market to Duke
2 code implementations • 20 Feb 2019 • Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang
Audio-visual event localization requires one to identify the event which is both visible and audible in a video (either at a frame or video level).