1 code implementation • 20 Mar 2024 • Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang
Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks.
Ranked #76 on Visual Question Answering on MM-Vet
no code implementations • 11 Dec 2023 • Lei Zhang, Fangxun Shu, Sucheng Ren, Bingchen Zhao, Hao Jiang, Cihang Xie
The massive growth of web-crawled image-text data inherently brings wide variability in data quality.
no code implementations • 11 Dec 2023 • Fangxun Shu, Lei Zhang, Hao Jiang, Cihang Xie
This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding.
1 code implementation • 2 Dec 2022 • Fangxun Shu, Biaolong Chen, Yue Liao, Shuwen Xiao, Wenyu Sun, Xiaobo Li, Yousong Zhu, Jinqiao Wang, Si Liu
Our MAC reduces the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism, improving pre-training efficiency.
Ranked #37 on Video Retrieval on MSR-VTT-1kA (using extra training data)
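The mask sampling idea above can be illustrated with a minimal sketch: randomly discard a large fraction of spatio-temporal patch tokens so the encoder only processes the visible subset. This is an assumption-laden illustration of the general technique, not the paper's actual implementation; the function name, patch counts, and mask ratio below are all hypothetical.

```python
import random

def sample_visible_patches(num_frames, patches_per_frame, mask_ratio, seed=None):
    """Randomly mask a fraction of spatio-temporal patch tokens.

    Returns the indices of the visible (unmasked) patches. An encoder
    that processes only these tokens cuts pre-training compute roughly
    in proportion to mask_ratio. (Illustrative sketch, not the MAC code.)
    """
    rng = random.Random(seed)
    total = num_frames * patches_per_frame
    num_visible = int(total * (1 - mask_ratio))
    indices = list(range(total))
    rng.shuffle(indices)          # uniform random masking over all tokens
    return sorted(indices[:num_visible])

# Example: 8 frames of 14x14 = 196 patches each, masking 75% of tokens
visible = sample_visible_patches(8, 196, 0.75, seed=0)
```

With a 75% mask ratio, only 392 of the 1568 patch tokens reach the encoder, which is where the efficiency gain in this style of masked pre-training comes from.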