no code implementations • 28 Dec 2023 • Houlun Chen, Xin Wang, Hong Chen, Zihan Song, Jia Jia, Wenwu Zhu
To tackle these challenges, in this work we propose a Grounding-Prompter method, which is capable of conducting Temporal Sentence Grounding (TSG) in long videos by prompting an LLM with multimodal information.
no code implementations • 21 Dec 2023 • Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Zihan Song, Yuwei Zhou, Wenwu Zhu
Recently, researchers have investigated the capability of LLMs to handle videos and proposed several video LLMs.
1 code implementation • 30 Nov 2023 • Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu
Large language models (LLMs) have shown remarkable text understanding capabilities, and they have been extended into Video LLMs that handle video data and comprehend visual details.
Tasks: Dense Video Captioning · Video-based Generative Performance Benchmarking (Consistency) · +5