VideoChat: Chat-Centric Video Understanding

10 May 2023  ·  Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface and excels in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, making it a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and the system could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything
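
The abstract does not say what form the learnable neural interface takes. As a rough illustration only, the sketch below assumes the simplest possible stand-in: a single learnable linear projection that maps video tokens into the language model's embedding space and prepends them to the text embeddings. The function names, dimensions, and the choice of a linear layer are all hypothetical and are not taken from the paper or its code.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Hypothetical learnable projection: an out_dim x in_dim weight matrix plus a bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(tokens, weight, bias):
    """Map each video token (length in_dim) into the LLM embedding space (length out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in tokens
    ]


def build_llm_input(video_tokens, text_embeddings, weight, bias):
    """Prepend the projected video tokens to the text embeddings to form one sequence."""
    return project(video_tokens, weight, bias) + text_embeddings


# Toy sizes, chosen only for illustration (assumptions, not values from the paper).
VIDEO_DIM, LLM_DIM = 4, 6
weight, bias = make_linear(VIDEO_DIM, LLM_DIM)
video_tokens = [[0.1 * (i + j) for j in range(VIDEO_DIM)] for i in range(3)]  # 3 video tokens
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]  # 2 text tokens

sequence = build_llm_input(video_tokens, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of width 6
```

In a setup like this, only the projection would be trained while the video encoder and language model could stay frozen; whether VideoChat does this is not stated in the abstract.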

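| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Video Chat | Accuracy | 26.5 | # 31 |
| Video Question Answering | ActivityNet-QA | Video Chat | Confidence Score | 2.2 | # 10 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video Chat | Confidence Score | 2.2 | # 18 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video Chat | Accuracy | 26.5 | # 19 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video Chat-7B | Accuracy | 45.0 | # 21 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video Chat-7B | Confidence Score | 2.5 | # 20 |
| Zero-Shot Video Question Answer | MSVD-QA | Video Chat-7B | Accuracy | 56.3 | # 17 |
| Zero-Shot Video Question Answer | MSVD-QA | Video Chat-7B | Confidence Score | 2.8 | # 17 |
| Video Question Answering | MVBench | VideoChat | Avg. | 35.5 | # 8 |
| Question Answering | NExT-QA (Open-ended VideoQA) | VideoChat | Accuracy | 56.6 | # 3 |
| Question Answering | NExT-QA (Open-ended VideoQA) | VideoChat | Confidence Score | 3.2 | # 3 |
| Zero-Shot Video Question Answer | TGIF-QA | Video Chat-7B | Accuracy | 34.4 | # 9 |
| Zero-Shot Video Question Answer | TGIF-QA | Video Chat-7B | Confidence Score | 2.3 | # 7 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Correctness of Information | 2.23 | # 15 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Detail Orientation | 2.50 | # 15 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Contextual Understanding | 2.53 | # 16 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Temporal Understanding | 1.94 | # 17 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Consistency | 2.24 | # 15 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Mean | 2.29 | # 16 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Video Chat | gpt-score | 2.50 | # 11 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Video Chat | gpt-score | 1.94 | # 13 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Video Chat | gpt-score | 2.53 | # 12 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Video Chat | gpt-score | 2.24 | # 11 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Video Chat | gpt-score | 2.23 | # 11 |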
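
The reported VideoInstruct mean can be checked against the five per-criterion scores listed for Video Chat in the combined benchmark rows. The short sketch below assumes the mean is a plain unweighted average of those five values, which is an assumption about how the leaderboard computes it; the dictionary keys simply mirror the metric names above.

```python
# Per-criterion VideoInstruct scores for Video Chat, copied from the results above.
scores = {
    "Correctness of Information": 2.23,
    "Detail Orientation": 2.50,
    "Contextual Understanding": 2.53,
    "Temporal Understanding": 1.94,
    "Consistency": 2.24,
}

# Assumption: the leaderboard "mean" is an unweighted average of the five criteria.
mean_score = sum(scores.values()) / len(scores)
print(round(mean_score, 2))  # 2.29, matching the reported mean
```

Under that assumption the average is 11.44 / 5 = 2.288, which rounds to the listed 2.29.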
