VideoChat: Chat-Centric Video Understanding

10 May 2023 · Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, which we call VideoChat. It integrates video foundation models and large language models via a learnable neural interface, and excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications; it could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything

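| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Video Chat | Accuracy | 26.5 | #31 |
| Video Question Answering | ActivityNet-QA | Video Chat | Confidence Score | 2.2 | #10 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video Chat | Confidence Score | 2.2 | #16 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video Chat | Accuracy | 26.5 | #17 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video Chat-7B | Accuracy | 45.0 | #19 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video Chat-7B | Confidence Score | 2.5 | #18 |
| Zero-Shot Video Question Answer | MSVD-QA | Video Chat-7B | Accuracy | 56.3 | #15 |
| Zero-Shot Video Question Answer | MSVD-QA | Video Chat-7B | Confidence Score | 2.8 | #15 |
| Video Question Answering | MVBench | VideoChat | Avg. | 35.5 | #7 |
| Question Answering | NExT-QA (Open-ended VideoQA) | VideoChat | Accuracy | 56.6 | #1 |
| Question Answering | NExT-QA (Open-ended VideoQA) | VideoChat | Confidence Score | 3.2 | #1 |
| Zero-Shot Video Question Answer | TGIF-QA | Video Chat-7B | Accuracy | 34.4 | #9 |
| Zero-Shot Video Question Answer | TGIF-QA | Video Chat-7B | Confidence Score | 2.3 | #7 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Correctness of Information | 2.23 | #13 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Detail Orientation | 2.50 | #13 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Contextual Understanding | 2.53 | #14 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Temporal Understanding | 1.94 | #15 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Consistency | 2.24 | #13 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | mean | 2.29 | #14 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Video Chat | gpt-score | 2.50 | #10 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Video Chat | gpt-score | 1.94 | #12 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Video Chat | gpt-score | 2.53 | #11 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Video Chat | gpt-score | 2.24 | #10 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Video Chat | gpt-score | 2.32 | #10 |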
