Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Zero-Shot Video Question Answer | ActivityNet-QA | Video-ChatGPT | 1:1 Accuracy | 35.2 | # 2 |
Zero-Shot Video Question Answer | ActivityNet-QA | Video-ChatGPT | Score | 2.7 | # 2 |
Zero-Shot Video Question Answer | MSRVTT-QA | Video-ChatGPT | 1:1 Accuracy | 49.3 | # 2 |
Zero-Shot Video Question Answer | MSRVTT-QA | Video-ChatGPT | Score | 2.8 | # 1 |
Zero-Shot Video Question Answer | MSVD-QA | Video-ChatGPT | 1:1 Accuracy | 64.9 | # 1 |
Zero-Shot Video Question Answer | MSVD-QA | Video-ChatGPT | Score | 3.3 | # 1 |
Zero-Shot Video Question Answer | TGIF-QA | Video-ChatGPT | 1:1 Accuracy | 51.4 | # 1 |
Zero-Shot Video Question Answer | TGIF-QA | Video-ChatGPT | Score | 3.0 | # 1 |
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Video-ChatGPT | gpt-score | 2.40 | # 1 |
Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Video-ChatGPT | gpt-score | 1.98 | # 1 |
Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Correctness of Information | 2.4 | # 1 |
Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Detail Orientation | 2.52 | # 1 |
Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Contextual Understanding | 2.62 | # 1 |
Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Temporal Understanding | 1.98 | # 1 |
Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Consistency | 2.37 | # 1 |
Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Video-ChatGPT | gpt-score | 2.52 | # 1 |
Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Video-ChatGPT | gpt-score | 2.62 | # 1 |
Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Video-ChatGPT | gpt-score | 2.37 | # 1 |