Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction-sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
Tasks: Video Question Answering, Zero-Shot Video Question Answer, Video-based Generative Performance Benchmarking

Datasets introduced in the paper: VideoInstruct

Datasets used in the paper: ActivityNet-QA, TGIF-QA, MSRVTT-QA, MSVD-QA, MVBench

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Video-ChatGPT | Accuracy | 35.2 | # 27 |
| Video Question Answering | ActivityNet-QA | Video-ChatGPT | Confidence Score | 2.7 | # 8 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video-ChatGPT | Confidence Score | 2.7 | # 13 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video-ChatGPT | Accuracy | 35.2 | # 13 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video-ChatGPT-7B | Accuracy | 49.3 | # 16 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video-ChatGPT-7B | Confidence Score | 2.8 | # 14 |
| Zero-Shot Video Question Answer | MSVD-QA | Video-ChatGPT-7B | Accuracy | 64.9 | # 12 |
| Zero-Shot Video Question Answer | MSVD-QA | Video-ChatGPT-7B | Confidence Score | 3.3 | # 11 |
| Video Question Answering | MVBench | Video-ChatGPT | Avg. | 32.7 | # 8 |
| Zero-Shot Video Question Answer | TGIF-QA | Video-ChatGPT-7B | Accuracy | 51.4 | # 5 |
| Zero-Shot Video Question Answer | TGIF-QA | Video-ChatGPT-7B | Confidence Score | 3.0 | # 5 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Video-ChatGPT | gpt-score | 2.40 | # 7 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Correctness of Information | 2.40 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Detail Orientation | 2.52 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Contextual Understanding | 2.62 | # 12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Temporal Understanding | 1.98 | # 12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Consistency | 2.37 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Mean | 2.38 | # 12 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Video-ChatGPT | gpt-score | 1.98 | # 8 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Video-ChatGPT | gpt-score | 2.52 | # 7 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Video-ChatGPT | gpt-score | 2.62 | # 8 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Video-ChatGPT | gpt-score | 2.37 | # 7 |