Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction-sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
Tasks: Video Question Answering, Zero-Shot Video Question Answer, Video-based Generative Performance Benchmarking

Datasets introduced in the paper: VideoInstruct

Datasets used in the paper: ActivityNet-QA, TGIF-QA, MSRVTT-QA, MSVD-QA, MVBench

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Video-ChatGPT | Accuracy | 35.2 | # 27 |
| Video Question Answering | ActivityNet-QA | Video-ChatGPT | Confidence Score | 2.7 | # 8 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video-ChatGPT | Confidence Score | 2.7 | # 13 |
| Zero-Shot Video Question Answer | ActivityNet-QA | Video-ChatGPT | Accuracy | 35.2 | # 13 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video-ChatGPT-7B | Accuracy | 49.3 | # 16 |
| Zero-Shot Video Question Answer | MSRVTT-QA | Video-ChatGPT-7B | Confidence Score | 2.8 | # 14 |
| Zero-Shot Video Question Answer | MSVD-QA | Video-ChatGPT-7B | Accuracy | 64.9 | # 12 |
| Zero-Shot Video Question Answer | MSVD-QA | Video-ChatGPT-7B | Confidence Score | 3.3 | # 11 |
| Video Question Answering | MVBench | Video-ChatGPT | Avg. | 32.7 | # 8 |
| Zero-Shot Video Question Answer | TGIF-QA | Video-ChatGPT-7B | Accuracy | 51.4 | # 5 |
| Zero-Shot Video Question Answer | TGIF-QA | Video-ChatGPT-7B | Confidence Score | 3.0 | # 5 |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Video-ChatGPT | gpt-score | 2.40 | # 7 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Correctness of Information | 2.40 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Detail Orientation | 2.52 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Contextual Understanding | 2.62 | # 12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Temporal Understanding | 1.98 | # 12 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Consistency | 2.37 | # 11 |
| Video-based Generative Performance Benchmarking | VideoInstruct | Video-ChatGPT | Mean | 2.38 | # 12 |
| Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Video-ChatGPT | gpt-score | 1.98 | # 8 |
| Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Video-ChatGPT | gpt-score | 2.52 | # 7 |
| Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Video-ChatGPT | gpt-score | 2.62 | # 8 |
| Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Video-ChatGPT | gpt-score | 2.37 | # 7 |