TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	ActivityNet-QA	Video LLaMA	Confidence Score	1.1	# 17
Zero-Shot Video Question Answer	ActivityNet-QA	Video LLaMA	Accuracy	12.4	# 19
Zero-Shot Video Question Answer	MSRVTT-QA	Video LLaMA-7B	Accuracy	29.6	# 21
Zero-Shot Video Question Answer	MSRVTT-QA	Video LLaMA-7B	Confidence Score	1.8	# 19
Zero-Shot Video Question Answer	MSVD-QA	Video LLaMA-7B	Accuracy	51.6	# 17
Zero-Shot Video Question Answer	MSVD-QA	Video LLaMA-7B	Confidence Score	2.5	# 16
Video Question Answering	MVBench	VideoLLaMA	Avg.	34.1	# 8
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	Video LLaMA	gpt-score	2.16	# 13
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	Video LLaMA	gpt-score	1.79	# 13
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	Video LLaMA	gpt-score	1.96	# 13
Video-based Generative Performance Benchmarking	VideoInstruct	Video LLaMA	Correctness of Information	1.96	# 16
Video-based Generative Performance Benchmarking	VideoInstruct	Video LLaMA	Detail Orientation	2.18	# 16
Video-based Generative Performance Benchmarking	VideoInstruct	Video LLaMA	Contextual Understanding	2.16	# 16
Video-based Generative Performance Benchmarking	VideoInstruct	Video LLaMA	Temporal Understanding	1.82	# 16
Video-based Generative Performance Benchmarking	VideoInstruct	Video LLaMA	Consistency	1.79	# 16
Video-based Generative Performance Benchmarking	VideoInstruct	Video LLaMA	mean	1.98	# 16
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	Video LLaMA	gpt-score	1.82	# 13
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	Video LLaMA	gpt-score	2.18	# 13
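
The "mean" row for VideoInstruct can be cross-checked against the five per-dimension scores in the table. The short check below assumes "mean" is the unweighted arithmetic mean of those five values; the page does not state how it is computed, but the reported numbers agree with that reading.

```python
# Per-dimension VideoInstruct scores for Video LLaMA, copied from the table above.
scores = {
    "Correctness of Information": 1.96,
    "Detail Orientation": 2.18,
    "Contextual Understanding": 2.16,
    "Temporal Understanding": 1.82,
    "Consistency": 1.79,
}

# Assumption: the "mean" row is the unweighted average of the five dimensions.
mean_score = sum(scores.values()) / len(scores)
print(round(mean_score, 2))  # 1.98
```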

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=video-llama-an-instruction-tuned-audio-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llama-an-instruction-tuned-audio-visual/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=video-llama-an-instruction-tuned-audio-visual)`

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

5 Jun 2023 · Hang Zhang, Xin Li, Lidong Bing

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in video. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and a frozen LLM. Unlike previous work that augments LLMs to process either visual or audio signals alone, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-former that assembles a pre-trained image encoder into our video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we use ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both the visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune the model on visual-instruction datasets that are moderate in size but higher in quality. We find that Video-LLaMA can perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos.
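
To make the Video Q-former idea more concrete, here is a minimal pure-Python sketch of the general pattern a Q-former-style module follows: a small set of query vectors cross-attends over per-frame features and returns a fixed number of video tokens. This is not the authors' implementation. The function names, the toy sizes, the single-head attention without learned projections, and the step that adds a temporal position vector to each frame are all assumptions made for illustration; the abstract only says that the Video Q-former captures temporal changes on top of a frozen image encoder. The projection into the LLM's embedding space is not shown.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head dot-product cross-attention (no learned projections).

    Each query vector pools over all feature vectors, so the output has
    one vector per query regardless of how many features come in.
    """
    dim = len(queries[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
            for feat in features
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
        )
    return pooled


def video_qformer_sketch(frame_features, temporal_positions, queries):
    """Hypothetical stand-in for a Video Q-former.

    Assumption: a temporal position vector is added to each frame feature
    so that attention can distinguish frame order.
    """
    positioned = [
        [f + p for f, p in zip(frame, pos)]
        for frame, pos in zip(frame_features, temporal_positions)
    ]
    return cross_attend(queries, positioned)


def random_vectors(count, dim):
    """Random placeholder vectors; real ones would come from trained modules."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
NUM_FRAMES, NUM_QUERIES, DIM = 4, 2, 8  # toy sizes, not the paper's

frame_features = random_vectors(NUM_FRAMES, DIM)      # stands in for frozen image-encoder output
temporal_positions = random_vectors(NUM_FRAMES, DIM)  # would be learned in practice
queries = random_vectors(NUM_QUERIES, DIM)            # would be learned in practice

video_tokens = video_qformer_sketch(frame_features, temporal_positions, queries)
print(len(video_tokens), len(video_tokens[0]))  # 2 8
```

The number of output tokens depends only on the number of queries, not on the number of frames, which is what lets a module of this kind hand a fixed-length sequence to a language model for clips of different lengths.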

Code

damo-nlp-sg/video-llama (official), 2,455 stars

Tasks

Language Modelling

Text Generation

Video-based Generative Performance Benchmarking

Video-based Generative Performance Benchmarking (Consistency)

Video-based Generative Performance Benchmarking (Contextual Understanding)

Video-based Generative Performance Benchmarking (Correctness of Information)

Video-based Generative Performance Benchmarking (Detail Orientation)

Video-based Generative Performance Benchmarking (Temporal Understanding)

Video Question Answering

Video Understanding

Zero-Shot Video Question Answer

Datasets

ActivityNet-QA, MSRVTT-QA, MSVD-QA, VideoInstruct, MVBench

Results from the Paper

Ranked #8 on Video Question Answering on MVBench

Methods

ALIGN
