TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	ActivityNet-QA	Video-LLaVA	Confidence Score	3.3	# 5
Zero-Shot Video Question Answer	ActivityNet-QA	Video-LLaVA	Accuracy	45.3	# 12
Video Question Answering	ActivityNet-QA	Video-LLaVA	Accuracy	45.3	# 16
Video Question Answering	ActivityNet-QA	Video-LLaVA	Confidence score	3.3	# 2
Visual Question Answering	MM-Vet	Video-LLaVA	GPT-4 score	32.0	# 65
Zero-Shot Video Question Answer	MSRVTT-QA	Video-LLaVA-7B	Accuracy	59.2	# 7
Zero-Shot Video Question Answer	MSRVTT-QA	Video-LLaVA-7B	Confidence Score	3.5	# 2
Zero-Shot Video Question Answer	MSVD-QA	Video-LLaVA-7B	Accuracy	70.7	# 6
Zero-Shot Video Question Answer	MSVD-QA	Video-LLaVA-7B	Confidence Score	3.9	# 3
Zero-Shot Video Question Answer	TGIF-QA	Video-LLaVA-7B	Accuracy	70.0	# 3
Zero-Shot Video Question Answer	TGIF-QA	Video-LLaVA-7B	Confidence Score	4.0	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llava-learning-united-visual-1/zeroshot-video-question-answer-on-tgif-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-tgif-qa?p=video-llava-learning-united-visual-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llava-learning-united-visual-1/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=video-llava-learning-united-visual-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llava-learning-united-visual-1/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=video-llava-learning-united-visual-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llava-learning-united-visual-1/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=video-llava-learning-united-visual-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llava-learning-united-visual-1/video-question-answering-on-activitynet-qa)](https://paperswithcode.com/sota/video-question-answering-on-activitynet-qa?p=video-llava-learning-united-visual-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/video-llava-learning-united-visual-1/visual-question-answering-on-mm-vet)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?p=video-llava-learning-united-visual-1)`

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

16 Nov 2023 · Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

Large Vision-Language Models (LVLMs) have enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos that mutually enhance each other. Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM.
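
The idea of aligning visual features before projection can be illustrated with a toy sketch. It assumes that image and video encoders already emit tokens in one shared feature space, so a single projection into the language model's input space can serve both modalities. The dimensions, the random features, and the helper names (`make_projection`, `project`, `flatten_video`) are hypothetical choices for illustration; this is not the authors' implementation.

```python
import random

def make_projection(d_in, d_out, seed=0):
    """Return a d_in x d_out weight matrix with small random entries (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]

def project(tokens, weight):
    """Map each d_in-dimensional token to d_out dimensions (token @ weight)."""
    columns = list(zip(*weight))  # d_out columns, each of length d_in
    return [[sum(x * w for x, w in zip(token, col)) for col in columns]
            for token in tokens]

def flatten_video(frames):
    """Concatenate per-frame token lists into one token sequence."""
    return [token for frame in frames for token in frame]

# Hypothetical sizes: shared visual feature width and LLM embedding width.
D_VISUAL, D_LLM = 8, 16
TOKENS_PER_FRAME, NUM_FRAMES = 4, 2

_rng = random.Random(1)

def random_tokens(n):
    """Stand-in for encoder output: n tokens in the shared visual space."""
    return [[_rng.random() for _ in range(D_VISUAL)] for _ in range(n)]

image_tokens = random_tokens(TOKENS_PER_FRAME)
video_frames = [random_tokens(TOKENS_PER_FRAME) for _ in range(NUM_FRAMES)]

# Assumption: one projection is shared by both modalities because their
# features are taken to be aligned already.
shared_weight = make_projection(D_VISUAL, D_LLM)

image_out = project(image_tokens, shared_weight)
video_out = project(flatten_video(video_frames), shared_weight)

print(len(image_out), len(image_out[0]))  # 4 16
print(len(video_out), len(video_out[0]))  # 8 16
```

Because both inputs pass through the same weights, a frame of video is projected exactly as a still image with the same features would be. That is the property a per-modality projection setup lacks.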

Code

PKU-YuanGroup/Video-LLaVA (official)	2,389
PKU-YuanGroup/MoE-LLaVA	1,672
pku-yuangroup/languagebind	530
pku-yuangroup/video-bench

Tasks

Language Modelling

Large Language Model

Question Answering

Video Question Answering

Visual Question Answering

Visual Question Answering (VQA)

Zero-Shot Video Question Answer

Datasets

MSR-VTT

GQA

MSVD

MMBench

MM-Vet

ActivityNet-QA

TGIF-QA

MSRVTT-QA

MSVD-QA

LLaVA-Bench

Results from the Paper

Ranked #3 on Zero-Shot Video Question Answer on TGIF-QA
