Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4-generated visual instruction tuning data, our model, and code base publicly available.
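Because the GPT-4 used for data generation is language-only, each image must first be rendered as text. A minimal sketch of that recipe, assuming the OpenAI Python client: the image is presented to GPT-4 as captions plus object bounding boxes (the symbolic representations the paper describes), and the model is asked to write an instruction-following conversation about it. The prompt wording and the helper `generate_instruction_data` are hypothetical illustrations, not the released pipeline.

```python
# Sketch: generating instruction-following data with language-only GPT-4.
# The prompt text and function name are assumptions; only the overall idea
# (captions + boxes as a textual stand-in for the image) follows the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_instruction_data(captions, boxes):
    context = "Captions:\n" + "\n".join(captions)
    context += "\nObjects (category, normalized box):\n"
    context += "\n".join(f"{c}: {b}" for c, b in boxes)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are given textual descriptions of an image. "
                        "Write a multi-turn Q&A conversation about the image "
                        "as if you could see it."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content

print(generate_instruction_data(
    ["A man rides a bicycle down a city street."],
    [("person", [0.31, 0.20, 0.55, 0.90]),
     ("bicycle", [0.28, 0.45, 0.60, 0.95])],
))
```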

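On the modeling side, the connection between the vision encoder and the LLM can be pictured as a learned projection that maps patch features into the LLM's token embedding space, with the projected visual tokens prepended to the instruction tokens. The PyTorch sketch below illustrates this under assumed dimensions and module names; the single linear projection mirrors the paper's design, but everything else here is illustrative rather than the released implementation.

```python
# Minimal sketch of a LLaVA-style vision-language connector.
# Assumptions: feature dimensions, module names, and the dummy inputs
# are illustrative only.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Learned projection mapping vision features into the LLM's
        # word-embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeddings):
        # patch_features: (batch, num_patches, vision_dim) from a frozen
        # vision encoder (a CLIP ViT in the paper).
        # text_embeddings: (batch, num_tokens, llm_dim) from the LLM's
        # embedding table for the tokenized instruction.
        visual_tokens = self.proj(patch_features)
        # Prepend visual tokens so the LLM attends to them as a prefix.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Usage: feed the concatenated sequence to the LLM as input embeddings.
connector = VisionLanguageConnector()
img = torch.randn(1, 256, 1024)   # dummy patch features
txt = torch.randn(1, 32, 4096)    # dummy instruction embeddings
seq = connector(img, txt)         # shape: (1, 288, 4096)
```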

Datasets

Introduced in the Paper:

LLaVA-Bench

Used in the Paper:

MVBench, BenchLMM
Task                       Dataset   Model          Metric         Value  Global Rank
Visual Question Answering  BenchLMM  LLaVA-1.5-7B   GPT-3.5 score  46.83  #4
Visual Question Answering  BenchLMM  LLaVA-1-13B    GPT-3.5 score  43.50  #7
Video Question Answering   MVBench   LLaVA          Avg.           36.0   #5
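The GPT-judged metrics in the table, like the 85.1% relative score quoted in the abstract, follow an LLM-as-judge protocol: a strong model rates both the candidate's answer and a reference answer to the same questions, and the reported number is the ratio of the two totals. A minimal sketch under that assumption (the 1-10 rating scale and the simple summing are illustrative choices, not the exact evaluation script):

```python
# Sketch of a GPT-judged relative score: a judge model rates the candidate
# and a reference (e.g., GPT-4) per question; the metric is the ratio of
# total ratings. The 1-10 scale and aggregation are assumptions.
def relative_score(candidate_ratings, reference_ratings):
    # Each list holds per-question judge ratings, e.g. on a 1-10 scale.
    return 100.0 * sum(candidate_ratings) / sum(reference_ratings)

print(relative_score([7, 8, 6], [8, 9, 8]))  # -> 84.0 (illustrative)
```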

Methods