| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|------|---------|-------|-------------|--------------|-------------|
| Visual Question Answering (VQA) | AutoHallusion | LLaVA-1.5 | Overall Accuracy | 44.5 | #4 |
| Visual Question Answering | BenchLMM | LLaVA-1.5-13B | GPT-3.5 score | 55.53 | #3 |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-FT | LLaVA-1.5-13B | Kendall's Tau-c | 0.214 | #4 |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.057 | #5 |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LVLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.002 | #4 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 92.97 | #9 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Intersection over Union | 61.97 | #3 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Intersection over Union | 55.72 | #5 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 99.32 | #2 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 98.58 | #6 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 93.33 | #6 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 72.88 | #9 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 79.10 | #6 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Intersection over Union | 42.31 | #2 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Intersection over Union | 34.32 | #6 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 80.89 | #2 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 70.38 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Overall score | 32.62 | #5 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Deductive | 30.94 | #5 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Abductive | 47.91 | #3 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Analogical | 24.31 | #4 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Params | 13B | #1 |
| Visual Instruction Following | LLaVA-Bench | LLaVA-v1.5-13B | Avg score | 70.7 | #4 |
| Visual Instruction Following | LLaVA-Bench | LLaVA-v1.5-7B | Avg score | 63.4 | #5 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-7B | GPT-4 score | 31.1±0.2 | #157 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-7B | Params | 7B | #1 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-13B | GPT-4 score | 36.3±0.2 | #112 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-13B | Params | 13B | #1 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-13B | GPT-4 score | 33.2±0.1 | #18 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-13B | Params | 13B | #1 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-7B | GPT-4 score | 28.3±0.2 | #19 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-7B | Params | 7B | #1 |
| Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Visual Prompt) | GPT-4 score (bbox) | 41.8 | #6 |
| Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Visual Prompt) | GPT-4 score (human) | 42.9 | #4 |
| Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Coordinates) | GPT-4 score (bbox) | 47.1 | #4 |