Improved Baselines with Visual Instruction Tuning

CVPR 2024  ·  Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee ·

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available samples and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
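
As an illustration of the connector change described in the abstract (swapping LLaVA's single linear projection for an MLP between the CLIP vision encoder and the language model), here is a minimal PyTorch sketch. The class name `MLPProjector`, the default dimensions (1024-d CLIP-ViT-L/14-336px patch features, 5120-d embeddings for a 13B LLaMA-style decoder), and the exact activation placement are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Two-layer MLP mapping vision features into the LLM embedding space.

    Hypothetical sketch of the MLP cross-modal connector described in the
    abstract; dimensions are illustrative defaults, not the official config.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the frozen
        # vision encoder; the projected tokens are concatenated with text
        # embeddings before being fed to the language model.
        return self.proj(patch_features)


if __name__ == "__main__":
    # A 336px image with 14px patches yields 24 x 24 = 576 visual tokens.
    dummy = torch.randn(2, 576, 1024)
    print(MLPProjector()(dummy).shape)  # torch.Size([2, 576, 5120])
```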

Results

| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Visual Question Answering (VQA) | AutoHallusion | LLaVA-1.5 | Overall Accuracy | 44.5 | #4 |
| Visual Question Answering | BenchLMM | LLaVA-1.5-13B | GPT-3.5 score | 55.53 | #3 |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-FT | LLaVA-1.5-13B | Kendall's Tau-c | 0.214 | #4 |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.057 | #5 |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LVLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.002 | #4 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 92.97 | #9 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 93.33 | #6 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Intersection over Union | 61.97 | #3 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Intersection over Union | 55.72 | #5 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 99.32 | #2 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 98.58 | #6 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 80.89 | #2 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 79.10 | #6 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Intersection over Union | 42.31 | #2 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Intersection over Union | 34.32 | #6 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy | 72.88 | #9 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy | 70.38 | #11 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Overall score | 32.62 | #5 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Deductive | 30.94 | #5 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Abductive | 47.91 | #3 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Analogical | 24.31 | #4 |
| Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Params | 13B | #1 |
| Visual Instruction Following | LLaVA-Bench | LLaVA-v1.5-13B | Avg score | 70.7 | #4 |
| Visual Instruction Following | LLaVA-Bench | LLaVA-v1.5-7B | Avg score | 63.4 | #5 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-7B | GPT-4 score | 31.1±0.2 | #157 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-7B | Params | 7B | #1 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-13B | GPT-4 score | 36.3±0.2 | #112 |
| Visual Question Answering | MM-Vet | LLaVA-1.5-13B | Params | 13B | #1 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-13B | GPT-4 score | 33.2±0.1 | #18 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-13B | Params | 13B | #1 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-7B | GPT-4 score | 28.3±0.2 | #19 |
| Visual Question Answering | MM-Vet v2 | LLaVA-v1.5-7B | Params | 7B | #1 |
| Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Visual Prompt) | GPT-4 score (bbox) | 41.8 | #6 |
| Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Visual Prompt) | GPT-4 score (human) | 42.9 | #4 |
| Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Coordinates) | GPT-4 score (bbox) | 47.1 | #4 |
