Improved Baselines with Visual Instruction Tuning

5 Oct 2023 · Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available training samples and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be made publicly available.

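The cross-modal connector mentioned in the abstract replaces LLaVA's original linear projection with a small MLP that maps frozen CLIP visual features into the language model's embedding space. Below is a minimal PyTorch sketch of such a two-layer GELU projector; the class name, dimensions (1024 for CLIP-ViT-L features, 5120 for a 13B decoder's hidden size), and the example formatting prompt are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP mapping CLIP visual tokens into the LLM embedding space.

    Dimensions are illustrative: 1024 matches CLIP-ViT-L/14 patch features,
    5120 matches the hidden size of a 13B LLaMA-family decoder.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, vision_dim) from the frozen vision encoder
        return self.proj(visual_tokens)

# A 336px input with 14px patches yields a 24x24 = 576-token grid per image.
features = torch.randn(2, 576, 1024)
tokens_for_llm = MLPProjector()(features)
print(tokens_for_llm.shape)  # torch.Size([2, 576, 5120])

# A simple response-formatting prompt of the kind described in the abstract,
# appended to short-answer VQA questions (exact wording may differ from the paper):
VQA_FORMAT_PROMPT = "Answer the question using a single word or phrase."
```

The projected visual tokens are concatenated with the text token embeddings and fed to the language model, so the only new trainable pieces beyond the LLM itself are these two linear layers.
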
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Visual Question Answering | BenchLMM | LLaVA-1.5-13B | GPT-3.5 score | 55.53 | #3
Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-FT | LLaVA-1.5-13B | Kendall's Tau-c | 0.214 | #4
Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.057 | #5
Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LVLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.002 | #4
Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Overall score | 32.62 | #5
Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Deductive | 30.94 | #5
Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Abductive | 47.91 | #3
Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Analogical | 24.31 | #4
Visual Question Answering (VQA) | InfiMM-Eval | LLaVA-1.5 | Params | 13B | #1
Visual Instruction Following | LLaVA-Bench | LLaVA-v1.5-7B | avg score | 63.4 | #4
Visual Instruction Following | LLaVA-Bench | LLaVA-v1.5-13B | avg score | 70.7 | #3
Visual Question Answering | MM-Vet | LLaVA-1.5-7B | GPT-4 score | 31.1±0.2 | #75
Visual Question Answering | MM-Vet | LLaVA-1.5-7B | Params | 7B | #1
Visual Question Answering | MM-Vet | LLaVA-1.5-13B | GPT-4 score | 36.3±0.2 | #52
Visual Question Answering | MM-Vet | LLaVA-1.5-13B | Params | 13B | #1
Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Visual Prompt) | GPT-4 score (bbox) | 41.8 | #6
Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Visual Prompt) | GPT-4 score (human) | 42.9 | #4
Visual Question Answering | ViP-Bench | LLaVA-1.5-13B (Coordinates) | GPT-4 score (bbox) | 47.1 | #4
