Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4-generated visual instruction tuning data, our model, and our code base publicly available.
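The core architectural idea in the abstract — bridging a vision encoder and an LLM so the combined model can be instruction-tuned end-to-end on image-text data — can be summarized in a short sketch. The PyTorch below is a minimal illustration, not the released LLaVA implementation: the class name, module names, and dimensions (`LLaVASketch`, `vision_dim`, `llm_dim`) are assumptions, and it presumes an LLM that accepts precomputed `inputs_embeds`, as Hugging Face decoder models do.

```python
# Minimal sketch of a LLaVA-style model: a frozen vision encoder's patch
# features are mapped by a trainable projection into the LLM's token-embedding
# space and prepended to the embedded text instruction. Names and dimensions
# are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn

class LLaVASketch(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a CLIP ViT, kept frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # trainable connector
        self.llm = llm                         # decoder-only language model

    def forward(self, pixel_values, input_embeds, labels=None):
        # Encode the image into a sequence of patch features (no gradients
        # flow into the frozen encoder).
        with torch.no_grad():
            patch_feats = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        # Project visual features into the LLM embedding space.
        visual_tokens = self.projector(patch_feats)           # (B, N, llm_dim)
        # Prepend visual tokens to the embedded text tokens and run the LLM.
        inputs = torch.cat([visual_tokens, input_embeds], dim=1)
        return self.llm(inputs_embeds=inputs, labels=labels)
```

In this setup, instruction tuning amounts to training the projector (and optionally the LLM) with the usual next-token loss on the GPT-4-generated conversations, while the vision encoder stays fixed.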


Results from the Paper


Ranked #4 on MMR total on MRR-Benchmark (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering | BenchLMM | LLaVA-1.5-7B | GPT-3.5 score | 46.83 | #4 |
| Visual Question Answering | BenchLMM | LLaVA-1-13B | GPT-3.5 score | 43.50 | #7 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy | 86.87 | #16 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-v1 (w/ LoRA, w/ extra data) | Intersection over Union | 21.81 | #15 |
| Referring Expression Comprehension | ColonINST-v1 (Seen) | LLaVA-v1 (w/ LoRA, w/o extra data) | Intersection over Union | 20.05 | #16 |
| Referring Expression Generation | ColonINST-v1 (Seen) | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy | 84.55 | #17 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy | 87.86 | #16 |
| Image Classification | ColonINST-v1 (Seen) | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy | 89.61 | #15 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy | 68.11 | #16 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy | 42.17 | #17 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-v1 (w/ LoRA, w/o extra data) | Intersection over Union | 12.72 | #16 |
| Image Classification | ColonINST-v1 (Unseen) | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy | 72.08 | #15 |
| Referring Expression Comprehension | ColonINST-v1 (Unseen) | LLaVA-v1 (w/ LoRA, w/ extra data) | Intersection over Union | 3.24 | #17 |
| Referring Expression Generation | ColonINST-v1 (Unseen) | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy | 46.85 | #17 |
| MMR total | MRR-Benchmark | LLaVA-1.5-13B | Total Column Score | 243 | #11 |
| MMR total | MRR-Benchmark | LLaVA-NEXT-34B | Total Column Score | 412 | #4 |
| MMR total | MRR-Benchmark | LLaVA-NEXT-13B | Total Column Score | 335 | #8 |
| Video Question Answering | MVBench | LLaVA | Avg. | 36.0 | #16 |

Methods