InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
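As a quick illustration of the instruction-aware design in practice, below is a minimal zero-shot inference sketch using the Hugging Face transformers port of InstructBLIP; the processor passes the same instruction text both to the instruction-aware Q-Former and to the frozen LLM, so the extracted visual features change with the instruction. The checkpoint name (Salesforce/instructblip-vicuna-7b) and the image path are assumptions for illustration only; the authors' own implementation lives in the LAVIS repository linked above.

```python
# Minimal zero-shot inference sketch with the Hugging Face port of InstructBLIP.
# Assumed checkpoint: Salesforce/instructblip-vicuna-7b; "example.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b"
).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder input image
instruction = "What is unusual about this image?"

# The processor tokenizes the instruction twice: once for the instruction-aware
# Q-Former (qformer_input_ids) and once for the language model (input_ids).
inputs = processor(images=image, text=instruction, return_tensors="pt").to(device)

outputs = model.generate(**inputs, num_beams=5, max_new_tokens=64, do_sample=False)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```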

PDF Abstract (NeurIPS 2023)
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering | BenchLMM | InstructBLIP-7B | GPT-3.5 score | 44.63 | #6 |
| Visual Question Answering | BenchLMM | InstructBLIP-13B | GPT-3.5 score | 45.03 | #5 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Overall score | 28.02 | #8 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Deductive | 27.56 | #8 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Abductive | 37.76 | #7 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Analogical | 20.56 | #7 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Params | 8B | #1 |
| Visual Instruction Following | LLaVA-Bench | InstructBLIP-13B | avg score | 58.2 | #6 |
| Visual Instruction Following | LLaVA-Bench | InstructBLIP-7B | avg score | 60.9 | #5 |
| Video Question Answering | MVBench | InstructBLIP | Avg. | 32.5 | #9 |
| Visual Question Answering | ViP-Bench | InstructBLIP-13B (Visual Prompt) | GPT-4 score (bbox) | 35.8 | #8 |
| Visual Question Answering | ViP-Bench | InstructBLIP-13B (Visual Prompt) | GPT-4 score (human) | 35.2 | #6 |

Methods