InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
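As a quick illustration of the instruction-aware design in practice, below is a minimal zero-shot inference sketch using the Hugging Face transformers port of InstructBLIP; the processor passes the same instruction text both to the instruction-aware Q-Former and to the frozen LLM, so the extracted visual features change with the instruction. The checkpoint name (Salesforce/instructblip-vicuna-7b) and the image path are assumptions for illustration only; the authors' own implementation lives in the LAVIS repository linked above.

```python
# Minimal zero-shot inference sketch with the Hugging Face port of InstructBLIP.
# Assumed checkpoint: Salesforce/instructblip-vicuna-7b; "example.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b"
).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder input image
instruction = "What is unusual about this image?"

# The processor tokenizes the instruction twice: once for the instruction-aware
# Q-Former (qformer_input_ids) and once for the language model (input_ids).
inputs = processor(images=image, text=instruction, return_tensors="pt").to(device)

outputs = model.generate(**inputs, num_beams=5, max_new_tokens=64, do_sample=False)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```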

PDF Abstract (NeurIPS 2023)
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering | BenchLMM | InstructBLIP-7B | GPT-3.5 score | 44.63 | #6 |
| Visual Question Answering | BenchLMM | InstructBLIP-13B | GPT-3.5 score | 45.03 | #5 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Overall score | 28.02 | #8 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Deductive | 27.56 | #8 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Abductive | 37.76 | #7 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Analogical | 20.56 | #7 |
| Visual Question Answering (VQA) | InfiMM-Eval | InstructBLIP | Params | 8B | #1 |
| Visual Instruction Following | LLaVA-Bench | InstructBLIP-13B | avg score | 58.2 | #6 |
| Visual Instruction Following | LLaVA-Bench | InstructBLIP-7B | avg score | 60.9 | #5 |
| Video Question Answering | MVBench | InstructBLIP | Avg. | 32.5 | #9 |
| Visual Question Answering | ViP-Bench | InstructBLIP-13B (Visual Prompt) | GPT-4 score (bbox) | 35.8 | #8 |
| Visual Question Answering | ViP-Bench | InstructBLIP-13B (Visual Prompt) | GPT-4 score (human) | 35.2 | #6 |

Methods