BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 Jan 2023  ·  Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
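
To make the bridging idea concrete, here is a toy sketch of how a small trainable module could sit between a frozen image encoder and a frozen language model. It assumes the bridge uses a fixed set of learned query vectors that attend over the encoder's output features, followed by a linear projection into the language model's embedding width. The abstract does not spell out this mechanism, so the single attention step, the names (`cross_attend`, `project`, `bridge`), and the toy sizes are all illustrative assumptions, not the paper's Querying Transformer.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """One single-head cross-attention step (hypothetical simplification).

    Each query is replaced by an attention-weighted mean of the image features,
    so the output length depends on the number of queries, not on the image.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in image_feats]
        weights = softmax(scores)
        attended.append(
            [sum(w * f[d] for w, f in zip(weights, image_feats)) for d in range(dim)]
        )
    return attended


def project(vectors, weight):
    """Linear map from the bridge width to the language model's embedding width."""
    cols = len(weight[0])
    return [[sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(cols)] for v in vectors]


rng = random.Random(0)
NUM_QUERIES, DIM, LLM_DIM = 4, 8, 16  # toy sizes, chosen only for illustration

# The only "trainable" state in this sketch: the queries and the projection.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
proj_w = [[rng.gauss(0, 0.1) for _ in range(LLM_DIM)] for _ in range(DIM)]


def bridge(image_feats):
    """Turn a variable number of frozen image features into a fixed-length prefix."""
    return project(cross_attend(queries, image_feats), proj_w)


# Stand-ins for frozen encoder outputs with different numbers of patches.
small_image = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]
large_image = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]

prefix_small = bridge(small_image)
prefix_large = bridge(large_image)
print(len(prefix_small), len(prefix_small[0]))  # 4 16
print(len(prefix_large), len(prefix_large[0]))  # 4 16
```

The point of the sketch is the shape contract: however many features the frozen encoder emits, the language model sees the same small number of vectors in its own embedding width, and only the queries and the projection would need gradients.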


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Image Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 92.6 | # 3 |
| | | | Recall@1 | 68.3 | # 1 |
| | | | Recall@5 | 87.7 | # 2 |
| Image Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 91.8 | # 4 |
| | | | Recall@1 | 66.3 | # 3 |
| | | | Recall@5 | 86.5 | # 3 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 98.5 | # 2 |
| | | | Recall@1 | 85.4 | # 1 |
| | | | Recall@5 | 97.0 | # 1 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 98.0 | # 3 |
| | | | Recall@1 | 83.5 | # 2 |
| | | | Recall@5 | 96.0 | # 2 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | # 4 |
| | | | CIDEr | 145.8 | # 4 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4 | 42.4 | # 9 |
| | | | CIDEr | 144.5 | # 8 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | # 5 |
| | | | CIDEr | 145.2 | # 7 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | # 2 |
| | | | Recall@5 | 100 | # 1 |
| | | | Recall@10 | 100 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@5 | 97.6 | # 2 |
| | | | Recall@10 | 98.9 | # 1 |
| | | | Recall@1 | 88.6 | # 2 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@5 | 98.1 | # 1 |
| | | | Recall@10 | 98.9 | # 1 |
| | | | Recall@1 | 89.7 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | # 1 |
| | | | Recall@5 | 100 | # 1 |
| | | | Recall@10 | 100 | # 1 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 44.4 | # 6 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 44.2 | # 7 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 44.7 | # 5 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 33.9 | # 11 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 34.6 | # 10 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | # 9 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 123.7 | # 1 |
| | | | SPICE | 16.3 | # 1 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 123.7 | # 1 |
| | | | SPICE | 15.8 | # 2 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123 | # 3 |
| | | | SPICE | 15.8 | # 2 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 120.2 | # 1 |
| | | | SPICE | 15.9 | # 1 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 119.2 | # 2 |
| | | | SPICE | 15.3 | # 3 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 117.8 | # 3 |
| | | | SPICE | 15.4 | # 2 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 124.4 | # 2 |
| | | | SPICE | 14.8 | # 3 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 124.8 | # 1 |
| | | | SPICE | 15.1 | # 1 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123.4 | # 3 |
| | | | SPICE | 15.1 | # 1 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 119.7 | # 3 |
| | | | SPICE | 15.4 | # 2 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 121.6 | # 1 |
| | | | SPICE | 15.8 | # 1 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 121.0 | # 2 |
| | | | SPICE | 15.3 | # 3 |
| | | | Pre-train (#images) | 1.1B | # 1 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 45.9 | # 12 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 40.7 | # 17 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 39.4 | # 18 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 30.2 | # 22 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 31.7 | # 21 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | # 19 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.66 | # 10 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.74 | # 9 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.30 | # 5 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.3 | # 47 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65 | # 41 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63 | # 46 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 52.3 | # 50 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 52.6 | # 49 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 49.7 | # 53 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 54.3 | # 8 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 53.5 | # 9 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 50.1 | # 10 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.19 | # 1 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.59 | # 2 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.55 | # 3 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65.2 | # 4 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63.1 | # 6 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.6 | # 7 |
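
For readers unfamiliar with the retrieval metric in the rows above, the following is a minimal sketch of how Recall@K is commonly computed: the percentage of queries for which at least one correct item appears among the top K ranked results. The function name `recall_at_k` and the toy rankings are hypothetical and are not taken from the paper's evaluation code, which may differ in details such as tie handling.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   one ranked list of candidate ids per query (best first).
    relevant_ids: one set of correct ids per query.
    """
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_ids)


# Toy text-to-image retrieval run: four caption queries, three candidates each.
rankings = [
    ["img3", "img1", "img7"],  # correct image at rank 2
    ["img2", "img5", "img9"],  # correct image at rank 1
    ["img8", "img4", "img6"],  # correct image at rank 3
    ["img0", "img9", "img1"],  # correct image not retrieved
]
relevant = [{"img1"}, {"img2"}, {"img6"}, {"img5"}]

print(recall_at_k(rankings, relevant, 1))  # 25.0
print(recall_at_k(rankings, relevant, 3))  # 75.0
```

Because a hit at rank 1 is also a hit at every larger K, Recall@K can only stay the same or increase as K grows, which matches the pattern of Recall@1, Recall@5, and Recall@10 in each retrieval group above.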

Methods