| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Image Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 92.6 | # 3 |
| Image Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@1 | 68.3 | # 1 |
| Image Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@5 | 87.7 | # 2 |
| Image Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 91.8 | # 4 |
| Image Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@1 | 66.3 | # 3 |
| Image Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@5 | 86.5 | # 3 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 98.5 | # 2 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@1 | 85.4 | # 1 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-G (fine-tuned) | Recall@5 | 97.0 | # 1 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 98.0 | # 3 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@1 | 83.5 | # 2 |
| Image-to-Text Retrieval | COCO | BLIP-2 ViT-L (fine-tuned) | Recall@5 | 96.0 | # 2 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | # 4 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 145.8 | # 4 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4 | 42.4 | # 9 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 144.5 | # 8 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | # 5 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 145.2 | # 7 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | # 2 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@5 | 100 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@10 | 100 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@5 | 97.6 | # 2 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@10 | 98.9 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 88.6 | # 2 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@5 | 98.1 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@10 | 98.9 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 89.7 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@5 | 100 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@10 | 100 | # 1 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 44.4 | # 6 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 44.2 | # 7 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 44.7 | # 5 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 33.9 | # 11 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 34.6 | # 10 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | # 9 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 123.7 | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 16.3 | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 123.7 | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.8 | # 2 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123 | # 3 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.8 | # 2 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 120.2 | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.9 | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 119.2 | # 2 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.3 | # 3 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 117.8 | # 3 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.4 | # 2 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 124.4 | # 2 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 14.8 | # 3 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 124.8 | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.1 | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123.4 | # 3 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.1 | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 119.7 | # 3 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.4 | # 2 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 121.6 | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.8 | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 121.0 | # 2 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.3 | # 3 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 45.9 | # 12 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 40.7 | # 17 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 39.4 | # 18 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 30.2 | # 22 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 31.7 | # 21 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | # 19 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.66 | # 10 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.74 | # 9 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.30 | # 5 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.3 | # 47 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65 | # 41 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63 | # 46 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 52.3 | # 50 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 52.6 | # 49 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 49.7 | # 53 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 54.3 | # 8 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 53.5 | # 9 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 50.1 | # 10 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.19 | # 1 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.59 | # 2 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.55 | # 3 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65.2 | # 4 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63.1 | # 6 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.6 | # 7 |
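
The retrieval rows in this table report Recall@K as a percentage. As a rough illustration of how such a number is typically obtained, the sketch below assumes the common convention that a query counts as a hit when at least one of its ground-truth matches appears among the K highest-scoring candidates. The function name `recall_at_k`, its arguments, and the toy scores are hypothetical and are not taken from the BLIP-2 evaluation code.

```python
from typing import Sequence, Set


def recall_at_k(scores: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    scores[q][c] : similarity between query q and candidate c (higher is better)
    relevant[q]  : indices of the candidates that count as correct for query q
    """
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have one entry per query")
    if not scores:
        raise ValueError("at least one query is required")
    hits = 0
    for row, gold in zip(scores, relevant):
        # Rank candidate indices by descending similarity and keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if gold.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (hypothetical, not from any benchmark): 3 queries, 4 candidates.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # query 1: correct candidate 1 is ranked second
    [0.7, 0.6, 0.5, 0.1],  # query 2: correct candidate 3 is ranked last
]
toy_relevant = [{0}, {1}, {3}]

r1 = recall_at_k(toy_scores, toy_relevant, k=1)
r2 = recall_at_k(toy_scores, toy_relevant, k=2)
r4 = recall_at_k(toy_scores, toy_relevant, k=4)
```

On the toy data, one of three queries is a hit at K=1, two at K=2, and all three at K=4. Using a set of relevant indices per query lets the same function cover both directions: in image-to-text retrieval an image usually has several valid captions, while in text-to-image retrieval each caption has a single matching image.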