| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Image Captioning | COCO Captions | VinVL | BLEU-4 | 41.0 | # 15 |
| Image Captioning | COCO Captions | VinVL | METEOR | 31.1 | # 10 |
| Image Captioning | COCO Captions | VinVL | CIDER | 140.9 | # 16 |
| Image Captioning | COCO Captions | VinVL | SPICE | 25.2 | # 8 |
| Image-text matching | CommercialAdsDataset | VinVL | ADD(S) AUC | 88.56 | # 2 |
| Visual Question Answering (VQA) | GQA Test2019 | Single Model | Accuracy | 64.65 | # 11 |
| Visual Question Answering (VQA) | GQA Test2019 | Single Model | Binary | 82.63 | # 3 |
| Visual Question Answering (VQA) | GQA Test2019 | Single Model | Open | 48.77 | # 14 |
| Visual Question Answering (VQA) | GQA Test2019 | Single Model | Consistency | 94.35 | # 4 |
| Visual Question Answering (VQA) | GQA Test2019 | Single Model | Plausibility | 84.98 | # 25 |
| Visual Question Answering (VQA) | GQA Test2019 | Single Model | Validity | 96.62 | # 7 |
| Visual Question Answering (VQA) | GQA Test2019 | Single Model | Distribution | 4.72 | # 114 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | CIDEr | 92.46 | # 13 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | B1 | 81.59 | # 11 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | B2 | 65.15 | # 11 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | B3 | 45.04 | # 13 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | B4 | 26.15 | # 13 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | ROUGE-L | 56.96 | # 12 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | METEOR | 27.57 | # 13 |
| Image Captioning | nocaps entire | VinVL (Microsoft Cognitive Services + MSR) | SPICE | 13.07 | # 13 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | CIDEr | 97.99 | # 15 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | B1 | 83.24 | # 11 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | B2 | 68.04 | # 12 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | B3 | 49.68 | # 14 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | B4 | 30.62 | # 14 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | ROUGE-L | 58.54 | # 14 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | METEOR | 29.51 | # 13 |
| Image Captioning | nocaps in-domain | VinVL (Microsoft Cognitive Services + MSR) | SPICE | 13.63 | # 14 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | CIDEr | 95.16 | # 13 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | B1 | 82.77 | # 11 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | B2 | 66.94 | # 11 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | B3 | 47.02 | # 13 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | B4 | 27.97 | # 13 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | ROUGE-L | 57.95 | # 13 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | METEOR | 28.24 | # 14 |
| Image Captioning | nocaps near-domain | VinVL (Microsoft Cognitive Services + MSR) | SPICE | 13.36 | # 15 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | CIDEr | 78.01 | # 15 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | B1 | 75.78 | # 14 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | B2 | 56.1 | # 16 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | B3 | 34.02 | # 16 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | B4 | 15.86 | # 17 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | ROUGE-L | 51.99 | # 14 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | METEOR | 23.55 | # 17 |
| Image Captioning | nocaps out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | SPICE | 11.48 | # 15 |
| Image Captioning | nocaps-val-in-domain | VinVL | CIDEr | 103.1 | # 10 |
| Image Captioning | nocaps-val-in-domain | VinVL | SPICE | 14.2 | # 9 |
| Image Captioning | nocaps-val-in-domain | VinVL | Pre-train (#images) | 5.7M | # 5 |
| Image Captioning | nocaps-val-near-domain | VinVL | CIDEr | 96.1 | # 9 |
| Image Captioning | nocaps-val-near-domain | VinVL | SPICE | 13.8 | # 8 |
| Image Captioning | nocaps-val-near-domain | VinVL | Pre-train (#images) | 5.7M | # 6 |
| Image Captioning | nocaps-val-out-domain | VinVL | CIDEr | 88.3 | # 10 |
| Image Captioning | nocaps-val-out-domain | VinVL | SPICE | 12.1 | # 8 |
| Image Captioning | nocaps-val-out-domain | VinVL | Pretrain (#images) | 5.7M | # 6 |
| Image Captioning | nocaps-val-overall | VinVL | CIDEr | 95.5 | # 9 |
| Image Captioning | nocaps-val-overall | VinVL | SPICE | 13.5 | # 8 |
| Image Captioning | nocaps-val-overall | VinVL | Pretrain (#images) | 5.7M | # 6 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs. | overall | 76.63 | # 13 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs. | yes/no | 92.04 | # 6 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs. | number | 61.5 | # 5 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs. | other | 66.68 | # 6 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs., X10 models | overall | 77.45 | # 12 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs., X10 models | yes/no | 92.38 | # 5 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs., X10 models | number | 62.55 | # 4 |
| Visual Question Answering (VQA) | VQA v2 test-std | MSR + MS Cog. Svcs., X10 models | other | 67.87 | # 5 |