| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | UC2 | Accuracy (%) | 57.28 | # 6 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | M3P | Accuracy (%) | 56 | # 8 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | xUNITER | Accuracy (%) | 54.59 | # 9 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | mUNITER | Accuracy (%) | 53.72 | # 11 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | UC2 | Accuracy (%) | 58.32 | # 1 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | xUNITER | Accuracy (%) | 57.46 | # 2 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | mUNITER | Accuracy (%) | 53.41 | # 3 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | M3P | Accuracy (%) | 49.79 | # 4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | mUNITER | Recall@1 (%) | 10.48 | # 1 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | M3P | Recall@1 (%) | 9.98 | # 3 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | xUNITER | Recall@1 (%) | 9.81 | # 4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | UC2 | Recall@1 (%) | 9.09 | # 5 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | mUNITER | Recall@1 (%) | 9.16 | # 2 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | xUNITER | Recall@1 (%) | 8.72 | # 3 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | M3P | Recall@1 (%) | 8.12 | # 4 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | UC2 | Recall@1 (%) | 7.83 | # 5 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 17.89 | # 5 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 13.51 | # 6 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 11.9 | # 7 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.86 | # 8 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 20.31 | # 5 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 14.04 | # 6 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 12.91 | # 7 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.06 | # 8 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 17.59 | # 1 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 13.54 | # 2 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 12.26 | # 3 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 9.32 | # 4 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 19.79 | # 1 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 14.3 | # 2 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 13.21 | # 3 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.54 | # 4 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | UC2 | Accuracy (%) | 29.35 | # 6 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | M3P | Accuracy (%) | 28.17 | # 7 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | xUNITER | Accuracy (%) | 21.72 | # 8 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | mUNITER | Accuracy (%) | 9.97 | # 9 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | UC2 | Accuracy (%) | 42.95 | # 1 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | M3P | Accuracy (%) | 41.04 | # 2 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | xUNITER | Accuracy (%) | 40.68 | # 3 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | mUNITER | Accuracy (%) | 37.21 | # 4 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | UC2 | Accuracy (%) | 62.05 | # 6 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | xUNITER | Accuracy (%) | 58.48 | # 7 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | M3P | Accuracy (%) | 58.25 | # 8 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | mUNITER | Accuracy (%) | 53.69 | # 9 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | UC2 | Accuracy (%) | 63.68 | # 1 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | xUNITER | Accuracy (%) | 60.55 | # 2 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | M3P | Accuracy (%) | 59.36 | # 3 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | mUNITER | Accuracy (%) | 53.95 | # 4 |