IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English language tasks. To fill this gap, we introduce IGLUE, the Image-Grounded Language Understanding Evaluation benchmark. By both aggregating pre-existing datasets and creating new ones, IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance between source and target languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.
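The two transfer regimes evaluated below can be made concrete with a short sketch. The code is a minimal illustration on dummy tensors, not the paper's actual pipeline: `ToyClassifier`, `english_train`, `target_shots`, and `target_test` are all hypothetical placeholders for a multilingual multimodal encoder with a task head, English task data, K target-language examples, and a target-language test set. Zero-shot transfer fine-tunes on English and evaluates directly on the target language; few-shot transfer ("max-shot" being the largest K) additionally fine-tunes on the target-language shots. Translate-test, by contrast, would machine-translate the target-language test set into English before evaluating the English-tuned model.

```python
# Minimal sketch of zero-shot vs. few-shot cross-lingual transfer on dummy
# data. All names below are hypothetical placeholders, not IGLUE's code.
import torch
import torch.nn as nn


class ToyClassifier(nn.Module):
    """Stand-in for a multilingual multimodal encoder plus task head."""

    def __init__(self, dim=32, n_classes=2):
        super().__init__()
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        return self.head(x)


def finetune(model, xs, ys, epochs=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(model(xs), ys).backward()
        opt.step()


def accuracy(model, xs, ys):
    with torch.no_grad():
        return (model(xs).argmax(-1) == ys).float().mean().item()


torch.manual_seed(0)
dim = 32
english_train = (torch.randn(256, dim), torch.randint(0, 2, (256,)))
target_shots = (torch.randn(48, dim), torch.randint(0, 2, (48,)))   # K shots
target_test = (torch.randn(256, dim), torch.randint(0, 2, (256,)))

model = ToyClassifier(dim)
finetune(model, *english_train)            # fine-tune on English task data
zero_shot = accuracy(model, *target_test)  # zero-shot: evaluate directly
finetune(model, *target_shots, epochs=3)   # few-shot: also tune on K shots
few_shot = accuracy(model, *target_test)
print(f"zero-shot: {zero_shot:.3f}  few-shot: {few_shot:.3f}")
```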


Datasets


Introduced in the Paper:

IGLUE

Used in the Paper:

MS COCO, SNLI, Flickr30k, GQA, WIT, MaRVL
| Task | Dataset | Model | Metric | Metric Value | Global Rank |
|------|---------|-------|--------|--------------|-------------|
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | UC2 | Accuracy (%) | 62.05 | #6 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | xUNITER | Accuracy (%) | 58.48 | #7 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | M3P | Accuracy (%) | 58.25 | #8 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | mUNITER | Accuracy (%) | 53.69 | #9 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | UC2 | Accuracy (%) | 63.68 | #1 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | xUNITER | Accuracy (%) | 60.55 | #2 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | M3P | Accuracy (%) | 59.36 | #3 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | mUNITER | Accuracy (%) | 53.95 | #4 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | UC2 | Accuracy (%) | 29.35 | #6 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | M3P | Accuracy (%) | 28.17 | #7 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | xUNITER | Accuracy (%) | 21.72 | #8 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | mUNITER | Accuracy (%) | 9.97 | #9 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | UC2 | Accuracy (%) | 42.95 | #1 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | M3P | Accuracy (%) | 41.04 | #2 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | xUNITER | Accuracy (%) | 40.68 | #3 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | mUNITER | Accuracy (%) | 37.21 | #4 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | UC2 | Accuracy (%) | 57.28 | #6 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | M3P | Accuracy (%) | 56 | #8 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | xUNITER | Accuracy (%) | 54.59 | #9 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | mUNITER | Accuracy (%) | 53.72 | #11 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | UC2 | Accuracy (%) | 58.32 | #1 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | xUNITER | Accuracy (%) | 57.46 | #2 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | mUNITER | Accuracy (%) | 53.41 | #3 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | M3P | Accuracy (%) | 49.79 | #4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 17.89 | #5 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 13.51 | #6 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 11.9 | #7 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.86 | #8 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 17.59 | #1 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 13.54 | #2 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 12.26 | #3 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 9.32 | #4 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 20.31 | #5 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 14.04 | #6 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 12.91 | #7 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.06 | #8 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 19.79 | #1 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 14.3 | #2 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 13.21 | #3 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.54 | #4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | mUNITER | Recall@1 (%) | 10.48 | #1 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | M3P | Recall@1 (%) | 9.98 | #3 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | xUNITER | Recall@1 (%) | 9.81 | #4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | UC2 | Recall@1 (%) | 9.09 | #5 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | mUNITER | Recall@1 (%) | 9.16 | #2 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | xUNITER | Recall@1 (%) | 8.72 | #3 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | M3P | Recall@1 (%) | 8.12 | #4 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | UC2 | Recall@1 (%) | 7.83 | #5 |
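For the retrieval entries above, Recall@1 is the fraction of queries whose top-ranked candidate is the ground-truth match. A minimal sketch of the standard computation, assuming a square similarity matrix in which image i and caption i are the true pair (the scores here are random placeholders, not model outputs):

```python
# Recall@1 for cross-modal retrieval from a similarity matrix.
# scores[i, j] = similarity(image_i, caption_j); pair i matches index i.
import numpy as np

rng = np.random.default_rng(0)
n = 100
scores = rng.standard_normal((n, n))  # placeholder for model similarities

# Image-to-text: for each image (row), is the best caption the true one?
i2t_r1 = (scores.argmax(axis=1) == np.arange(n)).mean() * 100
# Text-to-image: for each caption (column), is the best image the true one?
t2i_r1 = (scores.argmax(axis=0) == np.arange(n)).mean() * 100
print(f"Image-to-text R@1: {i2t_r1:.2f}%  Text-to-image R@1: {t2i_r1:.2f}%")
```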

Methods


No methods listed for this paper.