IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English language tasks. To fill this gap, we introduce IGLUE, the Image-Grounded Language Understanding Evaluation benchmark. By both aggregating pre-existing datasets and creating new ones, IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance between source and target languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.
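The two transfer regimes evaluated below can be made concrete with a short sketch. The code is a minimal illustration on dummy tensors, not the paper's actual pipeline: `ToyClassifier`, `english_train`, `target_shots`, and `target_test` are all hypothetical placeholders for a multilingual multimodal encoder with a task head, English task data, K target-language examples, and a target-language test set. Zero-shot transfer fine-tunes on English and evaluates directly on the target language; few-shot transfer ("max-shot" being the largest K) additionally fine-tunes on the target-language shots. Translate-test, by contrast, would machine-translate the target-language test set into English before evaluating the English-tuned model.

```python
# Minimal sketch of zero-shot vs. few-shot cross-lingual transfer on dummy
# data. All names below are hypothetical placeholders, not IGLUE's code.
import torch
import torch.nn as nn


class ToyClassifier(nn.Module):
    """Stand-in for a multilingual multimodal encoder plus task head."""

    def __init__(self, dim=32, n_classes=2):
        super().__init__()
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        return self.head(x)


def finetune(model, xs, ys, epochs=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(model(xs), ys).backward()
        opt.step()


def accuracy(model, xs, ys):
    with torch.no_grad():
        return (model(xs).argmax(-1) == ys).float().mean().item()


torch.manual_seed(0)
dim = 32
english_train = (torch.randn(256, dim), torch.randint(0, 2, (256,)))
target_shots = (torch.randn(48, dim), torch.randint(0, 2, (48,)))   # K shots
target_test = (torch.randn(256, dim), torch.randint(0, 2, (256,)))

model = ToyClassifier(dim)
finetune(model, *english_train)            # fine-tune on English task data
zero_shot = accuracy(model, *target_test)  # zero-shot: evaluate directly
finetune(model, *target_shots, epochs=3)   # few-shot: also tune on K shots
few_shot = accuracy(model, *target_test)
print(f"zero-shot: {zero_shot:.3f}  few-shot: {few_shot:.3f}")
```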


Datasets


Introduced in the Paper:

IGLUE

Used in the Paper:

MS COCO, SNLI, Flickr30k, GQA, WIT, MaRVL
| Task | Dataset | Model | Metric | Metric Value | Global Rank |
|------|---------|-------|--------|--------------|-------------|
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | UC2 | Accuracy (%) | 62.05 | #6 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | xUNITER | Accuracy (%) | 58.48 | #7 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | M3P | Accuracy (%) | 58.25 | #8 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | mUNITER | Accuracy (%) | 53.69 | #9 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | UC2 | Accuracy (%) | 63.68 | #1 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | xUNITER | Accuracy (%) | 60.55 | #2 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | M3P | Accuracy (%) | 59.36 | #3 |
| Max-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | mUNITER | Accuracy (%) | 53.95 | #4 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | UC2 | Accuracy (%) | 29.35 | #6 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | M3P | Accuracy (%) | 28.17 | #7 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | xUNITER | Accuracy (%) | 21.72 | #8 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | mUNITER | Accuracy (%) | 9.97 | #9 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | UC2 | Accuracy (%) | 42.95 | #1 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | M3P | Accuracy (%) | 41.04 | #2 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | xUNITER | Accuracy (%) | 40.68 | #3 |
| Max-Shot Cross-Lingual Visual Question Answering | xGQA | mUNITER | Accuracy (%) | 37.21 | #4 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | UC2 | Accuracy (%) | 57.28 | #6 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | M3P | Accuracy (%) | 56 | #8 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | xUNITER | Accuracy (%) | 54.59 | #9 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | mUNITER | Accuracy (%) | 53.72 | #11 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | UC2 | Accuracy (%) | 58.32 | #1 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | xUNITER | Accuracy (%) | 57.46 | #2 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | mUNITER | Accuracy (%) | 53.41 | #3 |
| Max-Shot Cross-Lingual Visual Reasoning | MaRVL | M3P | Accuracy (%) | 49.79 | #4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 17.89 | #5 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 13.51 | #6 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 11.9 | #7 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.86 | #8 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 17.59 | #1 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 13.54 | #2 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 12.26 | #3 |
| Max-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 9.32 | #4 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 20.31 | #5 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 14.04 | #6 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 12.91 | #7 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.06 | #8 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | UC2 | Recall@1 (%) | 19.79 | #1 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | xUNITER | Recall@1 (%) | 14.3 | #2 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | M3P | Recall@1 (%) | 13.21 | #3 |
| Max-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | mUNITER | Recall@1 (%) | 8.54 | #4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | mUNITER | Recall@1 (%) | 10.48 | #1 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | M3P | Recall@1 (%) | 9.98 | #3 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | xUNITER | Recall@1 (%) | 9.81 | #4 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | WIT (IGLUE) | UC2 | Recall@1 (%) | 9.09 | #5 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | mUNITER | Recall@1 (%) | 9.16 | #2 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | xUNITER | Recall@1 (%) | 8.72 | #3 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | M3P | Recall@1 (%) | 8.12 | #4 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | WIT (IGLUE) | UC2 | Recall@1 (%) | 7.83 | #5 |
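For the retrieval entries above, Recall@1 is the fraction of queries whose top-ranked candidate is the ground-truth match. A minimal sketch of the standard computation, assuming a square similarity matrix in which image i and caption i are the true pair (the scores here are random placeholders, not model outputs):

```python
# Recall@1 for cross-modal retrieval from a similarity matrix.
# scores[i, j] = similarity(image_i, caption_j); pair i matches index i.
import numpy as np

rng = np.random.default_rng(0)
n = 100
scores = rng.standard_normal((n, n))  # placeholder for model similarities

# Image-to-text: for each image (row), is the best caption the true one?
i2t_r1 = (scores.argmax(axis=1) == np.arange(n)).mean() * 100
# Text-to-image: for each caption (column), is the best image the true one?
t2i_r1 = (scores.argmax(axis=0) == np.arange(n)).mean() * 100
print(f"Image-to-text R@1: {i2t_r1:.2f}%  Text-to-image R@1: {t2i_r1:.2f}%")
```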

Methods


No methods listed for this paper.