IGLUE (Image-Grounded Language Understanding Evaluation)

Introduced by Bugliarello et al. in IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

Benchmarks

Task	Dataset Variant	Best Model
Zero-Shot Cross-Lingual Visual Reasoning	MaRVL	CCLM-X2VLM-large
Zero-Shot Cross-Lingual Visual Natural Language Inference	XVNLI	CCLM-X2VLM-large
Zero-Shot Cross-Lingual Visual Question Answering	xGQA	CCLM-X2VLM-large
Zero-Shot Cross-Lingual Text-to-Image Retrieval	xFlickr&CO	CCLM-X2VLM-large
Zero-Shot Cross-Lingual Image-to-Text Retrieval	xFlickr&CO	CCLM-X2VLM-large
Zero-Shot Cross-Lingual Image-to-Text Retrieval	WIT (IGLUE)	mUNITER
Zero-Shot Cross-Lingual Text-to-Image Retrieval	WIT (IGLUE)	TD-MML
Max-Shot Cross-Lingual Image-to-Text Retrieval	xFlickr&CO	UC2
Max-Shot Cross-Lingual Visual Natural Language Inference	XVNLI	UC2
Max-Shot Cross-Lingual Visual Question Answering	xGQA	UC2
Max-Shot Cross-Lingual Visual Reasoning	MaRVL	UC2
Max-Shot Cross-Lingual Text-to-Image Retrieval	xFlickr&CO	UC2
Zero-Shot Cross-Lingual Transfer	MaRVL	xUNITER