MURAL: Multimodal, Multitask Retrieval Across Languages

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.

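As a concrete illustration of the two-task dual-encoder objective described in the abstract, the sketch below combines an image-text matching loss and a translation-pair matching loss over shared text embeddings. It is a minimal sketch only: the in-batch softmax contrastive loss, the temperature, the task weights, and the stand-in encoder outputs are assumptions for illustration, not MURAL's exact configuration.

```python
# Minimal sketch of a multitask dual-encoder objective (illustrative, not MURAL's exact setup).
import torch
import torch.nn.functional as F


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric in-batch softmax contrastive loss between two batches of
    embeddings of shape [batch, dim]; matching pairs share a row index."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature           # pairwise similarity matrix
    targets = torch.arange(a.size(0))          # positives lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def multitask_step(image_emb, caption_emb, src_text_emb, tgt_text_emb,
                   w_i2t=1.0, w_t2t=1.0):
    """Task 1: image-text matching. Task 2: translation-pair matching.
    Task weights w_i2t / w_t2t are illustrative hyperparameters."""
    loss_i2t = contrastive_loss(image_emb, caption_emb)
    loss_t2t = contrastive_loss(src_text_emb, tgt_text_emb)
    return w_i2t * loss_i2t + w_t2t * loss_t2t


if __name__ == "__main__":
    # Random tensors stand in for image-encoder and text-encoder outputs.
    B, D = 8, 128
    loss = multitask_step(torch.randn(B, D), torch.randn(B, D),
                          torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

Sharing the same text encoder across both losses is what lets translation pairs supplement scarce image-caption data for under-resourced languages, which is the effect the abstract reports on the Wikipedia Image-Text dataset.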

Results from the Paper


Task                            Dataset   Model        Metric      Value        Global Rank
Semantic Image-Text Similarity  CxC       ALIGN-L2     avg ± std   67.6 ± 1.2   # 1
Semantic Image Similarity       CxC       ALIGN-L2     avg ± std   77.2 ± 0.8   # 2
Semantic Textual Similarity     CxC       ALIGN-L2     avg ± std   72.9 ± 0.4   # 4
Semantic Image-Text Similarity  CxC       MURAL-large  avg ± std   67.1 ± 1.3   # 2
Semantic Image Similarity       CxC       MURAL-large  avg ± std   80.4 ± 0.7   # 1
Semantic Textual Similarity     CxC       MURAL-large  avg ± std   74.1 ± 0.4   # 3
Semantic Image-Text Similarity  CxC       DE-T2T+I2T   avg ± std   61.9         # 3
Semantic Image Similarity       CxC       DE-T2T+I2T   avg ± std   74.5 ± 0.9   # 3
Semantic Textual Similarity     CxC       DE-T2T+I2T   avg ± std   74.5 ± 0.4   # 2