Cross-lingual and Multilingual CLIP

The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough when OpenAI released CLIP, a model that determines how well an English text corresponds to a given image with unprecedented accuracy. Trained via a contrastive learning objective over a huge dataset of 400M image-caption pairs, it is not easily replicated, especially for low-resource languages. Capitalizing on the modularization of the CLIP architecture, we propose to use cross-lingual teacher learning to re-train the textual encoder for various non-English languages. Our method requires no image data and relies entirely on machine translation, which removes the need for annotated data in the target language. We find that our method can efficiently train a new textual encoder at relatively low computational cost, while still outperforming previous baselines on multilingual image-text retrieval.
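Below is a minimal sketch of the cross-lingual teacher-learning setup described above, assuming PyTorch, HuggingFace transformers, and OpenAI's clip package. The student model choice (xlm-roberta-base), the CLS-token pooling, and the MSE objective are illustrative assumptions, not the paper's exact configuration: a frozen English CLIP text encoder acts as teacher, and a multilingual student is trained on machine-translated captions to reproduce the teacher's embeddings of the English originals, with no images involved.

```python
# Sketch of cross-lingual teacher learning for CLIP's text encoder.
# Assumptions (not taken from this page): PyTorch, HuggingFace transformers,
# and openai/clip; the student model, pooling, and loss are illustrative.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen teacher: the original (English) CLIP text encoder.
teacher, _ = clip.load("ViT-B/32", device=device)
teacher.eval()

# Trainable student: a multilingual transformer plus a linear head that
# projects into CLIP's text-embedding space.
student_name = "xlm-roberta-base"  # assumption; any multilingual encoder works
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModel.from_pretrained(student_name).to(device)
proj = nn.Linear(student.config.hidden_size, teacher.text_projection.shape[1]).to(device)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)
mse = nn.MSELoss()

def train_step(english_captions, translated_captions):
    """One distillation step: the student, fed machine-translated captions,
    is trained to reproduce the teacher's embeddings of the English captions.
    No image data is required."""
    with torch.no_grad():
        tokens = clip.tokenize(english_captions, truncate=True).to(device)
        target = teacher.encode_text(tokens).float()

    batch = tokenizer(translated_captions, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    hidden = student(**batch).last_hidden_state[:, 0]  # CLS-style pooling (assumption)
    pred = proj(hidden)

    loss = mse(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher stays frozen and only text is processed, each training step is cheap relative to contrastive pre-training on image-text pairs, which is what makes re-training the textual encoder for new languages computationally light.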

Datasets

XTD10

Results from the Paper


Task: Zero-shot Image Retrieval
Dataset: XTD10
Model: M-CLIP (ViT-B32)

Metric          Value   Global Rank
EN-Recall@10    91.8    # 4
ES-Recall@10    89.1    # 4
FR-Recall@10    89.4    # 4
ZH-Recall@10    89.3    # 4
KO-Recall@10    82.1    # 4
RU-Recall@10    86.1    # 4
JA-Recall@10    81.0    # 4
IT-Recall@10    89.8    # 4
