Cross-lingual and Multilingual CLIP

The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough when OpenAI released CLIP, a model that determines how well an English text corresponds to a given image with unprecedented accuracy. Trained via a contrastive learning objective over a huge dataset of 400M image-caption pairs, it is not easily replicated, especially for low-resource languages. Capitalizing on the modularization of the CLIP architecture, we propose to use cross-lingual teacher learning to re-train the textual encoder for various non-English languages. Our method requires no image data and relies entirely on machine translation, which removes the need for annotated data in the target language. We find that our method can efficiently train a new textual encoder at relatively low computational cost, while still outperforming previous baselines on multilingual image-text retrieval.
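Below is a minimal sketch of the cross-lingual teacher-learning setup described above, assuming PyTorch, HuggingFace transformers, and OpenAI's clip package. The student model choice (xlm-roberta-base), the CLS-token pooling, and the MSE objective are illustrative assumptions, not the paper's exact configuration: a frozen English CLIP text encoder acts as teacher, and a multilingual student is trained on machine-translated captions to reproduce the teacher's embeddings of the English originals, with no images involved.

```python
# Sketch of cross-lingual teacher learning for CLIP's text encoder.
# Assumptions (not taken from this page): PyTorch, HuggingFace transformers,
# and openai/clip; the student model, pooling, and loss are illustrative.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen teacher: the original (English) CLIP text encoder.
teacher, _ = clip.load("ViT-B/32", device=device)
teacher.eval()

# Trainable student: a multilingual transformer plus a linear head that
# projects into CLIP's text-embedding space.
student_name = "xlm-roberta-base"  # assumption; any multilingual encoder works
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModel.from_pretrained(student_name).to(device)
proj = nn.Linear(student.config.hidden_size, teacher.text_projection.shape[1]).to(device)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)
mse = nn.MSELoss()

def train_step(english_captions, translated_captions):
    """One distillation step: the student, fed machine-translated captions,
    is trained to reproduce the teacher's embeddings of the English captions.
    No image data is required."""
    with torch.no_grad():
        tokens = clip.tokenize(english_captions, truncate=True).to(device)
        target = teacher.encode_text(tokens).float()

    batch = tokenizer(translated_captions, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    hidden = student(**batch).last_hidden_state[:, 0]  # CLS-style pooling (assumption)
    pred = proj(hidden)

    loss = mse(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher stays frozen and only text is processed, each training step is cheap relative to contrastive pre-training on image-text pairs, which is what makes re-training the textual encoder for new languages computationally light.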

Datasets

XTD10

Results from the Paper


Task: Zero-shot Image Retrieval
Dataset: XTD10
Model: M-CLIP (ViT-B32)

Metric          Value   Global Rank
EN-Recall@10    91.8    # 4
ES-Recall@10    89.1    # 4
FR-Recall@10    89.4    # 4
ZH-Recall@10    89.3    # 4
KO-Recall@10    82.1    # 4
RU-Recall@10    86.1    # 4
JA-Recall@10    81.0    # 4
IT-Recall@10    89.8    # 4
