MURAL: Multimodal, Multitask Retrieval Across Languages

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.

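As a concrete illustration of the two-task dual-encoder objective described in the abstract, the sketch below combines an image-text matching loss and a translation-pair matching loss over shared text embeddings. It is a minimal sketch only: the in-batch softmax contrastive loss, the temperature, the task weights, and the stand-in encoder outputs are assumptions for illustration, not MURAL's exact configuration.

```python
# Minimal sketch of a multitask dual-encoder objective (illustrative, not MURAL's exact setup).
import torch
import torch.nn.functional as F


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric in-batch softmax contrastive loss between two batches of
    embeddings of shape [batch, dim]; matching pairs share a row index."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature           # pairwise similarity matrix
    targets = torch.arange(a.size(0))          # positives lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def multitask_step(image_emb, caption_emb, src_text_emb, tgt_text_emb,
                   w_i2t=1.0, w_t2t=1.0):
    """Task 1: image-text matching. Task 2: translation-pair matching.
    Task weights w_i2t / w_t2t are illustrative hyperparameters."""
    loss_i2t = contrastive_loss(image_emb, caption_emb)
    loss_t2t = contrastive_loss(src_text_emb, tgt_text_emb)
    return w_i2t * loss_i2t + w_t2t * loss_t2t


if __name__ == "__main__":
    # Random tensors stand in for image-encoder and text-encoder outputs.
    B, D = 8, 128
    loss = multitask_step(torch.randn(B, D), torch.randn(B, D),
                          torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

Sharing the same text encoder across both losses is what lets translation pairs supplement scarce image-caption data for under-resourced languages, which is the effect the abstract reports on the Wikipedia Image-Text dataset.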

Results from the Paper


Task                            Dataset   Model        Metric      Value        Global Rank
Semantic Image-Text Similarity  CxC       ALIGN-L2     avg ± std   67.6 ± 1.2   # 1
Semantic Image Similarity       CxC       ALIGN-L2     avg ± std   77.2 ± 0.8   # 2
Semantic Textual Similarity     CxC       ALIGN-L2     avg ± std   72.9 ± 0.4   # 4
Semantic Image-Text Similarity  CxC       MURAL-large  avg ± std   67.1 ± 1.3   # 2
Semantic Image Similarity       CxC       MURAL-large  avg ± std   80.4 ± 0.7   # 1
Semantic Textual Similarity     CxC       MURAL-large  avg ± std   74.1 ± 0.4   # 3
Semantic Image-Text Similarity  CxC       DE-T2T+I2T   avg ± std   61.9         # 3
Semantic Image Similarity       CxC       DE-T2T+I2T   avg ± std   74.5 ± 0.9   # 3
Semantic Textual Similarity     CxC       DE-T2T+I2T   avg ± std   74.5 ± 0.4   # 2