Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

26 Dec 2018 · Mikel Artetxe · Holger Schwenk

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora...

| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Cross-Lingual Bitext Mining | BUCC Chinese-to-English | Massively Multilingual Sentence Embeddings | F1 score | 92.27 | #1 |
| Cross-Lingual Bitext Mining | BUCC French-to-English | Massively Multilingual Sentence Embeddings | F1 score | 93.91 | #1 |
| Cross-Lingual Bitext Mining | BUCC German-to-English | Massively Multilingual Sentence Embeddings | F1 score | 96.19 | #1 |
| Cross-Lingual Bitext Mining | BUCC Russian-to-English | Massively Multilingual Sentence Embeddings | F1 score | 93.3 | #1 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Chinese | Massively Multilingual Sentence Embeddings | Accuracy | 71.93 | #5 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-French | Massively Multilingual Sentence Embeddings | Accuracy | 77.95 | #3 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-German | Massively Multilingual Sentence Embeddings | Accuracy | 84.78 | #3 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Italian | Massively Multilingual Sentence Embeddings | Accuracy | 69.43 | #2 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Japanese | Massively Multilingual Sentence Embeddings | Accuracy | 60.3 | #3 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Russian | Massively Multilingual Sentence Embeddings | Accuracy | 67.78 | #3 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Spanish | Massively Multilingual Sentence Embeddings | Accuracy | 77.33 | #3 |
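
The bitext mining results above are reported as F1 scores. As a rough illustration of how such a score can be computed, the sketch below compares a set of mined sentence pairs against a gold set. The function name `mining_f1`, the pair representation as `(source_id, target_id)` tuples, and the sample IDs are all hypothetical; this is not the official BUCC scoring script, only the standard precision/recall/F1 definition applied to toy data.

```python
def mining_f1(predicted, gold):
    """Return (precision, recall, F1) in percent for mined sentence pairs.

    Both arguments are iterables of (source_id, target_id) tuples.
    The pair format is an assumption made for this sketch.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # pairs that were mined and are in the gold set
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return 100 * precision, 100 * recall, 100 * f1


# Toy data (hypothetical IDs): 4 gold pairs, 3 mined pairs, 2 of them correct.
gold_pairs = {("de-1", "en-1"), ("de-2", "en-2"), ("de-3", "en-3"), ("de-4", "en-4")}
mined_pairs = {("de-1", "en-1"), ("de-2", "en-2"), ("de-3", "en-9")}

precision, recall, f1 = mining_f1(mined_pairs, gold_pairs)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=66.67 R=50.00 F1=57.14
```

Because F1 is the harmonic mean of precision and recall, a system cannot reach a high score by mining very few high-confidence pairs or by mining many noisy ones; both quantities have to be high.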