Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

26 Dec 2018 · Mikel Artetxe, Holger Schwenk

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora...

| Task | Dataset | Model | Metric | Value | Global rank |
|---|---|---|---|---|---|
| Cross-Lingual Bitext Mining | BUCC French-to-English | Massively Multilingual Sentence Embeddings | F1 score | 93.91 | #1 |
| Cross-Lingual Bitext Mining | BUCC German-to-English | Massively Multilingual Sentence Embeddings | F1 score | 96.19 | #1 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-French | Massively Multilingual Sentence Embeddings | Accuracy | 77.95% | #1 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-German | Massively Multilingual Sentence Embeddings | Accuracy | 84.78% | #1 |
| Cross-Lingual Document Classification | MLDoc Zero-Shot English-to-Spanish | Massively Multilingual Sentence Embeddings | Accuracy | 77.33% | #1 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-French | BiLSTM | Accuracy | 71.9% | #1 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-German | BiLSTM | Accuracy | 72.6% | #1 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-Spanish | BiLSTM | Accuracy | 72.9% | #2 |
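
The bitext mining rows are scored by F1. As a rough illustration of what that metric measures when sentences from two languages live in one joint embedding space, the sketch below mines pairs by cosine nearest neighbour and scores them against a gold set. Everything in it is an assumption made for illustration: the 3-dimensional toy vectors, the helper names (`cosine`, `mine_pairs`, `f1_score`), the fixed threshold, and the plain nearest-neighbour rule, which is not necessarily the scoring used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold):
    """For each source sentence, keep its nearest target if similar enough."""
    pairs = set()
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.add((i, j))
    return pairs


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over mined pairs."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy embeddings standing in for encoder outputs.
french = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
english = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.6, 0.6, 0.5]]
gold = {(0, 0), (1, 1), (2, 2)}

predicted = mine_pairs(french, english, threshold=0.8)
score = f1_score(predicted, gold)
print(sorted(predicted), round(score, 2))
```

In this toy setup the third pair falls below the threshold, so precision is 1.0, recall is 2/3, and the F1 is 0.8. Lowering the threshold would recover that pair at the risk of admitting false matches, which is the trade-off F1 summarizes.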