XD: Cross-lingual Knowledge Distillation for Polyglot Sentence Embeddings

25 Sep 2019 · Maksym Del, Mark Fishel

Current state-of-the-art results in multilingual natural language inference (NLI) are based on tuning XLM (a pre-trained polyglot language model) separately for each language involved, resulting in multiple models. We reach significantly higher NLI results with a single model for all languages via multilingual tuning. Furthermore, we introduce cross-lingual knowledge distillation (XD), where the same polyglot model is used both as teacher and student across languages to improve its sentence representations without using end-task labels. Used alone, XD beats multilingual tuning for some languages, and combining the two yields a new state of the art of 79.2% on the XNLI dataset, surpassing the previous result by an absolute 2.5%. The models and code for reproducing our experiments will be made publicly available after de-anonymization.
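The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what a label-free cross-lingual distillation step could look like: a frozen copy of the polyglot encoder acts as teacher, the trainable copy acts as student, and the student's embedding of a sentence in one language is pulled toward the teacher's embedding of a parallel sentence in another language. The MSE loss, the use of parallel sentence pairs, the initialization of teacher and student from the same weights, and the `xd_step` helper are all assumptions, not the authors' stated method.

```python
# Hypothetical sketch of cross-lingual knowledge distillation (XD).
# Assumptions (not from the paper): MSE loss, parallel sentence pairs,
# teacher = frozen copy of the same pre-trained polyglot encoder.
import copy
import torch
import torch.nn as nn


def xd_step(student: nn.Module,
            teacher: nn.Module,
            src_batch: torch.Tensor,   # token ids of sentences in language A
            tgt_batch: torch.Tensor,   # token ids of parallel sentences in language B
            optimizer: torch.optim.Optimizer) -> float:
    """One distillation step; no end-task (NLI) labels are used."""
    teacher.eval()
    with torch.no_grad():
        target = teacher(src_batch)     # teacher sentence embedding, language A
    pred = student(tgt_batch)           # student sentence embedding, language B
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage sketch (hypothetical): `encoder` stands for a pre-trained polyglot
# sentence encoder (e.g. XLM) that returns fixed-size sentence vectors.
# encoder = ...
# teacher = copy.deepcopy(encoder)          # frozen teacher
# student = encoder                         # trainable student, same initial weights
# optimizer = torch.optim.Adam(student.parameters(), lr=1e-5)
# for src_ids, tgt_ids in parallel_dataloader:
#     xd_step(student, teacher, src_ids, tgt_ids, optimizer)
```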
