Does Transliteration Help Multilingual Language Modeling?

29 Jan 2022 · Ibraheem Muhammad Moosa, Mahmud Elahi Akhter, Ashfia Binte Habib

Script diversity presents a challenge to Multilingual Language Models (MLLMs) by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different scripts to a common script may improve the downstream task performance of MLLMs. We empirically measure the effect of transliteration on MLLMs in this context. We specifically focus on the Indic languages, which have the highest script diversity in the world, and we evaluate our models on the IndicGLUE benchmark. We perform the Mann-Whitney U test to rigorously verify whether the effect of transliteration is statistically significant. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity of the models using centered kernel alignment on parallel sentences from the FLORES-101 dataset. We find that, for parallel sentences across different languages, the transliteration-based model learns sentence representations that are more similar.
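The abstract names the Mann-Whitney U test as the significance check. The following is only a minimal sketch of how such a test can be computed, not the authors' code: it assumes a normal approximation with tie correction and no continuity correction, and the function name `mann_whitney_u` and the two score lists are hypothetical placeholders rather than results from the paper.

```python
import math


def mann_whitney_u(xs, ys):
    """Return (U statistic for xs, two-sided p-value).

    Assumptions of this sketch: normal approximation, average ranks
    for ties, tie-corrected variance, no continuity correction.
    """
    n1, n2 = len(xs), len(ys)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold equal values: give them the mean of ranks i+1..j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0

    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        # All observations are identical, so there is no evidence of a difference.
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided


# Hypothetical accuracies from repeated fine-tuning runs (made-up numbers).
scores_a = [0.71, 0.74, 0.73, 0.76, 0.72]
scores_b = [0.69, 0.70, 0.72, 0.68, 0.71]
u_stat, p_value = mann_whitney_u(scores_a, scores_b)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

With very small samples the normal approximation is rough, so an exact test (or a library implementation that offers one) may be preferable in practice.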

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| News Classification | BBC Hindi News Article Classification | xlmindic-base-multiscript | Accuracy | 77.28 | #2 |
| News Classification | BBC Hindi News Article Classification | xlmindic-base-uniscript | Accuracy | 79.14 | #1 |
| Sentiment Analysis | IITP Movie Reviews Sentiment | xlmindic-base-uniscript | Accuracy | 66.34 | #1 |
| Sentiment Analysis | IITP Movie Reviews Sentiment | xlmindic-base-multiscript | Accuracy | 65.91 | #2 |
| Sentiment Analysis | IITP Product Reviews Sentiment | xlmindic-base-uniscript | Accuracy | 77.18 | #2 |
| Sentiment Analysis | IITP Product Reviews Sentiment | xlmindic-base-multiscript | Accuracy | 76.33 | #3 |
| Multiple Choice Question Answering (MCQA) | IndicGLUE WSTP Pa | xlmindic-base-multiscript | Accuracy | 74.33 | #3 |
| Multiple Choice Question Answering (MCQA) | IndicGLUE WSTP Pa | xlmindic-base-uniscript | Accuracy | 77.55 | #1 |
| News Classification | Soham News Article Classification | xlmindic-base-multiscript | Accuracy | 93.22 | #2 |
| News Classification | Soham News Article Classification | xlmindic-base-uniscript | Accuracy | 93.89 | #1 |

Methods