BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding

In this paper, we introduce the "Embedding Barrier", a phenomenon that limits the monolingual performance of multilingual models on low-resource languages with unique typologies. We build BanglaBERT, a Bangla language model pretrained on 18.6 GB of Internet-crawled data, and benchmark it on five standard NLU tasks... We observe that the state-of-the-art multilingual model (XLM-R) performs significantly worse than BanglaBERT and, through comprehensive experiments, attribute this gap to the Embedding Barrier. We identify that a multilingual model's performance on a low-resource language suffers when the language's writing script is not similar to that of any high-resource language. To tackle the barrier, we propose a straightforward solution: transcribing languages to a common script, which effectively improves the performance of a multilingual model on Bangla. As a by-product of the standard NLU benchmarks, we introduce a new downstream dataset for natural language inference (NLI) and show that BanglaBERT outperforms previous state-of-the-art results on all tasks by up to 3.5%. We are making the BanglaBERT language model and the new Bangla NLI dataset publicly available in the hope of advancing the community. The resources can be found at \url{}.
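
To make the idea of transcribing text to a common script concrete, the following is a minimal sketch of character-level transliteration from Bangla to Latin script. The mapping table `BANGLA_TO_LATIN` and the function `transliterate` are hypothetical and purely illustrative; the abstract does not specify the transcription scheme, and this toy table ignores inherent vowels, conjuncts, and most of the alphabet.

```python
# Hypothetical, deliberately tiny mapping table for illustration only.
# It is NOT the transcription scheme used in the paper.
BANGLA_TO_LATIN = {
    "\u09ac": "b",   # BENGALI LETTER BA
    "\u09b2": "l",   # BENGALI LETTER LA
    "\u09be": "a",   # BENGALI VOWEL SIGN AA
    "\u0982": "ng",  # BENGALI SIGN ANUSVARA
}


def transliterate(text: str) -> str:
    """Map each character through the table; unmapped characters pass through unchanged."""
    return "".join(BANGLA_TO_LATIN.get(ch, ch) for ch in text)


# The Bangla word for "Bangla", written with Unicode escapes.
word = "\u09ac\u09be\u0982\u09b2\u09be"
print(transliterate(word))  # prints "bangla"
```

A realistic pipeline would need a complete, linguistically informed mapping. The point of the sketch is only that, after transcription, the text is written in characters that a multilingual model has also seen in high-resource languages.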
