To construct our multilingual dataset, mBBC, we gathered news articles from BBC news websites in 43 languages. We chose these languages because the BBC broadcasts news in each of them, giving global coverage across continents and spanning a diverse range of language families, scripts, resource levels, and word orders.
The data covers language families such as Indo-European, Sino-Tibetan, Niger-Congo, Austronesian, and Dravidian, and scripts such as Latin, Cyrillic, Arabic, Devanagari, and Chinese characters. This breadth supports the evaluation of multilingual language models across different linguistic contexts. The dataset includes both high-resource languages such as English, Spanish, and French, which benefit from extensive linguistic resources, and low-resource languages such as Somali, Burmese, an