4 dataset results for Multilingual NLP

Belebele

Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. It enables the evaluation of monolingual and multilingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was designed to produce questions that discriminate between different levels of generalizable language comprehension, and it is reinforced by extensive quality checks. Although every question relates directly to its passage, the English dataset on its own is difficult enough to challenge state-of-the-art language models. Because the dataset is fully parallel, model performance can be compared directly across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.

19 PAPERS • NO BENCHMARKS YET
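
As a rough illustration of how a fully parallel, four-option dataset such as Belebele supports per-language comparison, the sketch below scores a model separately for each language variant. The `Item` record, its field names, the language codes, and the `predict` callback are all hypothetical and chosen for illustration; they are not the dataset's actual schema or loader.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Item:
    """One multiple-choice question (hypothetical layout, not the real schema)."""
    lang: str                 # language-variant label, e.g. "eng"
    passage: str              # short passage the question refers to
    question: str
    answers: Tuple[str, ...]  # the four candidate answers
    correct: int              # 0-based index of the right answer


def accuracy_by_language(
    items: Iterable[Item], predict: Callable[[Item], int]
) -> Dict[str, float]:
    """Return the fraction of correctly answered questions per language."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for item in items:
        if len(item.answers) != 4:
            raise ValueError("each question is expected to have four answers")
        totals[item.lang] += 1
        if predict(item) == item.correct:
            hits[item.lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Toy data: the same two questions in two language variants (placeholder text).
toy_items = [
    Item("eng", "passage 1", "question 1", ("a", "b", "c", "d"), 0),
    Item("eng", "passage 2", "question 2", ("a", "b", "c", "d"), 1),
    Item("fra", "passage 1 (fr)", "question 1 (fr)", ("a", "b", "c", "d"), 0),
    Item("fra", "passage 2 (fr)", "question 2 (fr)", ("a", "b", "c", "d"), 0),
]


def always_first(item: Item) -> int:
    """Trivial baseline that always picks the first answer."""
    return 0


scores = accuracy_by_language(toy_items, always_first)
print(scores)  # {'eng': 0.5, 'fra': 1.0}
```

Because every language variant would contain the same questions, the per-language scores can be read side by side; a real evaluation would replace `always_first` with a call to the model under test.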

Duolingo STAPLE Shared Task

This is the dataset for the 2020 Duolingo shared task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). It provides sentence prompts, automatic translations, and high-coverage sets of translation paraphrases weighted by user response, in 5 language pairs. Starter code for this task can be found here: github.com/duolingo/duolingo-sharedtask-2020/. More details on the dataset and task are available at: sharedtask.duolingo.com

10 PAPERS • NO BENCHMARKS YET
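
To make the idea of paraphrase sets "weighted by user response" in the STAPLE entry more concrete, here is a minimal sketch of a weighted recall score: the share of total gold weight covered by a system's proposed translations. This is an illustrative assumption, not the shared task's official scoring code, and the function name, data layout, and example sentences are hypothetical.

```python
from typing import Dict, Iterable


def weighted_recall(gold: Dict[str, float], predicted: Iterable[str]) -> float:
    """Fraction of the gold weight covered by the predicted translations.

    gold maps each accepted paraphrase to its weight (hypothetical layout);
    predicted is the collection of translations proposed by a system.
    """
    total = sum(gold.values())
    if total == 0:
        return 0.0
    # Count each distinct prediction once, and only if it is an accepted paraphrase.
    covered = sum(gold[p] for p in set(predicted) if p in gold)
    return covered / total


# Hypothetical weighted paraphrase set for a single prompt.
gold_paraphrases = {
    "i am going home": 0.5,
    "i'm going home": 0.25,
    "i go home": 0.25,
}
system_output = ["i am going home", "i go home", "i am home"]

score = weighted_recall(gold_paraphrases, system_output)
print(score)  # 0.75
```

Under this sketch, a system is rewarded more for producing the paraphrases that learners use most often than for rare ones, which is the intuition behind attaching weights to the accepted translations.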

HumSet

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can benefit greatly from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as the classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe

2 PAPERS • NO BENCHMARKS YET

mBBC dataset (Multilingual BBC news)

To construct our multilingual dataset, mBBC, we gathered news articles from various BBC news websites in 43 different languages. This selection was based on the fact that the BBC broadcasts news in these 43 languages, providing global coverage across continents and spanning a diverse range of language families, scripts, resource levels, and word orders, which ensures a comprehensive representation of linguistic diversity. We collected data from various language families such as Indo-European, Sino-Tibetan, Niger-Congo, Austronesian, Dravidian, and more, encompassing several scripts like Latin, Cyrillic, Arabic, Devanagari, Chinese characters, and others. This extensive representation facilitates a comprehensive evaluation of multilingual language models across different linguistic contexts. Moreover, the dataset includes both high-resource languages like English, Spanish, and French, benefiting from extensive linguistic resources, as well as low-resource languages such as Somali, Burmese, an

1 PAPER • NO BENCHMARKS YET
