Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
19 PAPERS • NO BENCHMARKS YET
An open, broad-coverage corpus for named entity recognition in informal Persian, collected from Twitter.
3 PAPERS • NO BENCHMARKS YET
Despite recent advances in vision-and-language tasks, most progress is still focused on resource-rich languages such as English. Furthermore, widely used vision-and-language datasets directly adopt images representative of American or European cultures, resulting in bias. Hence we introduce ParsVQA-Caps, the first benchmark in Persian for Visual Question Answering and Image Captioning tasks. We collect the data for each task in two ways: human-based and template-based for VQA, and human-based and web-based for image captioning. The image captioning dataset consists of over 7.5k images and about 9k captions. The VQA dataset consists of almost 11k images and 28.5k question-answer pairs with short and long answers, usable for both classification and generation VQA.
1 PAPER • NO BENCHMARKS YET
PersianQA: a dataset for Persian Question Answering. The Persian Question Answering (PersianQA) dataset is a reading comprehension dataset built on Persian Wikipedia. The crowd-sourced dataset consists of more than 9,000 entries. Each entry is either an impossible-to-answer question or a question with one or more answers that are spans of the passage (the context) from which the questioner posed the question. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be utilized to create a system which "knows that it doesn't know the answer".
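As a rough illustration of the entry structure described above, the following Python sketch models a SQuAD2.0-style entry and separates answerable from unanswerable questions. The field names (`question`, `context`, `answers`, `answer_start`) and the toy English entries are assumptions made for illustration only; they are not taken from the PersianQA release, whose actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Answer:
    text: str          # answer string, expected to be a span of the context
    answer_start: int  # character offset of the span in the context (assumed field)


@dataclass
class Entry:
    question: str
    context: str
    answers: List[Answer] = field(default_factory=list)

    @property
    def is_impossible(self) -> bool:
        # Assumption: an entry with no answers is an unanswerable question.
        return not self.answers


def spans_are_valid(entry: Entry) -> bool:
    """Check that every answer is found in the context at its stated offset."""
    return all(
        entry.context[a.answer_start:a.answer_start + len(a.text)] == a.text
        for a in entry.answers
    )


def split_entries(entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Return (answerable, unanswerable) entries, preserving order."""
    answerable = [e for e in entries if not e.is_impossible]
    unanswerable = [e for e in entries if e.is_impossible]
    return answerable, unanswerable


# Toy entries (hypothetical, in English for readability).
SAMPLE = [
    Entry(
        question="What is the capital of Iran?",
        context="Tehran is the capital of Iran.",
        answers=[Answer(text="Tehran", answer_start=0)],
    ),
    Entry(
        question="What is the population of Shiraz?",
        context="Tehran is the capital of Iran.",
    ),
]

answerable, unanswerable = split_entries(SAMPLE)
```

A system that "knows that it doesn't know" would be expected to abstain on the entries in `unanswerable` and to return a span of the context for those in `answerable`.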
ShortPersianEmo is a new dataset for emotion recognition in Persian short texts. It is a single-label dataset that contains 5472 short Persian texts collected from Twitter and Digikala. The dataset is annotated according to Rachael Jack’s emotional model with five emotion classes: happiness, sadness, anger, fear, and other. Unlike publicly accessible datasets that do not impose any restrictions on text length, ShortPersianEmo specifically focuses on short texts. The average text length in the ShortPersianEmo dataset is 56 words. Table 1 presents a comparison between the introduced ShortPersianEmo dataset and other datasets from the literature for emotion detection in Persian text. For more information on this dataset, please read our paper. If you use this dataset in any research work, please cite our paper.
1 PAPER • 1 BENCHMARK
A modification of the ShEMO dataset, made with the help of an Automatic Speech Recognition (ASR) system.