The football keyword dataset (FKD), as a new keyword spotting dataset in Persian, is collected with crowdsourcing. This dataset contains nearly 31000 samples in 18 classes.
1 PAPER • 2 BENCHMARKS
HengamCopus is a Persian corpus with temporal tags (BIO standard tagging scheme). This dataset was generated by applying HengamTagger (https://github.com/kargaranamir/parstdex) to a large number of sentences. There are two types of Persian text datasets included in these collections: formal ones (Persian Wikipedia and Hamshahri Corpus), and informal ones (Twitter and HelloKish). In the creation of HengamCorpus, to maximize the diversity of patterns for training and evaluation, they uniformly draw samples from sets of sentences of unique “temporal pattern profile”, presence/absence vector of different temporal patterns within the sentence.
1 PAPER • NO BENCHMARKS YET
An open, broad-coverage corpus for informal Persian named entity recognition was collected from Twitter.
Despite recent advances in vision-and-language tasks, most progress is still focused on resource-rich languages such as English. Furthermore, widespread vision-and-language datasets directly adopt images representative of American or European cultures resulting in bias. Hence we introduce ParsVQA-Caps, the first benchmark in Persian for Visual Question Answering and Image Captioning tasks. We utilize two ways to collect datasets for each task, human-based and template-based for VQA and human-based and web-based for image captioning. The image captioning dataset consists of over 7.5k images and about 9k captions. The VQA dataset consists of almost 11k images and 28.5k question and answer pairs with short and long answers usable for both classification and generation VQA.
PersianQA: a dataset for Persian Question Answering Persian Question Answering (PersianQA) Dataset is a reading comprehension dataset on Persian Wikipedia. The crowd-sourced the dataset consists of more than 9,000 entries. Each entry can be either an impossible-to-answer or a question with one or more answers spanning in the passage (the context) from which the questioner proposed the question. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be utilized to create a system which "knows that it doesn't know the answer".
1 PAPER • 1 BENCHMARK
A modification on the ShEMO dataset with help of an Automatic Speech Recognition (ASR) system.
naab: A ready-to-use plug-and-play corpus for Farsi The biggest cleaned and ready-to-use open-source textual corpus in Farsi. It contains about 130GB of data, 250 million paragraphs, and 15 billion words. The project name is derived from the Farsi word NAAB K which means pure and high grade. We also provide the raw version of the corpus called naab-raw and an easy-to-use preprocessor that can be employed by those who wanted to make a customized corpus.