naab

Introduced by Sabouri et al. in naab: A ready-to-use plug-and-play corpus for Farsi

naab is the biggest cleaned, ready-to-use open-source text corpus in Farsi. It contains about 130GB of data, 250 million paragraphs, and 15 billion words. The project name is derived from the Farsi word ناب (naab), which means pure and high-grade. We also provide the raw version of the corpus, called naab-raw, and an easy-to-use preprocessor for those who want to build a customized corpus.
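To give a feel for the scale of the figures quoted above, the short Python sketch below derives a few rough averages from them. It assumes that "130GB" means decimal gigabytes (10^9 bytes) and that the three figures are rounded. The function name `corpus_averages` is made up for this illustration and is not part of any naab tooling.

```python
# Rounded figures quoted in the corpus description.
SIZE_BYTES = 130 * 10**9   # "about 130GB", assuming decimal gigabytes
PARAGRAPHS = 250 * 10**6   # 250 million paragraphs
WORDS = 15 * 10**9         # 15 billion words


def corpus_averages(size_bytes, paragraphs, words):
    """Return back-of-the-envelope averages derived from the headline totals."""
    return {
        "words_per_paragraph": words / paragraphs,
        "bytes_per_word": size_bytes / words,
        "bytes_per_paragraph": size_bytes / paragraphs,
    }


averages = corpus_averages(SIZE_BYTES, PARAGRAPHS, WORDS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Because the inputs are rounded, the results are only order-of-magnitude estimates: roughly 60 words per paragraph and a little under 9 bytes per word.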

License

Unknown

Languages

Persian
Iranian Persian
