CATT (CATT Arabic Diacritization Benchmark Dataset)

Introduced by Alasmary et al. in CATT: Character-based Arabic Tashkeel Transformer

The CATT benchmark dataset comprises 742 sentences scraped from an internet news source in 2023. It covers multiple topics, including science and technology, economics, politics, sports, arts, and culture. The text was manually diacritized by two expert native Arabic speakers and then validated by a third expert. The dataset contains names of people and places in both Arabic and English; English names are written in Arabic letters and diacritized according to their pronunciation. Numbers are written out as words rather than digits, so models can be evaluated without a text normalizer (TN).
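
Because each sentence is fully diacritized, an undiacritized model input can be derived from the gold text by stripping the diacritic marks, and a prediction can then be compared with the gold sentence letter by letter. The sketch below illustrates that idea only; it is not the authors' evaluation script. The function names are hypothetical, the set of marks is assumed to be U+064B through U+0652, and the error rate shown is one simple variant (diacritic error rate definitions differ in which characters they count).

```python
import re

# Assumption: "diacritics" here means the eight marks U+064B-U+0652
# (tanween, fatha, damma, kasra, shadda, sukun). Other marks, such as
# the dagger alif (U+0670), are not handled in this sketch.
TASHKEEL = re.compile("[\u064B-\u0652]")
# Assumption: only Arabic letters U+0621-U+064A are scored.
ARABIC_LETTER = re.compile("[\u0621-\u064A]")


def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving the bare letters as model input."""
    return TASHKEEL.sub("", text)


def split_diacritics(text: str) -> list[tuple[str, str]]:
    """Pair each base character with the diacritics that follow it."""
    pairs: list[tuple[str, str]] = []
    for ch in text:
        if TASHKEEL.fullmatch(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(gold: str, pred: str) -> float:
    """Fraction of Arabic letters whose predicted diacritics differ from gold."""
    gold_pairs = split_diacritics(gold)
    pred_pairs = split_diacritics(pred)
    if [b for b, _ in gold_pairs] != [b for b, _ in pred_pairs]:
        raise ValueError("gold and prediction differ in their base characters")

    total = errors = 0
    for (base, g_marks), (_, p_marks) in zip(gold_pairs, pred_pairs):
        if not ARABIC_LETTER.fullmatch(base):
            continue  # skip spaces, punctuation, and other non-letters
        total += 1
        # Sort so that "shadda + vowel" and "vowel + shadda" compare equal.
        if sorted(g_marks) != sorted(p_marks):
            errors += 1
    return errors / total if total else 0.0
```

A typical use would be to call `strip_tashkeel` on each gold sentence, pass the result to the model under test, and score the output with `diacritic_error_rate(gold, prediction)`.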
