The CATT benchmark dataset comprises 742 sentences scraped from an internet news source in 2023. It covers multiple topics, including science and technology, economics, politics, sports, arts, and culture. The sentences were manually diacritized by two expert native Arabic speakers and then validated by a third expert. The dataset contains names of people and places in both Arabic and English; English names are written in Arabic letters and diacritized according to their pronunciation. Numbers in the sentences are written out as words rather than digits, so models can be evaluated without a text normalizer (TN).
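Because the sentences are distributed in diacritized form, one plausible way to use them for evaluation is to strip the diacritics to obtain the model input and keep the original sentence as the reference. The following is a minimal sketch of that stripping step, not the benchmark's official tooling: the function name `strip_diacritics` is hypothetical, and the set of marks removed (U+064B to U+0652 plus the superscript alef U+0670) is an assumption about which characters count as diacritics.

```python
import re

# Assumed diacritic set: fathatan through sukun (U+064B-U+0652)
# plus superscript alef (U+0670). Adjust if the benchmark uses other marks.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with the assumed Arabic diacritic marks removed."""
    return _DIACRITICS.sub("", text)


# Illustrative word (not taken from the dataset): kaf, ta, ba, each followed by a fatha.
gold = "\u0643\u064E\u062A\u064E\u0628\u064E"
model_input = strip_diacritics(gold)  # the bare letters kaf, ta, ba
```

Since numbers are already written out as words, the stripped text can be passed to a diacritization model directly, with no normalization step in between.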