Multi-components System for Automatic Arabic Diacritization

8 Apr 2020 · Hamza Abbad, Shengwu Xiong

In this paper, we propose an approach to the automatic restoration of Arabic diacritics that stacks three components in a pipeline: a deep learning model, which is a multi-layer recurrent neural network with LSTM and Dense layers; a character-level rule-based corrector, which applies deterministic operations to prevent some errors; and a word-level statistical corrector, which uses context and distance information to fix some diacritization issues. The approach is novel in that it combines methods of different types and adds edit-distance-based corrections. We used a large public dataset of raw diacritized Arabic text (Tashkeela) for training and testing our system, after cleaning and normalizing it. On a newly released benchmark test set, our system outperformed all the tested systems, achieving a diacritic error rate (DER) of 3.39% and a word error rate (WER) of 9.94% when taking all Arabic letters into account, and a DER of 2.61% and a WER of 5.83% when ignoring the diacritization of the last letter of every word.
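To make the idea of an edit-distance-based word-level correction concrete, here is a minimal Python sketch. It is not the authors' corrector: the abstract does not specify how candidates are collected or how context is used, so the `known_forms` lookup table, the function names, and the tie-breaking behavior are all assumptions made for illustration. The sketch replaces a predicted diacritized word with the closest known diacritized form of the same undiacritized word, measured by Levenshtein distance.

```python
# Hypothetical illustration only; names and data structures are assumptions,
# not the paper's implementation.

# Common Arabic diacritics: tanween (3), fatha, damma, kasra, shadda, sukun.
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def strip_diacritics(word):
    """Remove diacritic marks, keeping only the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def correct_word(predicted, known_forms):
    """Return the known diacritized form closest to `predicted`.

    `known_forms` (assumed) maps an undiacritized word to the diacritized
    variants seen in some training corpus. Unknown words pass through.
    """
    candidates = known_forms.get(strip_diacritics(predicted), [])
    if not candidates:
        return predicted
    return min(candidates, key=lambda c: edit_distance(predicted, c))


# Usage: the model output differs from a known form in its final diacritic.
kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"   # k-a t-a b-a
kutubun = "\u0643\u064f\u062a\u064f\u0628\u064c"  # k-u t-u b-un
predicted = "\u0643\u064e\u062a\u064e\u0628\u064f"  # k-a t-a b-u
known = {strip_diacritics(kataba): [kataba, kutubun]}
corrected = correct_word(predicted, known)
```

A real word-level corrector would also need the context information the abstract mentions; this sketch uses distance alone.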

Task                        Dataset    Model  Metric Name           Metric Value  Global Rank
Arabic Text Diacritization  Tashkeela  MC     Diacritic Error Rate  0.0339        #5
Arabic Text Diacritization  Tashkeela  MC     Word Error Rate (WER) 0.0994        #5
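The metric values above are fractions, so 0.0339 corresponds to the 3.39% DER and 0.0994 to the 9.94% WER quoted in the abstract. As a rough sketch of how such rates can be computed, the Python below counts a diacritic error for each letter whose diacritics differ from the reference, and a word error for each word containing at least one such letter. It is a simplification under stated assumptions: it counts every non-diacritic character in a word rather than only Arabic letters, it requires both texts to share the same base letters, and the function names are hypothetical rather than taken from the benchmark's scoring script.

```python
# Hypothetical illustration only; not the benchmark's official scorer.

# Common Arabic diacritics: tanween (3), fatha, damma, kasra, shadda, sukun.
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def split_diacritics(word):
    """Split a word into [letter, diacritics] pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding letter
                pairs[-1][1] += ch
        else:
            pairs.append([ch, ""])
    return pairs


def der_wer(reference, hypothesis):
    """Return (DER, WER) as fractions for two whitespace-tokenized texts."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("texts must contain the same number of words")
    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_diacritics(ref_word)
        hyp_pairs = split_diacritics(hyp_word)
        if [p[0] for p in ref_pairs] != [p[0] for p in hyp_pairs]:
            raise ValueError("base letters differ between the two texts")
        # Sort marks so that shadda+vowel order does not count as an error.
        errors = sum(sorted(r[1]) != sorted(h[1])
                     for r, h in zip(ref_pairs, hyp_pairs))
        letters += len(ref_pairs)
        letter_errors += errors
        word_errors += errors > 0
    return letter_errors / letters, word_errors / len(ref_words)


# Usage: two three-letter words, one wrong diacritic on the first word.
reference = ("\u0643\u064e\u062a\u064e\u0628\u064e "
             "\u0643\u064f\u062a\u064f\u0628\u064c")
hypothesis = ("\u0643\u064e\u062a\u064e\u0628\u064f "
              "\u0643\u064f\u062a\u064f\u0628\u064c")
der, wer = der_wer(reference, hypothesis)  # 1 of 6 letters, 1 of 2 words
```

The variant that ignores the last letter of every word could be obtained by dropping the final pair of each word before counting.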

Methods