A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic

WS 2017 · Mohamed Al-Badrashiny, Abdelati Hawwari, Mona Diab

In this paper, we present a system for automatic Arabic text diacritization that uses three levels of analysis granularity in a layered back-off manner. We build and exploit diacritized language models (LMs) for each of three levels of granularity: surface form, morphologically segmented form (prefix/stem/suffix), and character level. In each pass, we use Viterbi search to pick the most probable diacritization for each word in the input. We start with the surface-form LM, back off to the morphological level, and finally fall back on the character-level LM. Our system outperforms all published systems evaluated on the same training and test data. It achieves a word error rate (WER) of 10.87% for full diacritization, covering both lexical and syntactic diacritization, and 3.0% WER for lexical diacritization alone, ignoring syntactic diacritization.
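The layered back-off order described above can be illustrated with a minimal sketch. This is not the authors' implementation: the three language models and the Viterbi search are replaced by toy lookup tables, the morphological level handles only a prefix/stem split (no suffix), and every name and data entry below (`SURFACE`, `PREFIXES`, `STEMS`, `CHAR_DEFAULT`, `diacritize`, the Buckwalter-style strings) is a hypothetical placeholder chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Toy stand-ins for the three diacritized LMs (hypothetical data,
# written in a Buckwalter-style Latin transliteration).
SURFACE = {"ktb": "kataba"}                    # whole-word level
PREFIXES = {"w": "wa"}                         # morphological level: prefixes
STEMS = {"ktb": "kataba", "drs": "darasa"}     # morphological level: stems
CHAR_DEFAULT = {"k": "ka", "t": "ta", "b": "ba"}  # character level


def surface_level(word: str) -> Optional[str]:
    """Level 1: look the whole surface form up; None means 'back off'."""
    return SURFACE.get(word)


def morph_level(word: str) -> Optional[str]:
    """Level 2: try every prefix/stem split (an empty prefix is allowed)."""
    for i in range(len(word)):
        prefix, stem = word[:i], word[i:]
        if stem in STEMS and (prefix == "" or prefix in PREFIXES):
            return PREFIXES.get(prefix, "") + STEMS[stem]
    return None


def char_level(word: str) -> Optional[str]:
    """Level 3: diacritize character by character; unknown characters pass through."""
    return "".join(CHAR_DEFAULT.get(ch, ch) for ch in word)


# Order matters: coarsest level first, finest level last.
LEVELS: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("surface", surface_level),
    ("morph", morph_level),
    ("char", char_level),
]


def diacritize(word: str) -> Tuple[str, str]:
    """Return (diacritized word, name of the level that produced it)."""
    for name, level in LEVELS:
        result = level(word)
        if result is not None:
            return result, name
    return word, "none"


if __name__ == "__main__":
    for w in ["ktb", "wdrs", "tkx"]:
        print(w, diacritize(w))
```

In this sketch a word reaches a finer level only when every coarser level fails to produce an answer, which mirrors the surface, then morphological, then character ordering in the abstract. A real system would score candidates with the LMs at each level instead of doing exact table lookups.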
