Finite State Machine Pattern-Root Arabic Morphological Generator, Analyzer and Diacritizer

LREC 2020  ·  Maha Alkhairy, Afshan Jafri, David Smith

We describe and evaluate the Finite-State Arabic Morphologizer (FSAM), a concatenative (prefix-stem-suffix) and templatic (root-pattern) morphologizer that generates and analyzes undiacritized Modern Standard Arabic (MSA) words, and diacritizes them. Our bidirectional, unified-architecture finite state machine (FSM) is based on morphotactic MSA grammatical rules. The FSM models the root-pattern structure related to semantics and syntax, making it readily scalable, unlike the stem tabulations in prevailing systems. We evaluate the coverage and accuracy of our model, where coverage is the percentage of words in Tashkeela (a large corpus) that can be analyzed. Accuracy is computed against a gold standard, comprising words and properties, created from the intersection of the UD PADT treebank and Tashkeela. Coverage of analysis (extraction of root and properties from a word) is 82%. Accuracy results are: root computed from a word (92%), word generation from a root (100%), non-root properties of a word (97%), and diacritization (84%). FSAM's non-root results match or surpass MADAMIRA's; root result comparisons are not made because of the concatenative nature of publicly available morphologizers.
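
To make the root-pattern idea and the coverage metric concrete, here is a minimal toy sketch in Python. It is not FSAM's implementation: the paper uses a finite state machine over Arabic script, whereas this sketch uses regular expressions over a simplified Latin transliteration. The pattern table, the slot notation (digits 1 to 3 for root consonants), and the function names `generate`, `analyze`, and `coverage` are all assumptions made for illustration.

```python
import re

# Hypothetical toy pattern table (not from the paper). Digits 1-3 mark the
# slots for the three root consonants; other characters are literal.
PATTERNS = {
    "1a2a3a": "perfective verb",
    "1aa2i3": "active participle",
    "ma12uu3": "passive participle",
}


def generate(root, pattern):
    """Interdigitate a triliteral root into a pattern (root -> word)."""
    if len(root) != 3:
        raise ValueError("this toy sketch handles triliteral roots only")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


def _pattern_regex(pattern):
    """Compile a pattern into a regex with one named group per root slot."""
    parts = []
    for ch in pattern:
        if ch in "123":
            # Assumption: a root consonant is any character that is not a
            # short-vowel letter in this transliteration.
            parts.append(f"(?P<c{ch}>[^aiu])")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def analyze(word):
    """Return every (root, pattern, properties) reading of a word."""
    analyses = []
    for pattern, props in PATTERNS.items():
        match = _pattern_regex(pattern).fullmatch(word)
        if match:
            root = match["c1"] + match["c2"] + match["c3"]
            analyses.append((root, pattern, props))
    return analyses


def coverage(words):
    """Fraction of words that receive at least one analysis."""
    if not words:
        return 0.0
    return sum(1 for w in words if analyze(w)) / len(words)
```

For example, `generate("ktb", "ma12uu3")` yields `"maktuub"`, and `analyze("maktuub")` recovers the root `"ktb"` from the same table. Sharing one pattern table between both directions loosely mirrors the bidirectional design described in the abstract, and adding a new root requires no new stem entries.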
