Data-Driven Morphological Analysis and Disambiguation for Morphologically Rich Languages and Universal Dependencies

COLING 2016  ·  Amir More, Reut Tsarfaty ·

Parsing texts into universal dependencies (UD) in realistic scenarios requires infrastructure for the morphological analysis and disambiguation (MA{\&}D) of typologically different languages as a first tier. MA{\&}D is particularly challenging in morphologically rich languages (MRLs), where the ambiguous space-delimited tokens ought to be disambiguated with respect to their constituent morphemes, each morpheme carrying its own tag and a rich set features. Here we present a novel, language-agnostic, framework for MA{\&}D, based on a transition system with two variants {---} word-based and morpheme-based {---} and a dedicated transition to mitigate the biases of variable-length morpheme sequences. Our experiments on a Modern Hebrew case study show state of the art results, and we show that the morpheme-based MD consistently outperforms our word-based variant. We further illustrate the utility and multilingual coverage of our framework by morphologically analyzing and disambiguating the large set of languages in the UD treebanks.

PDF Abstract COLING 2016 PDF COLING 2016 Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here