Modelling and Annotating Interlinear Glossed Text from 280 Different Endangered Languages as Linked Data with LIGT

COLING (LAW) 2020  ·  Sebastian Nordhoff

This paper reports on the harvesting, analysis, and enrichment of 20k documents from 4 different endangered language archives in 300 different low-resource languages. The documents are heterogeneous as to their provenance (holding archive, language, geographical area, creator) and internal structure (annotation types, metalanguages), but they have the ELAN-XML format in common... Typical annotations include sentence-level translations, morpheme segmentation, morpheme-level translations, and parts of speech. The ELAN format gives a lot of freedom to document creators, and hence the data set is very heterogeneous. We use regularities in the ELAN format to arrive at a common internal representation of sentences, words, and morphemes, with translations into one or more additional languages. Building upon the paradigm of Linguistic Linked Open Data (LLOD, Chiarcos, Nordhoff, et al. 2012), the document elements receive unique identifiers and are linked to other resources such as Glottolog for languages, Wikidata for semantic concepts, and the Leipzig Glossing Rules list for category abbreviations. We provide an RDF export in the LIGT format (Chiarcos & Ionov 2019), enabling uniform and interoperable access, with some semantic enrichments, to a formerly disparate and hard-to-access resource type. Two use cases (semantic search and colexification) are presented to show the viability of the approach.
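As a rough illustration of what "using regularities in the ELAN format" could look like, the sketch below reads a tiny ELAN-XML fragment and groups every annotation under the top-level annotation it ultimately refers to. It is not the paper's pipeline. The tier names (`utterance`, `words`, `translation`), the sample text, and the function name `group_by_sentence` are hypothetical; only the element and attribute names (`TIER`, `ANNOTATION`, `ALIGNABLE_ANNOTATION`, `REF_ANNOTATION`, `ANNOTATION_REF`, `ANNOTATION_VALUE`) follow the ELAN file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ELAN fragment; real files also carry
# headers, time slots, and linguistic type definitions.
EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example sentence</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="words" PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>example</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>sentence</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation" PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a4" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>free translation</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def group_by_sentence(eaf_xml):
    """Map each top-level annotation ID to {tier ID: [annotation values]}."""
    root = ET.fromstring(eaf_xml)
    values = {}  # annotation ID -> (tier ID, text)
    parent = {}  # annotation ID -> ID of the annotation it refers to
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ANNOTATION"):
            node = ann[0]  # ALIGNABLE_ANNOTATION or REF_ANNOTATION
            ann_id = node.get("ANNOTATION_ID")
            values[ann_id] = (tier_id, node.findtext("ANNOTATION_VALUE", default=""))
            if node.tag == "REF_ANNOTATION":
                parent[ann_id] = node.get("ANNOTATION_REF")

    def top_level(ann_id):
        # Follow ANNOTATION_REF links until an annotation without a parent is reached.
        while ann_id in parent:
            ann_id = parent[ann_id]
        return ann_id

    sentences = {}
    for ann_id, (tier_id, text) in values.items():
        tiers = sentences.setdefault(top_level(ann_id), {})
        tiers.setdefault(tier_id, []).append(text)
    return sentences


sentences = group_by_sentence(EAF)
print(sentences["a1"]["words"])
```

Because the links are followed transitively, a morpheme tier that refers to the word tier would be grouped under the same sentence. The heterogeneity the abstract describes shows up in exactly the parts this sketch hard-codes: tier names and tier hierarchies differ from archive to archive.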
