59 papers with code • 0 benchmarks • 3 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Especially for morphologically rich languages, it is important to be able to normalize words into their base forms to better support, for example, search engines and linguistic studies. The main difficulties in lemmatization are handling previously unseen words at inference time and disambiguating ambiguous surface forms, which, depending on context, can be inflected variants of several different base forms.
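The two difficulties above can be made concrete with a minimal sketch: a toy dictionary-based lemmatizer keyed on (surface form, part of speech), where the POS tag resolves ambiguous forms and out-of-vocabulary words fall back to the surface form. The lexicon entries and the `lemmatize` helper are illustrative assumptions, not drawn from any real lexicon or library.

```python
# Toy lemmatizer: maps (surface form, POS) pairs to lemmas.
# Entries are illustrative only; a real system would use a large
# lexicon plus a model for unseen words.
LEXICON = {
    # Ambiguous surface form: "saw" is either the past tense of the
    # verb "see" or the noun "saw" — context (here, the POS tag) decides.
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
}

def lemmatize(token: str, pos: str) -> str:
    """Return the lemma for (token, pos); fall back to the lowercased
    surface form for out-of-vocabulary words (the unseen-word problem)."""
    return LEXICON.get((token.lower(), pos), token.lower())

print(lemmatize("saw", "VERB"))    # -> see
print(lemmatize("saw", "NOUN"))    # -> saw
print(lemmatize("glorps", "NOUN")) # unseen word: falls back to "glorps"
```

Practical systems replace the dictionary lookup with neural models (character-level or subword) precisely so that unseen inflected forms can still be mapped to plausible lemmas.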
These leaderboards are used to track progress in Lemmatization
Libraries
Use these libraries to find Lemmatization models and implementations
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.
We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings.
Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword.
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy.
In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
This paper describes a new method for normalization of words to further reduce the space of representation.
In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization.