Lemmatization

50 papers with code • 0 benchmarks • 2 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Especially for languages with rich morphology, it is important to be able to normalize words into their base forms to better support, for example, search engines and linguistic studies. The main difficulties in lemmatization arise from encountering previously unseen words at inference time and from disambiguating surface forms that can be inflected variants of several different base forms depending on the context.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
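
The two difficulties named in the description (ambiguous surface forms and unseen words) can be illustrated with a minimal Python sketch. It is not taken from any of the papers listed here: the lexicon entries, the suffix rules, and the names LEXICON, SUFFIX_RULES and lemmatize are hypothetical, and the part-of-speech tag is assumed to be supplied by some upstream tagger as the only source of context.

```python
from typing import Dict, List, Tuple

# Toy lexicon (illustrative entries only): (surface form, POS tag) -> lemma.
# Keying on the POS tag is a crude stand-in for context-based disambiguation.
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Fallback rules for forms missing from the lexicon: (POS, suffix, replacement).
# Rules are tried in order, so longer suffixes come before shorter ones.
SUFFIX_RULES: List[Tuple[str, str, str]] = [
    ("VERB", "ing", ""),
    ("VERB", "ed", ""),
    ("NOUN", "ies", "y"),
    ("NOUN", "s", ""),
]


def lemmatize(form: str, pos: str) -> str:
    """Return a lemma for `form`, given a POS tag from an assumed upstream tagger."""
    word = form.lower()

    # 1. Known form: the POS tag picks among competing lemmas.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]

    # 2. Unseen form: strip a suffix, but only if a stem of at least
    #    three characters remains.
    for rule_pos, suffix, replacement in SUFFIX_RULES:
        if pos == rule_pos and word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement

    # 3. No rule applies: fall back to the lowercased form itself.
    return word
```

With these toy tables, lemmatize("saw", "VERB") returns "see" while lemmatize("saw", "NOUN") returns "saw", and the unseen form "walked" tagged as VERB falls through to the suffix rules and yields "walk". Neural lemmatizers such as those listed below replace both the lexicon and the hand-written rules with learned models.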

Most implemented papers

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

stanfordnlp/stanza ACL 2020

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.

LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs

hyperparticle/LemmaTag 10 Aug 2018

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings.

Improving Lemmatization of Non-Standard Languages with Joint Learning

emanjavacas/pie NAACL 2019

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword.

Top2Vec: Distributed Representations of Topics

ddangelov/Top2Vec 19 Aug 2020

Distributed representations of documents and words have gained popularity due to their ability to capture semantics of words and documents.

HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

huspacy/huspacy 6 Jan 2022

Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications.

Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

creat89/SummTriver 14 Sep 2012

This paper describes a new method for normalization of words to further reduce the space of representation.

Development of a Hindi Lemmatizer

sainimohit23/hindi-stemmer 24 May 2013

We live in a translingual society; in order to communicate with people from different parts of the world, we need to have expertise in their respective languages.

Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

jedgusse/collaborative-authorship 4 Mar 2016

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization.

Urdu Summary Corpus

humsha/USCorpus LREC 2016

This paper reports the construction of a benchmark corpus for Urdu summaries (abstracts) to facilitate the development and evaluation of single-document summarization systems for the Urdu language.