Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, normalizing words to their base forms is important for applications such as search engines and linguistic studies. The main difficulties in lemmatization are handling previously unseen words at inference time and disambiguating surface forms that, depending on context, can be inflected variants of several different base forms.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
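The two difficulties named in the description can be illustrated with a minimal sketch. It is a toy lookup-plus-rules lemmatizer, not the sequence-to-sequence model of the cited paper. The lexicon entries, the suffix rules, and the function name `lemmatize` are all assumptions made up for this illustration, and the part-of-speech tag stands in for the context a real system would infer.

```python
from typing import Dict, List, Tuple

# Toy lexicon (illustrative assumption): (surface form, POS tag) -> lemma.
# The POS tag plays the role of context for ambiguous surface forms.
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Fallback rules (illustrative assumption) for forms missing from the lexicon:
# (suffix to strip, replacement), tried in order.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(form: str, pos: str) -> str:
    """Return a lemma for `form`, using `pos` to pick among ambiguous entries."""
    word = form.lower()
    # Known form: the (word, POS) pair selects the right base form.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    # Unseen form: fall back to crude suffix stripping, keeping at least
    # three characters of stem so short words are left alone.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for form, pos in [("saw", "VERB"), ("saw", "NOUN"), ("studies", "NOUN"), ("running", "VERB")]:
        print(form, pos, lemmatize(form, pos))
```

In this sketch the same surface form "saw" maps to "see" or "saw" depending on the tag, which is the ambiguity problem. The unseen form "running" falls through to the suffix rules and comes out as "runn", which shows why hand-written fallbacks generalize poorly and why learned models are used for unseen words.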

Benchmarks

No evaluation results yet.

Libraries

Use these libraries to find Lemmatization models and implementations

huspacy/huspacy • 3 papers • 147

Most implemented papers

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

stanfordnlp/stanza • ACL 2020

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.

LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs

hyperparticle/LemmaTag • 10 Aug 2018

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings.

Improving Lemmatization of Non-Standard Languages with Joint Learning

emanjavacas/pie • NAACL 2019

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword.

Top2Vec: Distributed Representations of Topics

ddangelov/Top2Vec • 19 Aug 2020

Distributed representations of documents and words have gained popularity due to their ability to capture semantics of words and documents.

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

huspacy/huspacy • 24 Aug 2023

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy.

Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation

TickleForce/ancient-greek-datasets • 24 Aug 2023

In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.

Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

creat89/SummTriver • 14 Sep 2012

This paper describes a new method for normalization of words to further reduce the space of representation.

Development of a Hindi Lemmatizer

sainimohit23/hindi-stemmer • 24 May 2013

We live in a translingual society; to communicate with people from different parts of the world, we need expertise in their respective languages.
