Lemmatization

61 papers with code • 0 benchmarks • 3 datasets

Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, normalizing words to their base forms is important for applications such as search engines and linguistic studies. The main difficulties in lemmatization are previously unseen words encountered at inference time and ambiguous surface forms, which can be inflected variants of several different base forms depending on the context.

Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
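
The two difficulties named in the description can be illustrated with a minimal sketch. It is not the method of the cited paper, which uses a sequence-to-sequence model. The lexicon, the suffix rules, and the function name `lemmatize` are all hypothetical and chosen only for illustration: a part-of-speech tag stands in for context to disambiguate forms such as "saw", and crude suffix stripping stands in for a strategy for unseen words.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (surface form, POS tag) -> lemma.
# The POS tag plays the role of "context" for ambiguous forms.
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("mice", "NOUN"): "mouse",
}

# Hypothetical fallback rules for words missing from the lexicon:
# (suffix to strip, replacement). Tried in order, first match wins.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(word: str, pos: str) -> str:
    """Return a lemma for `word`, using `pos` to resolve ambiguity."""
    lowered = word.lower()

    # 1. Known form: the (form, POS) pair selects one of several lemmas.
    if (lowered, pos) in LEXICON:
        return LEXICON[(lowered, pos)]

    # 2. Unseen form: strip a suffix, but only if a stem of at least
    #    three characters remains, so short words are left unchanged.
    for suffix, replacement in SUFFIX_RULES:
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)] + replacement

    # 3. Nothing applies: treat the word as its own lemma.
    return lowered


print(lemmatize("saw", "VERB"))      # ambiguous form, resolved as a verb
print(lemmatize("saw", "NOUN"))      # same form, resolved as a noun
print(lemmatize("studies", "NOUN"))  # unseen form, handled by a suffix rule
```

Rules this simple fail quickly (for example, "running" becomes "runn"), which is one reason the papers listed on this page turn to context-aware and learned approaches.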

Latest papers with no code

H2-Golden-Retriever: Methodology and Tool for an Evidence-Based Hydrogen Research Grantsmanship

no code yet • 16 Nov 2022

The Knowledge Graph module was used for the generation of meaningful entities and their relationships, trends and patterns in relevant H2 papers, thanks to an ontology of the hydrogen production domain.

Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

no code yet • 28 Oct 2022

The lemmatizer combines general rules and part-of-speech data for the Uzbek language with a classification of affixes, removes affixes using a finite state machine for each class, and then determines the lemma of the word.

Arabic Word-level Readability Visualization for Assisted Text Simplification

no code yet • 19 Oct 2022

This demo paper presents a Google Docs add-on for automatic Arabic word-level readability visualization.

Social Media Personal Event Notifier Using NLP and Machine Learning

no code yet • 10 Oct 2022

Social media apps have become very promising and omnipresent in daily life.

Context based lemmatizer for Polish language

no code yet • 23 Jul 2022

In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.

TArC: Tunisian Arabish Corpus First complete release

no code yet • 11 Jul 2022

In this paper we present the final result of a project on Tunisian Arabic encoded in Arabizi, the Latin-based writing system for digital conversations.

The 2021 Urdu Fake News Detection Task using Supervised Machine Learning and Feature Combinations

no code yet • 6 Apr 2022

Our submitted results ranked fifth in the competition.

Abusive and Threatening Language Detection in Urdu using Supervised Machine Learning and Feature Combinations

no code yet • 6 Apr 2022

This paper reports a non-exhaustive list of experiments that allowed us to reach the submitted results.

Supervised and Unsupervised Categorization of an Imbalanced Italian Crime News Dataset

no code yet • Lecture Notes in Business Information Processing 2022

The scope of this paper is to explore the use of word embeddings for Italian crime news text categorization.

POS tagging, lemmatization and dependency parsing of West Frisian

no code yet • LREC 2022

POS tags were assigned to words by using a Dutch POS tagger that was applied to a literal word-by-word translation, or to sentences of a Dutch parallel text.