Lemmatization
59 papers with code • 0 benchmarks • 3 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. For languages with rich morphology in particular, normalizing words to their base forms is important for applications such as search engines and linguistic studies. The main difficulties in lemmatization are handling previously unseen words at inference time and disambiguating surface forms that can be inflected variants of several different base forms, depending on context.
Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
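The two difficulties named above (ambiguous surface forms and unseen words) can be illustrated with a minimal sketch. It is not the method of the cited paper, which uses a sequence-to-sequence model. The lexicon, the suffix rules, and the function name `lemmatize` are all hypothetical and chosen only for illustration, and the part-of-speech tag is assumed to be supplied by an upstream tagger.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (surface form, POS tag) -> lemma.
# Keying on the POS tag is what lets "saw" and "leaves" map to
# different lemmas depending on context.
LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
}

# Hypothetical fallback rules for unseen words: (suffix, replacement),
# tried in order, first match wins.
SUFFIX_RULES: Dict[str, List[Tuple[str, str]]] = {
    "NOUN": [("ies", "y"), ("ss", "ss"), ("s", "")],
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
}


def lemmatize(token: str, pos: str) -> str:
    """Return a lemma for `token`, given a POS tag from an upstream tagger."""
    form = token.lower()

    # 1. Known (form, POS) pair: the tag resolves the ambiguity.
    if (form, pos) in LEXICON:
        return LEXICON[(form, pos)]

    # 2. Unseen word: fall back to crude suffix rewriting.
    #    The length check avoids stripping very short words.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if form.endswith(suffix) and len(form) > len(suffix) + 2:
            return form[: -len(suffix)] + replacement

    # 3. No rule applies: return the lowercased form unchanged.
    return form


if __name__ == "__main__":
    for token, pos in [("saw", "VERB"), ("saw", "NOUN"), ("studies", "NOUN")]:
        print(token, pos, lemmatize(token, pos))
```

Hand-written rules like these both over-strip and under-strip (for example, this sketch turns "running" into "runn"), which is one reason the papers listed on this page learn the mapping from annotated data instead.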
Most implemented papers
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.
LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs
We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings.
Improving Lemmatization of Non-Standard Languages with Joint Learning
Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword.
Top2Vec: Distributed Representations of Topics
Distributed representations of documents and words have gained popularity due to their ability to capture semantics of words and documents.
Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy.
Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text.
Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization
This paper describes a new method for normalization of words to further reduce the space of representation.
Development of a Hindi Lemmatizer
We live in a translingual society; to communicate with people from different parts of the world, we need expertise in their respective languages.
Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization.
Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization
Preprocessing is a preliminary step in many fields including IR and NLP.