Lexical Normalization

12 papers with code • 1 benchmark • 1 dataset

Lexical normalization is the task of transforming non-standard text into a standard register.

Example:

new pix comming tomoroe
new pictures coming tomorrow

Datasets usually consist of tweets, since tweets naturally contain a fair amount of non-standard language.

For lexical normalization, only word-level replacements are annotated. Some corpora include annotations for 1-N and N-1 replacements (one word split into several, or several words merged into one). However, word insertion, deletion, and reordering are not part of the task.
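The word-level constraint can be illustrated with a minimal sketch. It assumes a hypothetical hand-written lexicon (`LEXICON`) and a hypothetical helper (`normalize`); neither comes from any of the systems listed on this page, which learn replacements from annotated data. The output always has exactly one slot per input token, so a 1-N replacement is stored as a multi-word string in a single slot. N-1 replacements need an extra annotation convention and are not handled here.

```python
from typing import Dict, List

# Hypothetical toy lexicon for illustration only.
LEXICON: Dict[str, str] = {
    "pix": "pictures",
    "comming": "coming",
    "tomoroe": "tomorrow",
    "gonna": "going to",  # 1-N replacement: one token maps to two words
}


def normalize(tokens: List[str]) -> List[str]:
    """Return one normalized string per input token.

    Tokens missing from the lexicon are copied unchanged. No token is
    inserted, deleted, or reordered, matching the task definition.
    """
    return [LEXICON.get(tok.lower(), tok) for tok in tokens]


raw = "new pix comming tomoroe".split()
norm = normalize(raw)

# Each pair is (original token, normalized form), aligned by position.
for original, replacement in zip(raw, norm):
    print(f"{original}\t{replacement}")

print(" ".join(norm))
```

Because input and output stay aligned token by token, a system's predictions can be compared against gold annotations position by position.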


Most implemented papers

MoNoise: Modeling Noise Using a Modular Normalization System

robvanderg/monoise 10 Oct 2017

We show that MoNoise beats the state-of-the-art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly differently.

Modeling Input Uncertainty in Neural Network Dependency Parsing

robvanderg/normpar EMNLP 2018

Recently introduced neural network parsers allow for new approaches to circumvent data sparsity issues by modeling character level information and by exploiting raw data in a semi-supervised setting.

Adapting Sequence to Sequence models for Text Normalization in Social Media

Isminoula/TextNormSeq2Seq 12 Apr 2019

Social media offer an abundant source of valuable raw data, however informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks.

MoNoise: A Multi-lingual and Easy-to-use Lexical Normalization Tool

robvanderg/cacheembeds ACL 2019

In this paper, we introduce and demonstrate the online demo as well as the command line interface of a lexical normalization system (MoNoise) for a variety of languages.

A Multi-cascaded Deep Model for Bilingual SMS Classification

haroonshakeel/bilingual_sms_classification 29 Nov 2019

Our model achieves high accuracy for classification on this dataset and outperforms the previous model for multilingual text classification, highlighting language independence of McM.

Adapting Deep Learning for Sentiment Classification of Code-Switched Informal Short Text

haroonshakeel/multisenti 4 Jan 2020

Such informal and code-switched content are under-resourced in terms of labeled datasets and language models even for popular tasks like sentiment classification.

A Clustering Framework for Lexical Normalization of Roman Urdu

abdulrafae/normalization 31 Mar 2020

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content.

Lexical Normalization for Code-switched Data and its Effect on POS Tagging

ozlemcek/TrDeNormData EACL 2021

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media.

User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

shigashiyama/jlexnorm NAACL 2021

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT).

DaN+: Danish Nested Named Entities and Lexical Normalization

bplank/DaNplus COLING 2020

We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER.