Lexical Normalization

15 papers with code • 1 benchmark • 1 dataset

Lexical normalization is the task of transforming non-standard text into a standard register.

Example:

new pix comming tomoroe
new pictures coming tomorrow

Datasets usually consist of tweets, since these naturally contain a fair number of non-standard forms.

For lexical normalization, only word-level replacements are annotated. Some corpora also annotate 1-N replacements (one word split into several) and N-1 replacements (several words merged into one). However, word insertion, deletion, and reordering are not part of the task.
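The word-level setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the format of any particular corpus: the `Annotation` type, the helper names `apply_normalization` and `word_accuracy`, and the convention of marking an N-1 merge with an empty normalized form on the later tokens are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical annotation format: one (raw_token, normalized_form) pair per
# input token. A 1-N replacement stores several words in the normalized form.
# An N-1 replacement is marked here (by assumption) by giving the merged
# form to the first token and an empty form to the tokens merged into it.
Annotation = List[Tuple[str, str]]


def apply_normalization(annotation: Annotation) -> str:
    """Join the normalized forms, skipping tokens merged into a predecessor."""
    return " ".join(norm for _, norm in annotation if norm)


def word_accuracy(gold: Annotation, predicted: List[str]) -> float:
    """Fraction of input tokens whose predicted form equals the gold form."""
    if len(gold) != len(predicted):
        raise ValueError("one prediction per input token is required")
    if not gold:
        return 0.0
    correct = sum(g == p for (_, g), p in zip(gold, predicted))
    return correct / len(gold)


# The example from the task description: plain 1-1 replacements.
example = [
    ("new", "new"),
    ("pix", "pictures"),
    ("comming", "coming"),
    ("tomoroe", "tomorrow"),
]

# Illustrative 1-N case: "gonna" expands to two words.
one_to_n = [("gonna", "going to"), ("b", "be"), ("l8", "late")]

# Illustrative N-1 case: two raw tokens merge into one word.
n_to_one = [("to", "tomorrow"), ("morrow", "")]

print(apply_normalization(example))
print(apply_normalization(one_to_n))
print(apply_normalization(n_to_one))
print(word_accuracy(example, ["new", "pics", "coming", "tomorrow"]))
```

Because every input token keeps exactly one output slot, the number of annotated positions always equals the number of raw tokens, which is why insertion, deletion, and reordering fall outside this representation. The accuracy helper counts unchanged tokens as well as corrected ones; other metrics exist and are not shown here.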

Latest papers with no code

A Character-level Ngram-based MT Approach for Lexical Normalization in Social Media

no code yet • ACL ARR December 2022

This paper presents an ngram-based MT approach that operates at character-level to generate possible canonical forms for lexical variants in social media text.

Contrastive String Representation Learning using Synthetic Data

no code yet • 8 Oct 2021

We demonstrate the effectiveness of our approach by evaluating the learned representation on the task of string similarity matching.

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

no code yet • WNUT (ACL) 2021

Our results show that while word-level, intrinsic, performance evaluation is behind other methods, our model improves performance on extrinsic, downstream tasks through normalization compared to models operating on raw, unprocessed, social media text.

Lexical Normalization for Code-switched Data and its Effect on POS-tagging

no code yet • 1 Jun 2020

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media.

Norm It! Lexical Normalization for Italian and Its Downstream Effects for Dependency Parsing

no code yet • LREC 2020

However, for Italian, there is no benchmark available for lexical normalization, despite the presence of many benchmarks for other tasks involving social media data.

Synthetic Data for English Lexical Normalization: How Close Can We Get to Manually Annotated Data?

no code yet • LREC 2020

With this system, we score 94.29 accuracy on the test data, compared to 95.22 when it is trained on human-annotated data.

An In-depth Analysis of the Effect of Lexical Normalization on the Dependency Parsing of Social Media

no code yet • WS 2019

Existing natural language processing systems have often been designed with standard texts in mind.

Enhancing BERT for Lexical Normalization

no code yet • WS 2019

In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalisation.

Normalization of Indonesian-English Code-Mixed Twitter Data

no code yet • WS 2019

Twitter is an excellent source of data for NLP research, as it offers a tremendous amount of textual data.

Lexical Normalization of User-Generated Medical Text

no code yet • WS 2019

In the medical domain, user-generated social media text is increasingly used as a valuable complementary knowledge source to scientific medical literature.