Lexical Normalization

15 papers with code • 1 benchmark • 1 dataset

Lexical normalization is the task of transforming non-standard text into a standard register.

Example:

new pix comming tomoroe
new pictures coming tomorrow

Datasets usually consist of tweets, since these naturally contain a fair amount of non-standard language.

For lexical normalization, only word-level replacements are annotated. Some corpora also annotate 1-N and N-1 replacements (one word mapped to several words, or several words merged into one). However, word insertion, deletion, and reordering are not part of the task.
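The word-level framing can be illustrated with a minimal sketch. It assumes a hand-written lookup table (the `LEXICON` dictionary and the `normalize` function are hypothetical names made up for this illustration, not part of any system listed on this page); real normalization systems generate and rank candidates rather than rely on a fixed table. The point is only that the output stays aligned with the input, one normalized string per source token, so a 1-N replacement appears as a single token mapped to a multi-word string.

```python
from typing import Dict, List

# Hypothetical toy lexicon, for illustration only.
LEXICON: Dict[str, str] = {
    "pix": "pictures",
    "comming": "coming",
    "tomoroe": "tomorrow",
    "gonna": "going to",  # 1-N replacement: one token maps to two words
}


def normalize(tokens: List[str], lexicon: Dict[str, str] = LEXICON) -> List[str]:
    """Return one normalized string per input token.

    Tokens missing from the lexicon are copied unchanged. No token is
    inserted, deleted, or reordered, so output[i] always corresponds to
    tokens[i].
    """
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


raw = "new pix comming tomoroe".split()
normalized = normalize(raw)
print(" ".join(normalized))  # prints: new pictures coming tomorrow
```

N-1 replacements (several source tokens merged into one word) are not covered by this sketch; handling them would need an annotation convention for the merged positions, which differs between corpora.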

Datasets


ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text

ngxtnhi/vilexnorm 29 Jan 2024

In this work, we introduce Vietnamese Lexical Normalization (ViLexNorm), the first-ever corpus developed for the Vietnamese lexical normalization task.

5

Automatic Textual Normalization for Hate Speech Detection

anhhoang0529/small-lexnormvihsd 12 Nov 2023

Our dataset is accessible for research purposes.

3

ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5

ufal/multilexnorm2021 WNUT (ACL) 2021

We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages.

15
28 Oct 2021

DaN+: Danish Nested Named Entities and Lexical Normalization

bplank/DaNplus COLING 2020

We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER.

5
24 May 2021

User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

shigashiyama/jlexnorm NAACL 2021

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT).

3
08 Apr 2021

Lexical Normalization for Code-switched Data and its Effect on POS Tagging

ozlemcek/TrDeNormData EACL 2021

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media.

2
01 Apr 2021

A Clustering Framework for Lexical Normalization of Roman Urdu

abdulrafae/normalization 31 Mar 2020

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content.

1

Adapting Deep Learning for Sentiment Classification of Code-Switched Informal Short Text

haroonshakeel/multisenti 4 Jan 2020

Such informal and code-switched content is under-resourced in terms of labeled datasets and language models, even for popular tasks like sentiment classification.

2

A Multi-cascaded Deep Model for Bilingual SMS Classification

haroonshakeel/bilingual_sms_classification 29 Nov 2019

Our model achieves high accuracy for classification on this dataset and outperforms the previous model for multilingual text classification, highlighting the language independence of McM (the multi-cascaded model).

0

MoNoise: A Multi-lingual and Easy-to-use Lexical Normalization Tool

robvanderg/cacheembeds ACL 2019

In this paper, we introduce and demonstrate the online demo as well as the command line interface of a lexical normalization system (MoNoise) for a variety of languages.

0
01 Jul 2019