Lexical normalization is the task of translating or transforming non-standard text into a standard register.
new pix comming tomoroe → new pictures coming tomorrow
Datasets usually consist of tweets, since these naturally contain a fair amount of these phenomena.
For lexical normalization, only replacements at the word level are annotated. Some corpora include annotation for 1-N and N-1 replacements. However, word insertion, deletion, and reordering are not part of the task.
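The word-level scope of the task can be illustrated with a minimal sketch. The lookup dictionary below is hypothetical and purely illustrative; real normalization systems such as MoNoise generate and rank candidates with learned features rather than a fixed table.

```python
# Hypothetical lookup table for illustration only; real systems learn
# candidate replacements rather than hard-coding them.
NORM_DICT = {
    "pix": ["pictures"],       # 1-1 replacement
    "comming": ["coming"],
    "tomoroe": ["tomorrow"],
    "gonna": ["going", "to"],  # 1-N replacement: one token becomes two
}

def normalize(tokens):
    """Replace each non-standard token via the dictionary; keep others as-is.
    Insertion, deletion, and reordering are out of scope for the task."""
    out = []
    for tok in tokens:
        out.extend(NORM_DICT.get(tok.lower(), [tok]))
    return out

print(normalize("new pix comming tomoroe".split()))
# ['new', 'pictures', 'coming', 'tomorrow']
```

Note that each input token maps to zero or more output tokens independently, which is exactly why 1-N replacements fit the annotation scheme while reordering does not.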
Social media offer an abundant source of valuable raw data; however, informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks.
We show that MoNoise beats the state of the art on several normalization benchmarks for English and Dutch, each of which defines the normalization task slightly differently.
Our model achieves high classification accuracy on this dataset and outperforms the previous model for multilingual text classification, highlighting the language independence of McM.