Transliteration

45 papers with code • 0 benchmarks • 5 datasets

Transliteration is a mechanism for converting a word in a source (foreign) language to a target language, and often adopts approaches from machine translation. In machine translation, the objective is to preserve the semantic meaning of the utterance as much as possible while following the syntactic structure of the target language. In transliteration, the objective is to preserve the original pronunciation of the source word as much as possible while following the phonological structures of the target language.

For example, the city name “Manchester” has become well known to speakers of languages other than English. Such new words are often named entities that are important in cross-lingual information retrieval, information extraction, and machine translation, and they often present out-of-vocabulary challenges to spoken language technologies such as automatic speech recognition, spoken keyword search, and text-to-speech.

Source: Phonology-Augmented Statistical Framework for Machine Transliteration using Limited Linguistic Resources
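
To make the idea of following the target language's writing conventions concrete, here is a minimal sketch of a rule-based transliterator that rewrites Latin-script input into Cyrillic by greedy longest-match lookup. The rule table and the `transliterate` function are hypothetical illustrations, not taken from the cited paper; real systems typically learn such mappings statistically or with neural models and handle far more context.

```python
# Toy grapheme-to-grapheme rules (hypothetical and far from complete).
# Multi-letter keys such as "ch" must win over single letters, which is
# why the lookup below tries the longest chunk first.
RULES = {
    "ch": "ч", "sh": "ш",
    "a": "а", "e": "е", "m": "м", "n": "н",
    "r": "р", "s": "с", "t": "т",
}
MAX_LEN = max(len(key) for key in RULES)


def transliterate(word: str) -> str:
    """Greedy longest-match transliteration using RULES."""
    text = word.lower()
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += size
                break
        else:
            # No rule matched: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("Manchester"))
```

With this toy table, "Manchester" is rewritten letter group by letter group, with "ch" mapped as a single unit. A fixed table like this ignores pronunciation differences between languages, which is the gap that phonology-aware and learned models aim to close.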

Benchmarks

These leaderboards are used to track progress in Transliteration

No evaluation results yet.

Most implemented papers

Universal Dependency Parsing for Hindi-English Code-switching

irshadbhat/nsdp-cs • NAACL 2018

We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks.


Applying the Transformer to Character-level Transduction

shijie-wu/neural-transducer • EACL 2021

The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.


Sub-Character Tokenization for Chinese Pretrained Language Models

thunlp/subchartokenization • 1 Jun 2021

Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos.
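
The homophone property described above can be illustrated with a small sketch. It assumes a hypothetical hand-written pronunciation table (toneless pinyin for four characters) and a hypothetical `encode` helper; it is not the tokenizer from the paper, only a demonstration of why mapping characters to pronunciations makes homophone substitutions invisible to later steps.

```python
# Hypothetical pronunciation table (toneless pinyin), for illustration only.
# The three third-person pronouns are homophones in Mandarin.
PINYIN = {
    "他": "ta",   # he
    "她": "ta",   # she
    "它": "ta",   # it
    "们": "men",  # plural marker
}


def encode(text: str) -> list[str]:
    """Map each character to its pronunciation; keep unknown characters as-is."""
    return [PINYIN.get(char, char) for char in text]


# A homophone typo (她 written instead of 他) yields the same encoding.
print(encode("他们"))
print(encode("她们"))
print(encode("他们") == encode("她们"))
```

Because both inputs collapse to the same pronunciation sequence, any tokenizer run on top of that sequence produces identical output for them. The trade-off, in this simplified form, is that genuinely different words with the same pronunciation also become indistinguishable.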


Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users

AI4Bharat/IndicXlit • 6 May 2022

Transliteration is very important in the Indian language context due to the usage of multiple scripts and the widespread use of romanized inputs.


An Ensemble Model of Word-based and Character-based Models for Japanese and Chinese Input Method

nokuno/jsc • WS 2012


Context Independent Term Mapper for European Languages

pmarcis/mp-aligner • RANLP 2013


Bilingual dictionaries for all EU languages

pmarcis/dict-filtering • LREC 2014

In this work we present three different methods for cleaning noise from automatically generated bilingual dictionaries: LLR-based, pivot-based, and translation-based approaches.
