Language Identification

123 papers with code • 6 benchmarks • 19 datasets

Language identification is the task of determining the language of a text or speech sample.
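
As a rough illustration of the task for text input, the sketch below compares character trigram counts of a query against one small profile per language and returns the closest match by cosine similarity. This is a toy baseline under stated assumptions, not the method of any paper listed on this page: the sample sentences, the `SAMPLES` and `PROFILES` tables, and the function names `char_ngrams`, `cosine`, and `identify` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding the text with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Hypothetical toy training data; real systems train on large curated corpora.
SAMPLES = {
    "en": "the dog lies under the table and the cat sits on the chair",
    "es": "el perro duerme debajo de la mesa y el gato come en la silla",
    "de": "der hund liegt unter dem tisch und die katze sitzt auf dem stuhl",
}

# One trigram profile per language, built once from the toy samples.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def identify(text):
    """Return the language code whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

A call such as `identify("el gato duerme en la silla")` returns the code of the closest profile. With only one sentence per language the profiles are far too small for real use; the point is only to show the input and output of the task.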

My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

l3cube-pune/MarathiNLP 24 Jun 2023

This is the first work that presents artifacts for code-mixed Marathi research.

87 stars

Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer

NVIDIA/NeMo 14 Jun 2023

Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.

10,110 stars

Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech

shashikg/lid-code-switching 1 Jun 2023

This work focuses on improving the Spoken Language Identification (LangId) system for a challenge that focuses on developing robust language identification systems that are reliable for non-standard, accented (Singaporean accent), spontaneous code-switched, and child-directed speech collected via Zoom.

1 star

MERLIon CCS Challenge Evaluation Plan

merlion-challenge/merlion-ccs-2023 31 May 2023

This paper introduces the inaugural Multilingual Everyday Recordings- Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge, focused on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.

2 stars

Investigating model performance in language identification: beyond simple error statistics

merlion-challenge/merlion-ccs-2023 30 May 2023

These overview metrics do not provide information about model performance at the level of individual speakers, recordings, or units of speech with different linguistic characteristics.

2 stars

MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization

merlion-challenge/merlion-ccs-2023 30 May 2023

To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns.

2 stars

Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

ai4bharat/indiclid 25 May 2023

We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text.

4 stars

Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

sinaahmadi/scriptnormalization 25 May 2023

The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.

2 stars

Scaling Speech Technology to 1,000+ Languages

facebookresearch/fairseq arXiv 2023

Expanding the language coverage of speech technology has the potential to improve access to information for many more people.

29,287 stars
23 May 2023

An Open Dataset and Model for Language Identification

laurieburchell/open-lid-dataset 23 May 2023

We achieve this by training on a curated dataset of monolingual data, the reliability of which we ensure by auditing a sample from each source and each language manually.

52 stars