Language Identification
123 papers with code • 6 benchmarks • 19 datasets
Language identification is the task of determining the language of a text.
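As a rough illustration of the task, the sketch below identifies the language of a short text by comparing its character trigrams against per-language trigram counts. It is not taken from any of the papers listed here. The training sentences, the function names (`char_ngrams`, `build_profile`, `identify_language`) and the scoring rule are all assumptions made for the example, and a real system would train on far more data.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_profile(samples, n=3):
    """Count character n-grams over all training samples of one language."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    return counts

def identify_language(text, profiles, n=3):
    """Pick the language whose profile gives the text's n-grams the most relative mass."""
    grams = Counter(char_ngrams(text, n))

    def score(profile):
        total = sum(profile.values())
        # Unseen n-grams contribute 0 because Counter returns 0 for missing keys.
        return sum(count * profile[gram] / total for gram, count in grams.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))

# Toy training data (hypothetical, far too small for real use).
TOY_TRAINING_DATA = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short sentence in english",
    ],
    "de": [
        "der schnelle braune fuchs springt über den faulen hund",
        "das ist ein kurzer satz auf deutsch",
    ],
}
PROFILES = {lang: build_profile(samples) for lang, samples in TOY_TRAINING_DATA.items()}
```

Calling `identify_language("the dog is in the house", PROFILES)` selects the English profile, since the input shares trigrams such as " th" and "the" with the English samples and very few with the German ones.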
Latest papers
My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks
This is the first work that presents artifacts for code-mixed Marathi research.
Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.
Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech
This work improves the Spoken Language Identification (LangId) system for a challenge on developing robust language identification systems that are reliable for non-standard, accented (Singaporean accent), spontaneous code-switched, and child-directed speech collected via Zoom.
MERLIon CCS Challenge Evaluation Plan
This paper introduces the inaugural Multilingual Everyday Recordings- Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge, focused on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.
Investigating model performance in language identification: beyond simple error statistics
Overview metrics such as aggregate error statistics do not provide information about model performance at the level of individual speakers, recordings, or units of speech with different linguistic characteristics.
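To make the point about overview metrics concrete, here is a minimal sketch of how a single aggregate accuracy can hide a failure on one speaker. The records, field names (`speaker`, `true`, `pred`) and helper functions are hypothetical and are not drawn from the paper.

```python
from collections import defaultdict

def accuracy(pairs):
    """Fraction of (true_label, predicted_label) pairs that match."""
    pairs = list(pairs)
    return sum(true == pred for true, pred in pairs) / len(pairs)

def accuracy_by_group(records, key):
    """Compute accuracy separately for each value of records[i][key]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append((record["true"], record["pred"]))
    return {group: accuracy(pairs) for group, pairs in groups.items()}

# Hypothetical predictions: speaker A is always classified correctly,
# speaker B (Mandarin utterances) is always misclassified as English.
RECORDS = (
    [{"speaker": "A", "true": "en", "pred": "en"}] * 8
    + [{"speaker": "B", "true": "zh", "pred": "en"}] * 2
)

overall = accuracy((r["true"], r["pred"]) for r in RECORDS)
per_speaker = accuracy_by_group(RECORDS, "speaker")
```

With this toy data the overall accuracy is 0.8, while the per-speaker breakdown is 1.0 for speaker A and 0.0 for speaker B, which the aggregate figure alone would not reveal.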
MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization
To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns.
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages
We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text.
Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities
The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to improve access to information for many more people.
An Open Dataset and Model for Language Identification
We achieve this by training on a curated dataset of monolingual data, the reliability of which we ensure by auditing a sample from each source and each language manually.