Language Identification

123 papers with code • 6 benchmarks • 19 datasets

Language identification is the task of determining which language a given text or speech sample is in.
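As a rough illustration of the text case, one classic approach compares character n-gram statistics of the input against per-language profiles. The sketch below is a toy example, not the method of any paper or leaderboard entry listed on this page: the training sentences, the `identify` helper, and the choice of trigrams with cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
import math

# Tiny hypothetical training set; a real system would use far more data.
SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a simple sentence written in english",
    ],
    "es": [
        "el rápido zorro marrón salta sobre el perro perezoso",
        "esta es una frase sencilla escrita en español",
    ],
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_profiles(samples):
    """Sum n-gram counts over all sentences of each language."""
    profiles = {}
    for lang, sentences in samples.items():
        profile = Counter()
        for sentence in sentences:
            profile.update(char_ngrams(sentence))
        profiles[lang] = profile
    return profiles


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


PROFILES = build_profiles(SAMPLES)

if __name__ == "__main__":
    for sentence in ["the dog is in the house", "el perro es rápido"]:
        print(sentence, "->", identify(sentence, PROFILES))
```

With only two sentences per language this is useful for intuition only; short inputs, closely related languages, and code-mixed text (the focus of several papers listed here) are where such simple profiles tend to break down.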

Benchmarks


These leaderboards track progress in language identification.

Dataset	Best Model
VoxLingua107	XLS-R
OpenSubtitles	Apple bi-LSTM
Universal Dependencies	Apple bi-LSTM
Nordic Language Identification	FastText
GlotLID-C	GlotLID
VoxForge	ConformerG-P

Libraries

Use these libraries to find language identification models and implementations.

facebookresearch/fairseq • 2 papers • 29,287 stars

pytorch/fairseq • 2 papers • 29,287 stars

Latest papers


My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

l3cube-pune/MarathiNLP • 24 Jun 2023

This is the first work that presents artifacts for code-mixed Marathi research.


Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer

NVIDIA/NeMo • 14 Jun 2023 • 10,110 stars

Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation.

Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech

shashikg/lid-code-switching • 1 Jun 2023

This work focuses on improving the Spoken Language Identification (LangId) system for a challenge that focuses on developing robust language identification systems that are reliable for non-standard, accented (Singaporean accent), spontaneous code-switched, and child-directed speech collected via Zoom.


MERLIon CCS Challenge Evaluation Plan

merlion-challenge/merlion-ccs-2023 • 31 May 2023

This paper introduces the inaugural Multilingual Everyday Recordings - Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge, focused on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.


Investigating model performance in language identification: beyond simple error statistics

merlion-challenge/merlion-ccs-2023 • 30 May 2023

These overview metrics do not provide information about model performance at the level of individual speakers, recordings, or units of speech with different linguistic characteristics.


MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization

merlion-challenge/merlion-ccs-2023 • 30 May 2023

To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns.


Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

ai4bharat/indiclid • 25 May 2023

We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text.


Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

sinaahmadi/scriptnormalization • 25 May 2023

The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.


Scaling Speech Technology to 1,000+ Languages

facebookresearch/fairseq • arXiv 2023 • 23 May 2023 • 29,287 stars

Expanding the language coverage of speech technology has the potential to improve access to information for many more people.

An Open Dataset and Model for Language Identification

laurieburchell/open-lid-dataset • 23 May 2023

We achieve this by training on a curated dataset of monolingual data, the reliability of which we ensure by auditing a sample from each source and each language manually.

