Language Identification

123 papers with code • 6 benchmarks • 19 datasets

Language identification is the task of determining which natural language a given text or speech sample is in.
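As a rough illustration of the task for text, the sketch below compares character trigram counts of an input against per-language profiles using cosine similarity. It is a toy under stated assumptions, not the method of any model listed on this page: the three sample sentences, the `SAMPLES` and `PROFILES` tables, and the helper names `char_ngrams`, `cosine`, and `identify` are all made up for this example, and real systems are trained on far more data.

```python
import math
from collections import Counter

# Hypothetical, tiny "training" samples; real systems use large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat dort dans la maison",
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One trigram profile per language, built once from the samples above.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language code whose profile is most similar to the input."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


if __name__ == "__main__":
    for sentence in ("the dog sleeps in the house", "le chat dort dans la maison"):
        print(sentence, "->", identify(sentence))
```

With profiles this small the sketch only separates inputs that share vocabulary with the samples; short, noisy, or code-switched inputs need the kind of models tracked in the benchmarks on this page.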

Benchmarks

These leaderboards are used to track progress in Language Identification.

Dataset	Best Model
VoxLingua107	XLS-R
OpenSubtitles	Apple bi-LSTM
Universal Dependencies	Apple bi-LSTM
Nordic Language Identification	FastText
GlotLID-C	GlotLID
VoxForge	ConformerG-P

Libraries

Use these libraries to find Language Identification models and implementations.

facebookresearch/fairseq • 2 papers • 29,185 stars

pytorch/fairseq • 2 papers • 29,183 stars

Latest papers

What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions

hanyu-meng/adapting-leaf • 10 Apr 2024

There is increasing interest in the use of the LEArnable Front-end (LEAF) in a variety of speech processing systems.

Geographically-Informed Language Identification

jonathandunn/geolid • 14 Mar 2024

The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.

Language and Speech Technology for Central Kurdish Varieties

sinaahmadi/cordi • 4 Mar 2024

Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties.

KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection

michalspiegel/imgtb • 21 Feb 2024

SemEval-2024 Task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection.

Code-Switched Language Identification is Harder Than You Think

laurieburchell/cs-lid-harder-than-you-think • 2 Feb 2024

Code switching (CS) is a very common phenomenon in written and spoken communication but one that is handled poorly by many natural language processing applications.

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

lucy3/whos_filtered • 12 Jan 2024

Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation.

Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

gair-nlp/mathpile • 28 Dec 2023

Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus.
