Language Identification

123 papers with code • 6 benchmarks • 19 datasets

Language identification is the task of determining the language of a text.
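As a rough illustration of the task, here is a minimal sketch of a character-trigram classifier. It is a toy baseline under stated assumptions, not the method of any paper listed on this page: the function names (`char_ngrams`, `build_profiles`, `identify`) and the three training sentences in `TOY_SAMPLES` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def build_profiles(samples, n=3):
    """Build one n-gram count profile per language from (language, text) pairs."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(char_ngrams(text, n))
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles, n=3):
    """Return the language whose profile is most similar to the text's n-grams."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data; a real system would train on large corpora.
TOY_SAMPLES = [
    ("en", "the quick brown fox jumps over the lazy dog and the cat"),
    ("de", "der schnelle braune fuchs springt und der faule hund sieht die katze"),
    ("fr", "le renard brun rapide saute par-dessus le chien paresseux et le chat"),
]
PROFILES = build_profiles(TOY_SAMPLES)
```

With these toy profiles, `identify("the dog and the cat", PROFILES)` selects `"en"`. Short inputs, closely related languages, and code-switched text are exactly where a baseline like this tends to break down.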


What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions

hanyu-meng/adapting-leaf 10 Apr 2024

There is increasing interest in the use of the LEArnable Front-end (LEAF) in a variety of speech processing systems.

4 stars

Geographically-Informed Language Identification

jonathandunn/geolid 14 Mar 2024

The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.

1 star

Language and Speech Technology for Central Kurdish Varieties

sinaahmadi/cordi 4 Mar 2024

Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties.

8 stars

KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection

michalspiegel/imgtb 21 Feb 2024

SemEval-2024 Task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection.

7 stars

Code-Switched Language Identification is Harder Than You Think

laurieburchell/cs-lid-harder-than-you-think 2 Feb 2024

Code switching (CS) is a very common phenomenon in written and spoken communication but one that is handled poorly by many natural language processing applications.

3 stars

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

lucy3/whos_filtered 12 Jan 2024

Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation.

11 stars

Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

gair-nlp/mathpile 28 Dec 2023

Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus.

343 stars

OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification

languagetechnologylab/offmix-3l 27 Oct 2023

Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech.

3 stars

GlotLID: Language Identification for Low-Resource Languages

cisnlp/glotlid 24 Oct 2023

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages.

66 stars

Native Language Identification with Big Bird Embeddings

sergeykramp/mthesis-bigbird-embeddings 13 Sep 2023

Native Language Identification (NLI) aims to classify an author's native language based on their writing in another language.

0 stars