Language Identification

119 papers with code • 5 benchmarks • 18 datasets

Language identification is the task of determining which natural language a given text is written in.
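As a rough illustration of the task, the sketch below identifies the language of a short text by comparing its character trigram counts against per-language profiles using cosine similarity. The three training sentences, the `SAMPLES` and `PROFILES` names, and the `char_ngrams`, `cosine`, and `identify` helpers are all assumptions made for this example. It is not the method of any paper listed on this page, and real systems train on far larger corpora.

```python
import math
from collections import Counter

# Tiny illustrative training samples (assumed for this sketch,
# not taken from any benchmark or dataset listed on this page).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "a simple sentence in the english language",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "das ist ein einfacher satz in der deutschen sprache",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y "
          "esta es una frase sencilla en el idioma español",
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One trigram profile per language, built once from the toy samples.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language code whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


if __name__ == "__main__":
    for sentence in ("the dog is in the house",
                     "das ist ein hund",
                     "el perro es una frase"):
        print(sentence, "->", identify(sentence))
```

With profiles this small, the sketch is only reliable on inputs that share vocabulary with the samples. Character n-grams are used here because they need no tokenizer and still give some signal on very short inputs.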


Most implemented papers

The WiLI benchmark dataset for written language identification

birolkuyumcu/language_identification 23 Jan 2018

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification.

SpeechBrain: A General-Purpose Speech Toolkit

speechbrain/speechbrain 8 Jun 2021

SpeechBrain is an open-source and all-in-one speech toolkit.

Scaling Speech Technology to 1,000+ Languages

facebookresearch/fairseq arXiv 2023

Expanding the language coverage of speech technology has the potential to improve access to information for many more people.

GlotLID: Language Identification for Low-Resource Languages

cisnlp/glotlid 24 Oct 2023

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages.

Universal Dependency Parsing for Hindi-English Code-switching

irshadbhat/nsdp-cs NAACL 2018

We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks.

Predicting the Type and Target of Offensive Posts in Social Media

idontflow/olid NAACL 2019

In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

VadymV/OffensEval SEMEVAL 2019

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval).

Word-level Embeddings for Cross-Task Transfer Learning in Speech Processing

bepierre/SpeechVGG 22 Oct 2019

Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer.

Common Voice: A Massively-Multilingual Speech Corpus

facebookresearch/covost LREC 2020

To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages.

VoxLingua107: a Dataset for Spoken Language Recognition

alumae/torch-xvectors-wav 25 Nov 2020

Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech.