Language Identification
123 papers with code • 6 benchmarks • 19 datasets
Language identification is the task of determining the language of a text.
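As a minimal sketch of the task, the snippet below guesses a language by counting common function words. It is a toy baseline and not the method of any paper listed here. The word lists, the three language codes, and the name `identify_language` are illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative (assumed) function-word lists; practical systems rely on much
# larger vocabularies or on character n-gram models.
FUNCTION_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "es": {"el", "la", "y", "es", "de", "que", "en", "los"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
}


def identify_language(text):
    """Return the code of the language whose function words occur most
    often in `text`, or None if none of the listed words appear."""
    # Lowercase and keep runs of letters only (digits and punctuation dropped).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter()
    for lang, words in FUNCTION_WORDS.items():
        counts[lang] = sum(1 for tok in tokens if tok in words)
    best_lang, best_count = counts.most_common(1)[0]
    return best_lang if best_count > 0 else None


if __name__ == "__main__":
    for sample in ("The cat is in the house",
                   "El gato es de la casa",
                   "Der Hund ist nicht das Problem"):
        print(sample, "->", identify_language(sample))
```

A heuristic like this breaks down on short, mixed-language, or transliterated input, which is the kind of setting several of the papers below address.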
Latest papers with no code
A Federated Learning Approach to Privacy Preserving Offensive Language Identification
Since most social media data originates from end users, we propose a privacy-preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification.
FastSpell: the LangId Magic Spell
Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts.
More than words: Advancements and challenges in speech recognition for singing
This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition.
Validating and Exploring Large Geographic Corpora
The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations.
Aligning Speech to Languages to Enhance Code-switching Speech Recognition
Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14.1% and 5.5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).
Detecting Structured Language Alternations in Historical Documents by Combining Language Identification with Fourier Analysis
In this study, we present a generalizable workflow to identify documents in a historic language with a nonstandard language and script combination, Armeno-Turkish.
Acoustic characterization of speech rhythm: going beyond metrics with recurrent neural networks
We argue that deep learning offers a powerful pattern-recognition approach to advance the characterization of the acoustic bases of speech rhythm.
Language Detection for Transliterated Content
The comprehensive exploration of transliteration dynamics, supported by innovative approaches and cutting-edge technologies like BERT, positions our research at the forefront of addressing unique challenges in the linguistic landscape of digital communication.
Generative linguistic representation for spoken language identification
Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance.