Language Identification

123 papers with code • 6 benchmarks • 19 datasets

Language identification is the task of determining the language of a text.
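As a rough illustration of the task, here is a minimal sketch of one classic approach: comparing character trigram frequencies against per-language profiles. It does not come from any paper listed on this page. The sample sentences, the function names (trigram_profile, cosine, identify), and the use of cosine similarity are all assumptions made for the example. Real systems are trained on far more data and cover many more languages.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return relative frequencies of character trigrams in the text."""
    padded = f"  {text.lower()} "  # pad so word boundaries form trigrams too
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Hypothetical toy training data: one short sentence per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the text",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y este es el idioma del texto",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist die sprache des textes",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def identify(text):
    """Return the code of the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

Calling identify("the dog is over the fox") returns the language code with the closest trigram profile. With profiles this small, short or out-of-vocabulary inputs can easily be misclassified.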


Latest papers with no code

A Federated Learning Approach to Privacy Preserving Offensive Language Identification

no code yet • 17 Apr 2024

Since most social media data originates from end users, we propose a privacy preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification.

FastSpell: the LangId Magic Spell

no code yet • 12 Apr 2024

Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts.

More than words: Advancements and challenges in speech recognition for singing

no code yet • 14 Mar 2024

This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition.

Validating and Exploring Large Geographic Corpora

no code yet • 13 Mar 2024

The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations.

Aligning Speech to Languages to Enhance Code-switching Speech Recognition

no code yet • 9 Mar 2024

Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14.1% and 5.5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

no code yet • 20 Feb 2024

Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).

Detecting Structured Language Alternations in Historical Documents by Combining Language Identification with Fourier Analysis

no code yet • 25 Jan 2024

In this study, we present a generalizable workflow to identify documents in a historic language with a nonstandard language and script combination, Armeno-Turkish.

Acoustic characterization of speech rhythm: going beyond metrics with recurrent neural networks

no code yet • 22 Jan 2024

We argue that deep learning offers a powerful pattern-recognition approach to advance the characterization of the acoustic bases of speech rhythm.

Language Detection for Transliterated Content

no code yet • 9 Jan 2024

The comprehensive exploration of transliteration dynamics, supported by innovative approaches and cutting-edge technologies like BERT, positions our research at the forefront of addressing unique challenges in the linguistic landscape of digital communication.

Generative linguistic representation for spoken language identification

no code yet • 18 Dec 2023

Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance.