Speech Recognition

1079 papers with code • 314 benchmarks • 86 datasets

Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to transcribe speech accurately, either in real time or from recorded audio, while accounting for factors such as accents, speaking speed, and background noise.

(Image credit: SpecAugment)
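
Transcription accuracy in this task is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that calculation, not the scoring code of any particular benchmark. The function name `word_error_rate` is hypothetical, and the normalization used here (lowercasing and whitespace splitting) is an assumption; real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.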

Latest papers with no code

ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus

no code yet • 27 Mar 2024

We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus.

DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition

no code yet • 26 Mar 2024

End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks.

Extracting Biomedical Entities from Noisy Audio Transcripts

no code yet • 26 Mar 2024

Our dataset offers a comprehensive collection of almost 2,000 clean and noisy recordings.

Privacy-Preserving End-to-End Spoken Language Understanding

no code yet • 22 Mar 2024

Thus, the SLU system needs to ensure that a potential malicious attacker cannot deduce the sensitive attributes of the users, while it should avoid greatly compromising the SLU accuracy.

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

no code yet • 21 Mar 2024

Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command.

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

no code yet • 21 Mar 2024

It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes.

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

no code yet • 21 Mar 2024

Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.

Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robot

no code yet • 20 Mar 2024

Embracing the common demand of researchers for better performance and new features for NAO, the authors took advantage of the ability to run ROS2 onboard on the NAO to develop a framework independent of the APIs provided by the manufacturer.

BanglaNum -- A Public Dataset for Bengali Digit Recognition from Speech

no code yet • 20 Mar 2024

Automatic speech recognition (ASR) converts the human voice into readily understandable and categorized text or words.