883 papers with code • 315 benchmarks • 195 datasets
Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to transcribe speech accurately, either in real time or from recorded audio, while accounting for factors such as accents, speaking speed, and background noise.
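Transcription accuracy is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal illustration of that metric, not code from any paper listed here; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One inserted word against a two-word reference.
    print(word_error_rate("hello world", "hello there world"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.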
DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model
Multilingual self-supervised speech representation models have greatly enhanced the speech recognition performance for low-resource languages, and the compression of these huge models has also become a crucial prerequisite for their industrial application.
Improved DeepFake Detection Using Whisper Features
With a recent influx of voice generation methods, the threat introduced by audio DeepFake (DF) is ever-increasing.
SlothSpeech: Denial-of-service Attack Against Speech Recognition Models
We show that popular ASR models such as Speech2Text and Whisper perform input-dependent dynamic computation, so their efficiency varies with the input.
Perception and Semantic Aware Regularization for Sequential Confidence Calibration
In this work, we find that tokens/sequences with high perceptual and semantic correlation to the target ones carry more correlated and effective information, and thus facilitate more effective regularization.
Graph Neural Networks for Contextual ASR with the Tree-Constrained Pointer Generator
The incorporation of biasing words obtained through contextual knowledge is of paramount importance in automatic speech recognition (ASR) applications.
CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice
We introduce a simple-to-follow recipe aligned to the SpeechBrain toolkit for accent classification based on Common Voice 7.0 (English) and Common Voice 11.0 (Italian, German, and Spanish).
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data.
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
We present BIG-C (Bemba Image Grounded Conversations), a large multimodal dataset for Bemba.
Unit-based Speech-to-Speech Translation Without Parallel Data
We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages.
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to improve access to information for many more people.