Speech recognition is the task of recognizing speech within audio and converting it into text.
(Image credit: SpecAugment)
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages.
Ranked #1 on Noisy Speech Recognition on CHiME clean
ACCENTED SPEECH RECOGNITION · END-TO-END SPEECH RECOGNITION · NOISY SPEECH RECOGNITION
We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation.
Ranked #2 on Speech-to-Text Translation on MuST-C EN->DE
END-TO-END SPEECH RECOGNITION · MACHINE TRANSLATION · MULTI-TASK LEARNING · SPEECH RECOGNITION · SPEECH-TO-TEXT TRANSLATION
This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model on the raw waveform of speech in multiple languages.
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Ranked #1 on Speech Recognition on TIMIT (using extra training data)
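The wav2vec-style pretraining described above learns representations with a contrastive task: a context vector must identify the true future latent among distractors. A minimal sketch of that InfoNCE-style objective, in pure Python with hypothetical vectors (the real models use learned Transformer features and many negatives):

```python
import math

def infonce_loss(context, positive, distractors, temperature=0.1):
    """Contrastive loss: the context vector should score the true
    (positive) latent higher than the distractor latents."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # similarity scores, positive first, scaled by temperature
    scores = [dot(context, positive) / temperature]
    scores += [dot(context, d) / temperature for d in distractors]

    # softmax cross-entropy with the positive at index 0
    log_z = math.log(sum(math.exp(s) for s in scores))
    return log_z - scores[0]

# toy check: an aligned positive yields a lower loss than a misaligned one
good = infonce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = infonce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Minimizing this loss pushes the context representation toward its own future frames and away from negatives, which is what lets the model learn from unlabeled audio before fine-tuning on transcripts.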
We present a state-of-the-art speech recognition system developed using end-to-end deep learning.
On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model.
Ranked #1 on Speech Recognition on Hub5'00 SwitchBoard
DATA AUGMENTATION · END-TO-END SPEECH RECOGNITION · LANGUAGE MODELLING · SPEECH RECOGNITION
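The WER figures quoted in these entries are word error rates: the Levenshtein edit distance between hypothesis and reference word sequences, divided by the reference length. A minimal sketch, assuming whitespace tokenization (`wer` and `edit_distance` are illustrative helpers, not from any of the cited papers):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deletion over six reference words.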
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data.
Ranked #1 on Speech Recognition on LibriSpeech train-clean-100 test-clean (using extra training data)
We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task.
Ranked #2 on Speech Recognition on TIMIT (using extra training data)
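The discrete representations that vq-wav2vec learns come from vector quantization: each continuous feature frame is replaced by the index of its nearest codebook entry, turning audio into a token sequence. A toy sketch of that lookup (the codebook and frames here are made up; the real model learns the codebook jointly via Gumbel-softmax or online k-means):

```python
def quantize(frame, codebook):
    """Map a continuous feature frame to the index of the nearest
    codebook entry under squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(frame, codebook[i]))

# hypothetical 2-D codebook and feature frames
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.1], [0.2, 0.8]]

ids = [quantize(f, codebook) for f in frames]  # -> [0, 1, 2]
```

The resulting discrete token sequence is what lets standard NLP-style models (e.g. BERT) be applied to audio downstream.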
Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data are available.
Ranked #5 on Speech Recognition on TIMIT (using extra training data)
This paper proposes a parallel computation strategy and a posterior-based lattice expansion algorithm for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition.
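Lattice rescoring generalizes the simpler idea of n-best rescoring: re-rank first-pass hypotheses by interpolating the recognizer's score with a neural LM score. A minimal sketch of the n-best variant under that interpolation (the `toy_lm`, weight, and hypothesis list are illustrative, not the paper's algorithm, which operates on full lattices):

```python
def rescore(nbest, lm_score, lm_weight=0.5):
    """Re-rank first-pass hypotheses by combining the first-pass
    score with a language-model score (both treated as log-probs)."""
    rescored = [(first_pass + lm_weight * lm_score(text), text)
                for text, first_pass in nbest]
    rescored.sort(reverse=True)
    return rescored[0][1]

# toy LM that rewards the linguistically plausible hypothesis
def toy_lm(text):
    return 0.0 if "recognition" in text else -5.0

nbest = [("speech wreck ignition", -1.0),   # better acoustic score
         ("speech recognition", -1.5)]       # better LM score

best = rescore(nbest, toy_lm)  # -> "speech recognition"
```

The paper's contribution is doing this efficiently over lattices, which encode exponentially many hypotheses, rather than over a flat n-best list.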