719 papers with code • 278 benchmarks • 184 datasets
Speech recognition is the task of recognising speech within audio and converting it into text.
(Image credit: SpecAugment)
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages.
On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model.
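For readers unfamiliar with the two terms in the result above, here is a minimal sketch. It assumes the standard definitions: word error rate (WER) as the word-level edit distance between hypothesis and reference, divided by the number of reference words, and shallow fusion as a weighted sum of the ASR and language-model log-probabilities at decoding time. The function names, the whitespace tokenization, and the default `lm_weight` are illustrative assumptions, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()   # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion (reference word missing)
                curr[j - 1] + 1,                   # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def shallow_fusion_score(asr_log_prob: float, lm_log_prob: float,
                         lm_weight: float = 0.3) -> float:
    """Combined score for one candidate: log p_asr + lm_weight * log p_lm.

    lm_weight is a hypothetical default; in practice it is tuned on held-out data.
    """
    return asr_log_prob + lm_weight * lm_log_prob


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In a real decoder the fused score would be applied to every candidate token at each beam-search step; the function here only shows the arithmetic for a single candidate.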
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in automatic speech recognition (ASR), outperforming recurrent neural networks (RNNs).
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition.