We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages.
We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation.
Ranked #8 on Speech-to-Text Translation on MuST-C EN->DE
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Ranked #1 on Speech Recognition on TIMIT (using extra training data)
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.
Ranked #3 on Speech Recognition on LibriSpeech test-other (using extra training data)
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
Ranked #1 on Paraphrase Identification on Quora Question Pairs (Accuracy metric)
We present a state-of-the-art speech recognition system developed using end-to-end deep learning.
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Ranked #4 on Speech Synthesis on North American English
This paper introduces a convolutional recurrent network with attention for speech command recognition.
Ranked #12 on Keyword Spotting on Google Speech Commands