Search Results

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

31 code implementations 8 Dec 2015

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages.

Accented Speech Recognition, Noisy Speech Recognition, +1

Unsupervised Cross-lingual Representation Learning for Speech Recognition

4 code implementations 24 Jun 2020

This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.

Quantization, Representation Learning, +2

fairseq S2T: Fast Speech-to-Text Modeling with fairseq

3 code implementations Asian Chapter of the Association for Computational Linguistics 2020

We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation.

Machine Translation, Multi-Task Learning, +4

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

16 code implementations NeurIPS 2020

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Ranked #1 on Speech Recognition on TIMIT (using extra training data)

Quantization, Self-Supervised Learning, +2

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

4 code implementations 14 Jun 2021

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.

Ranked #3 on Speech Recognition on LibriSpeech test-other (using extra training data)

Representation Learning, Speech Recognition

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

4 code implementations Preprint 2022

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.

Image Classification, Linguistic Acceptability, +6

Tacotron: Towards End-to-End Speech Synthesis

28 code implementations 29 Mar 2017

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Speech Synthesis, Text-To-Speech Synthesis

Unsupervised Speech Recognition

3 code implementations NeurIPS 2021

Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe.

speech-recognition, Speech Recognition, +1

A neural attention model for speech command recognition

8 code implementations 27 Aug 2018

This paper introduces a convolutional recurrent network with attention for speech command recognition.

Image Captioning