Search Results

Unsupervised Cross-lingual Representation Learning for Speech Recognition

8 code implementations 24 Jun 2020

This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.

Quantization Representation Learning +2

fairseq S2T: Fast Speech-to-Text Modeling with fairseq

5 code implementations Asian Chapter of the Association for Computational Linguistics 2020

We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation.

Machine Translation Multi-Task Learning +4

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

36 code implementations 8 Dec 2015

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages.

Accented Speech Recognition Noisy Speech Recognition

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

25 code implementations NeurIPS 2020

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Ranked #1 on Speech Recognition on TIMIT (using extra training data)

Quantization Self-Supervised Learning +1

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

10 code implementations 14 Jun 2021

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.

Clustering Language Modelling +3

Robust Speech Recognition via Large-Scale Weak Supervision

13 code implementations Preprint 2022

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.

Robust Speech Recognition speech-recognition +1

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

1 code implementation 17 Jul 2023

Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions.

Decoder Language Modeling +4

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

11 code implementations Preprint 2022

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.

Image Classification Linguistic Acceptability +5

Speech-to-speech translation for a real-world unwritten language

1 code implementation arXiv 2022

We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.

Ranked #1 on Speech-to-Speech Translation on TAT (using extra training data)

Speech-to-Speech Translation Translation

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

1 code implementation 15 Dec 2022

We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization.

Decoder Denoising +4