Our results for models pre-trained on the 960h LibriSpeech dataset and fine-tuned on 10h of transcribed data show that, using the same stochastic model, we obtain a smooth trade-off between word error rate (WER) and inference time, with only marginal WER degradation compared to W2V2 and SEW models trained for a specific setting.
Two pre-training configurations for speech translation and recognition, respectively, are presented to alleviate subtask interference.
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.
Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language.
no code implementations • 21 Mar 2022 • Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning.
In this paper, we present a controlled study to better understand the effect of such factors on the performance of pre-trained representations.
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
Ranked #1 on Paraphrase Identification on Quora Question Pairs (Accuracy metric)
1 code implementation • 17 Nov 2021 • Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli
On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English.
Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data.
Reranking models enable the integration of rich features to select a better output hypothesis within an n-best list or lattice.
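The reranking idea above reduces to scoring every hypothesis in an n-best list with a weighted combination of features and keeping the best one. A minimal sketch, with illustrative feature names and hand-set weights rather than anything taken from the paper:

```python
# Minimal sketch of n-best reranking with a log-linear feature combination.
# Feature names and weights are illustrative, not the paper's.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    features: dict  # e.g. {"model_logprob": -4.2, "lm_logprob": -10.1, "length": 7}

def rerank(nbest, weights):
    """Return the hypothesis with the highest weighted feature score."""
    def score(hyp):
        return sum(weights[name] * value for name, value in hyp.features.items())
    return max(nbest, key=score)

nbest = [
    Hypothesis("the cat sat", {"model_logprob": -3.1, "lm_logprob": -8.0, "length": 3}),
    Hypothesis("a cat sat",   {"model_logprob": -3.0, "lm_logprob": -9.5, "length": 3}),
]
weights = {"model_logprob": 1.0, "lm_logprob": 0.3, "length": 0.1}
print(rerank(nbest, weights).text)
```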
We present a simple yet effective approach to build multilingual speech-to-text (ST) translation through efficient transfer learning from a pretrained speech encoder and text decoder.
Language identification greatly impacts the success of downstream tasks such as automatic speech recognition.
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe.
In this paper, we improve speech translation (ST) through effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways.
2 code implementations • 2 Apr 2021 • Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli
On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%.
Document-level machine translation conditions on surrounding sentences to produce coherent translations.
We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated.
Pre-training models on vast quantities of unlabeled data has emerged as an effective approach to improving accuracy on many NLP tasks.
Ranked #1 on Machine Translation on WMT2016 Romanian-English (using extra training data)
We present a simple yet effective approach to build multilingual speech-to-text (ST) translation by efficient transfer learning from a pretrained speech encoder and text decoder.
Neural latent variable models enable the discovery of interesting structure in speech audio data.
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data.
Ranked #1 on Speech Recognition on LibriSpeech train-clean-100 test-other (using extra training data)
4 code implementations • 21 Oct 2020 • Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages.
Unsupervised pre-training has led to much recent progress in natural language understanding.
This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Ranked #1 on Speech Recognition on TIMIT (using extra training data)
We address this problem by reasoning counterfactually about the loss distribution of examples with uniform random labels had they been trained with the real examples, and use this information to remove noisy examples from the training set.
Ranked #29 on Image Classification on mini WebVision 1.0
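As a rough sketch of the filtering step described above, one can compare each training example's loss against the loss distribution of a control group whose labels were replaced with uniform random labels. The percentile threshold and its direction below are assumptions for illustration, not the paper's exact procedure:

```python
# Hedged sketch: examples whose loss resembles the random-label control
# distribution are treated as likely mislabeled and dropped.
import numpy as np

def filter_noisy(train_losses, control_losses, percentile=10):
    """Keep indices of examples whose loss stays below the low tail of the
    random-label control distribution (threshold choice is an assumption)."""
    threshold = np.percentile(control_losses, percentile)
    return [i for i, loss in enumerate(train_losses) if loss < threshold]

# Example: losses collected after a few epochs of training.
train_losses = np.array([0.1, 0.2, 2.5, 0.15, 3.0])
control_losses = np.array([2.0, 2.8, 3.2, 2.6])    # uniform-random-label examples
print(filter_noisy(train_losses, control_losses))   # -> [0, 1, 3]
```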
Neural sequence to sequence models are well established for applications which can be cast as mapping a single input sequence into a single output sequence.
We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization.
State-of-the-art sequence-to-sequence models for large-scale tasks perform a fixed number of computations for each input sequence, regardless of whether it is easy or hard to process.
We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task.
Ranked #2 on Speech Recognition on TIMIT (using extra training data)
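A hedged sketch of the kind of quantization module such a model can use to turn continuous frame features into discrete codes: a Gumbel-softmax codebook lookup with illustrative sizes and temperature, not the paper's configuration:

```python
# Minimal Gumbel-softmax vector quantizer (single codebook, no grouping).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    def __init__(self, input_dim=512, num_codes=320, code_dim=512, temp=2.0):
        super().__init__()
        self.logits = nn.Linear(input_dim, num_codes)   # scores per codebook entry
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.temp = temp

    def forward(self, x):                 # x: (batch, time, input_dim)
        logits = self.logits(x)           # (batch, time, num_codes)
        # Differentiable one-hot sample; an argmax would be used at inference time.
        one_hot = F.gumbel_softmax(logits, tau=self.temp, hard=True)
        codes = one_hot.argmax(dim=-1)    # discrete indices
        quantized = one_hot @ self.codebook.weight      # (batch, time, code_dim)
        return quantized, codes

q = GumbelQuantizer()
feats = torch.randn(2, 100, 512)
quantized, codes = q(feats)
print(quantized.shape, codes.shape)  # torch.Size([2, 100, 512]) torch.Size([2, 100])
```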
While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures, and many events we experience in our everyday life pertain only to the specific place we live in.
Previous work on neural noisy channel modeling relied on latent variable models that incrementally process the source and target sentence.
Back-translation is a widely used data augmentation technique which leverages target monolingual data.
We introduce the first large-scale corpus for long-form question answering, a task requiring elaborate and in-depth answers to open-ended questions.
This paper describes Facebook FAIR's submission to the WMT19 shared news translation task.
Ranked #1 on Machine Translation on WMT2019 English-German
Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available.
Ranked #5 on Speech Recognition on TIMIT (using extra training data)
fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks.
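A short usage sketch: loading one of the released translation models through torch.hub and translating a sentence. The hub entry name and options follow the fairseq README; check the current documentation in case they have changed:

```python
# Hedged usage sketch of fairseq via torch.hub (entry name from the README).
import torch

en2de = torch.hub.load(
    'pytorch/fairseq',
    'transformer.wmt19.en-de.single_model',
    tokenizer='moses',
    bpe='fastbpe',
)
en2de.eval()
print(en2de.translate('Machine learning is great!', beam=5))
```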
Pre-trained language model representations have been successful in a wide range of language understanding tasks.
We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems.
Ranked #8 on Constituency Parsing on Penn Treebank
We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements.
Ranked #1 on Machine Translation on WMT 2017 English-Chinese
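A minimal sketch of the idea in the entry above: a small linear layer predicts a normalized kernel from the current time-step alone, and that kernel weights a window of surrounding context. Single head, symmetric window, illustrative sizes; not the paper's implementation:

```python
# Dynamic convolution sketch: per-position kernels predicted from the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    def __init__(self, dim=256, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.kernel_pred = nn.Linear(dim, kernel_size)   # kernel from the current step

    def forward(self, x):                                # x: (batch, time, dim)
        k = self.kernel_size
        weights = F.softmax(self.kernel_pred(x), dim=-1) # (batch, time, k)
        # Unfold a window of k context vectors centred on each position.
        pad = k // 2
        x_padded = F.pad(x, (0, 0, pad, pad))            # pad the time axis
        windows = x_padded.unfold(1, k, 1)               # (batch, time, dim, k)
        # Weighted sum of each window using its predicted kernel.
        return torch.einsum('btdk,btk->btd', windows, weights)

layer = DynamicConv1d()
x = torch.randn(2, 10, 256)
print(layer(x).shape)  # torch.Size([2, 10, 256])
```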
We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints.
Ranked #5 on Monocular 3D Human Pose Estimation on Human3.6M
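A hedged sketch of the back-projection step used for semi-supervised training: project the predicted 3D joints back to the image plane and penalize the distance to the 2D keypoints used as input. The simple pinhole camera and the loss below are simplifying assumptions:

```python
# Back-projection loss sketch for 2D -> 3D -> 2D consistency.
import torch

def project_to_2d(joints_3d, focal, principal_point):
    """Pinhole projection: (x, y, z) -> (f * x / z + cx, f * y / z + cy)."""
    xy = joints_3d[..., :2]
    z = joints_3d[..., 2:].clamp(min=1e-4)
    return focal * xy / z + principal_point

def backprojection_loss(pred_3d, input_2d, focal, principal_point):
    reprojected = project_to_2d(pred_3d, focal, principal_point)
    return torch.mean(torch.norm(reprojected - input_2d, dim=-1))

pred_3d = torch.randn(8, 17, 3) + torch.tensor([0.0, 0.0, 4.0])  # joints in front of the camera
input_2d = torch.randn(8, 17, 2)
loss = backprojection_loss(pred_3d, input_2d, focal=1000.0,
                           principal_point=torch.tensor([512.0, 512.0]))
print(loss.item())
```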
In open-domain dialogue, intelligent agents should exhibit the use of knowledge; however, there are few convincing demonstrations of this to date.
We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity.
Ranked #5 on Language Modelling on One Billion Word
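A minimal sketch of the adaptive-input idea: the vocabulary is split into frequency bands, each band gets a smaller embedding dimension, and a linear projection maps every band back to the model dimension. The band cut-offs and reduction factor below are illustrative:

```python
# Adaptive input embedding sketch: variable-capacity embeddings per frequency band.
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    def __init__(self, vocab_size=50000, model_dim=512, cutoffs=(5000, 20000), factor=4):
        super().__init__()
        self.model_dim = model_dim
        self.edges = [0, *cutoffs, vocab_size]
        self.embeddings = nn.ModuleList()
        self.projections = nn.ModuleList()
        for i in range(len(self.edges) - 1):
            dim = model_dim // (factor ** i)             # rarer bands -> smaller dimension
            self.embeddings.append(nn.Embedding(self.edges[i + 1] - self.edges[i], dim))
            self.projections.append(nn.Linear(dim, model_dim, bias=False))

    def forward(self, tokens):                           # tokens: (batch, time)
        out = torch.zeros(*tokens.shape, self.model_dim)
        for i in range(len(self.edges) - 1):
            lo, hi = self.edges[i], self.edges[i + 1]
            mask = (tokens >= lo) & (tokens < hi)
            if mask.any():
                emb = self.embeddings[i](tokens[mask] - lo)
                out[mask] = self.projections[i](emb)     # back to the model dimension
        return out

layer = AdaptiveInput()
tokens = torch.randint(0, 50000, (2, 16))
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```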
An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences.
Ranked #2 on Machine Translation on WMT2014 English-German (using extra training data)
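The augmentation loop behind the approach described above is short; a hedged sketch where `reverse_model` is a stand-in for any target-to-source translation model exposing a `translate` method (a hypothetical interface, e.g. a loaded translation model):

```python
# Back-translation data augmentation sketch.
def back_translate(monolingual_targets, reverse_model):
    """Pair each target-side monolingual sentence with a synthetic source."""
    synthetic_pairs = []
    for target_sentence in monolingual_targets:
        synthetic_source = reverse_model.translate(target_sentence)
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs

def build_training_data(parallel_pairs, monolingual_targets, reverse_model):
    # Real bitext plus synthetic pairs; upsampling/weighting is a separate choice.
    return parallel_pairs + back_translate(monolingual_targets, reverse_model)
```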
Sequence to sequence learning models still require several days to reach state-of-the-art performance on large benchmark datasets using a single machine.
Ranked #12 on Machine Translation on WMT2014 English-French
We propose tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations.
Current models for document summarization disregard user preferences such as the desired length, style, the entities that the user might be interested in, or how much of the document the user has already read.
There has been much recent work on training neural attention models at the sequence-level using either reinforcement learning-style methods or by optimizing the beam.
Ranked #4 on Machine Translation on IWSLT2015 German-English
The prevalent approach to sequence to sequence learning maps an input sequence to a variable-length output sequence via recurrent neural networks.
Ranked #6 on Machine Translation on IWSLT2015 English-German
The predominant approach to language modeling to date is based on recurrent neural networks.
Ranked #18 on Language Modelling on One Billion Word
The prevalent approach to neural machine translation relies on bi-directional LSTMs to encode the source sentence.
Ranked #7 on Machine Translation on IWSLT2015 German-English
Existing machine translation decoding algorithms generate translations in a strictly monotonic fashion and never revisit previous decisions.
Classical translation models constrain the space of possible outputs by selecting a subset of translation rules based on the input sentence.
We present a simple neural network for word alignment that builds source and target word window representations to compute alignment scores for sentence pairs.
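A hedged sketch of a window-based alignment scorer in this spirit: embed a small window of words around a source position and a target position, concatenate the two window representations, and score the pair with an MLP. The MLP scorer and all sizes are illustrative choices, not the paper's exact model:

```python
# Window-based word alignment scorer sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAlignmentScorer(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, window=3, hidden=128):
        super().__init__()
        self.window = window
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * window * emb_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def window_repr(self, tokens, position):          # tokens: (seq_len,) word ids
        half = self.window // 2
        padded = F.pad(tokens, (half, half), value=0)  # pad with a dummy id
        span = padded[position:position + self.window]
        return self.embed(span).flatten()              # (window * emb_dim,)

    def score(self, src_tokens, tgt_tokens, i, j):
        feats = torch.cat([self.window_repr(src_tokens, i),
                           self.window_repr(tgt_tokens, j)])
        return self.mlp(feats)

scorer = WindowAlignmentScorer()
src = torch.randint(1, 10000, (7,))
tgt = torch.randint(1, 10000, (9,))
print(scorer.score(src, tgt, i=2, j=4).item())         # alignment score for pair (2, 4)
```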
This paper introduces a neural model for concept-to-text generation that scales to large, rich domains.
Ranked #4 on Table-to-Text Generation on WikiBio
Training neural network language models over large vocabularies is still computationally very costly compared to count-based models such as Kneser-Ney.
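One family of answers to this cost is approximating the full-vocabulary softmax. As a concrete, runnable example, PyTorch ships nn.AdaptiveLogSoftmaxWithLoss, an implementation of the adaptive softmax referenced elsewhere in this list; the cut-offs below are illustrative and not tied to this particular paper:

```python
# Minimal usage sketch of an adaptive softmax output layer for a large vocabulary.
import torch
import torch.nn as nn

vocab_size, hidden_dim = 50000, 512
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_dim,
    n_classes=vocab_size,
    cutoffs=[2000, 10000],   # frequent words in the head, rare words in smaller tails
)

hidden_states = torch.randn(32, hidden_dim)    # e.g. LM hidden states for 32 positions
targets = torch.randint(0, vocab_size, (32,))
output = adaptive_softmax(hidden_states, targets)
print(output.loss)                              # average negative log-likelihood
```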
Many natural language processing applications use language models to generate text.
Ranked #14 on Machine Translation on IWSLT2015 German-English
We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs.
We present a novel response generation system that can be trained end to end on large quantities of unstructured Twitter conversations.