2 code implementations • Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
Expanding the language coverage of speech technology has the potential to improve access to information for many more people.
We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-the-art results on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find a chair") tasks without any task-specific modules for object detection, segmentation, mapping, or planning.
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems.
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources.
Ranked #87 on Image Classification on ImageNet
no code implementations • 15 Nov 2022 • Derek Xu, Shuyan Dong, Changhan Wang, Suyoun Kim, Zhaojiang Lin, Akshat Shrivastava, Shang-Wen Li, Liang-Hsuan Tseng, Alexei Baevski, Guan-Ting Lin, Hung-Yi Lee, Yizhou Sun, Wei Wang
Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)
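For intuition, here is a minimal sketch of the masking step the Audio-MAE snippet above describes: keep a random subset of spectrogram patches and run the encoder on those alone. Function names, shapes, and the mask ratio are illustrative, not the paper's actual API.

```python
import torch

def mask_and_encode(patches, encoder, mask_ratio=0.8):
    """Keep a random subset of spectrogram patches; encode only those.

    patches: (batch, num_patches, dim) embedded spectrogram patches.
    encoder: any module mapping (batch, n, dim) -> (batch, n, dim).
    Returns encoded visible tokens plus the permutation needed to
    restore patch order when the decoder re-inserts mask tokens.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    # Random permutation per example; the first n_keep indices are kept.
    noise = torch.rand(b, n, device=patches.device)
    shuffle = noise.argsort(dim=1)
    keep = shuffle[:, :n_keep]

    visible = patches.gather(1, keep.unsqueeze(-1).expand(-1, -1, d))
    return encoder(visible), shuffle
```

Because only ~20% of tokens pass through the encoder, pre-training cost drops roughly in proportion, which is the main practical appeal of the high masking ratio.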
Self-supervised learning (SSL) of speech representations has received much attention over the last few years but most work has focused on languages and domains with an abundance of unlabeled data.
In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules.
Our results for models pre-trained on the 960h LibriSpeech dataset and fine-tuned on 10h of transcribed data show that, using the same stochastic model, we get a smooth trade-off between word error rate (WER) and inference time, with only marginal WER degradation compared to the W2V2 and SEW models trained for a specific setting.
Two pre-training configurations for speech translation and recognition, respectively, are presented to alleviate subtask interference.
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.
Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language.
In this paper, we present a controlled study to better understand the effect of such factors on the performance of pre-trained representations on automatic speech recognition.
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
Ranked #1 on Paraphrase Identification on Quora Question Pairs (Accuracy metric)
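The two lines above describe data2vec, whose shared objective across speech, vision, and text is masked prediction of an exponential-moving-average teacher's contextualized representations. The sketch below is a hedged approximation of that idea; the `return_all_layers` flag and the zero-fill masking are illustrative stand-ins, not the released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights track the student via an exponential moving average.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

def data2vec_loss(student, teacher, x, mask):
    """x: (batch, time, dim) inputs; mask: boolean (batch, time)."""
    with torch.no_grad():
        # Target: average of the teacher's top-K transformer layers on the
        # unmasked input (list-of-layers return is assumed for the sketch).
        layers = teacher(x, return_all_layers=True)
        target = torch.stack(layers[-8:]).mean(0)

    # Simplification: masked positions are zeroed here; the real model
    # substitutes a learned mask embedding.
    pred = student(x.masked_fill(mask.unsqueeze(-1), 0.0))
    # Regress teacher targets at masked positions only.
    return F.mse_loss(pred[mask], target[mask])
```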
2 code implementations • 17 Nov 2021 • Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli
On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English.
Ranked #1 on Language Identification on VoxLingua107 (using extra training data)
Recent progress in self-training, self-supervised pretraining and unsupervised learning has enabled well-performing speech recognition systems without any labeled data.
We present a simple yet effective approach to build multilingual speech-to-text (ST) translation through efficient transfer learning from a pretrained speech encoder and text decoder.
Language identification greatly impacts the success of downstream tasks such as automatic speech recognition.
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe.
In this paper, we improve speech translation (ST) through effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways.
2 code implementations • 2 Apr 2021 • Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli
On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%.
2 code implementations • 1 Feb 2021 • Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation.
Ranked #1 on Resynthesis on LJSpeech
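Generative Spoken Language Modeling runs as a three-stage pipeline: speech is discretized into pseudo-text units, a language model is trained on those units, and a unit-to-speech model resynthesizes audio (the Resynthesis entry above). Below is a hedged sketch of the first stage only; the encoder interface and codebook size are assumptions, not the released tooling.

```python
import torch

def speech_to_units(wav, encoder, centroids):
    """Stage 1 of a GSLM-style pipeline: discretize speech into pseudo-text.

    wav:       (1, samples) raw waveform
    encoder:   frozen self-supervised model (HuBERT/wav2vec-style),
               assumed to return (1, frames, dim) features
    centroids: (k, dim) k-means codebook fit on encoder features
    """
    with torch.no_grad():
        feats = encoder(wav)                                # (1, T, D)
    dists = torch.cdist(feats, centroids.unsqueeze(0))      # (1, T, K)
    units = dists.argmin(-1).squeeze(0)                     # (T,) unit ids

    # Collapse consecutive repeats so the unit LM sees pseudo-text
    # rather than frame-rate duplicates.
    changed = units[1:] != units[:-1]
    first = torch.tensor([True], device=units.device)
    return units[torch.cat([first, changed])]
```

A standard autoregressive language model is then trained on these unit sequences, and generation is evaluated by resynthesizing sampled units back to speech.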
However, prior work has implicitly assumed that the best training configuration for model performance was also the best configuration for mask discovery.
We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated.
We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics.
Neural latent variable models enable the discovery of interesting structure in speech audio data.
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data.
Ranked #1 on Speech Recognition on LibriSpeech train-clean-100 test-other (using extra training data)
This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Ranked #1 on Speech Recognition on TIMIT (using extra training data)
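The wav2vec 2.0 entry above trains by identifying, for each masked time-step, the true quantized latent among distractors using cosine similarity. A minimal sketch of that contrastive term follows; tensor layouts are illustrative, and the full objective also adds a codebook diversity penalty omitted here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, distractor_idx, temp=0.1):
    """wav2vec 2.0-style objective at masked time-steps (sketch).

    context:        (n_masked, dim) transformer outputs at masked steps
    quantized:      (n_masked, dim) true quantized latents for those steps
    distractor_idx: (n_masked, n_neg) indices of negatives sampled from
                    other masked steps of the same utterance
    """
    negatives = quantized[distractor_idx]                   # (n, neg, dim)
    candidates = torch.cat([quantized.unsqueeze(1), negatives], dim=1)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)
    # The true latent sits at index 0 of each candidate set.
    targets = torch.zeros(len(context), dtype=torch.long)
    return F.cross_entropy(sims / temp, targets)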
We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization.
We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task.
Ranked #2 on Speech Recognition on TIMIT (using extra training data)
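vq-wav2vec makes the discretization step differentiable; one of its two reported variants uses a Gumbel-softmax over a fixed codebook (the other uses online k-means). The module below is a hedged, single-group sketch of the Gumbel variant with an illustrative codebook size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    """Differentiable discretization in the spirit of vq-wav2vec (sketch).

    Dense features are mapped to logits over a codebook; Gumbel-softmax
    yields hard codes in the forward pass while gradients flow through
    the soft relaxation.
    """
    def __init__(self, dim, num_codes=320):
        super().__init__()
        self.to_logits = nn.Linear(dim, num_codes)
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z, tau=1.0):
        logits = self.to_logits(z)                        # (B, T, K)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        indices = onehot.argmax(-1)                       # discrete ids
        quantized = onehot @ self.codebook.weight         # (B, T, dim)
        return quantized, indices
```

The resulting discrete codes can then be fed to models built for text, such as BERT-style pre-training, which is the motivation for quantizing in the first place.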
This paper describes Facebook FAIR's submission to the WMT19 shared news translation task.
Ranked #1 on Machine Translation on WMT2019 English-German
Our experiments on WSJ reduce the WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data are available.
Ranked #5 on Speech Recognition on TIMIT (using extra training data)
fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks.
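fairseq exposes released checkpoints through a torch.hub interface; the snippet below loads one of the WMT19 translation models from the entry above. The model identifier follows the fairseq README, but treat it as indicative and check the README for current names and required extras (fastBPE, sacremoses).

```python
import torch

# Load a released WMT19 translation model via fairseq's torch.hub interface.
en2de = torch.hub.load(
    'pytorch/fairseq',
    'transformer.wmt19.en-de.single_model',
    tokenizer='moses',
    bpe='fastbpe',
)
en2de.eval()

print(en2de.translate('Machine learning is great!'))
```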
Pre-trained language model representations have been successful in a wide range of language understanding tasks.
We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems.
Ranked #10 on Constituency Parsing on Penn Treebank
We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements.
Ranked #1 on Machine Translation on WMT 2017 English-Chinese
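The dynamic convolution entry above replaces self-attention with kernels predicted from the current time-step alone. The sketch below captures that idea in simplified form: it predicts one softmax-normalized kernel per position and shares it across all channels, whereas the actual model shares kernels across channel groups with multiple heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """Time-step-dependent lightweight convolution (sketch of the idea)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.kernel_predictor = nn.Linear(dim, kernel_size)

    def forward(self, x):                                  # x: (B, T, D)
        # One kernel per (batch, time) position, normalized over taps so
        # the weights act like attention over a fixed local window.
        w = F.softmax(self.kernel_predictor(x), dim=-1)    # (B, T, k)
        # Unfold local windows of width k (causal padding on the left).
        pad = F.pad(x, (0, 0, self.k - 1, 0))              # (B, T+k-1, D)
        windows = pad.unfold(1, self.k, 1)                 # (B, T, D, k)
        return torch.einsum('btdk,btk->btd', windows, w)
```

Because the kernel depends only on the current position rather than all pairwise comparisons, the cost is linear in sequence length instead of quadratic.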
We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity.
Ranked #5 on Language Modelling on One Billion Word
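The adaptive input entry above assigns embedding capacity by word frequency: frequent clusters get full-dimension embeddings, rarer clusters get progressively smaller ones, and each is projected back to the model dimension. A hedged sketch follows; the cutoffs and reduction factor are illustrative defaults, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    """Variable-capacity input embeddings (sketch of the idea)."""
    def __init__(self, vocab, dim, cutoffs=(20000, 60000), factor=4):
        super().__init__()
        self.cutoffs = list(cutoffs) + [vocab]
        self.embeds, self.projs = nn.ModuleList(), nn.ModuleList()
        prev = 0
        for i, cut in enumerate(self.cutoffs):
            d = dim // (factor ** i)          # dimension shrinks per cluster
            self.embeds.append(nn.Embedding(cut - prev, d))
            self.projs.append(nn.Linear(d, dim, bias=False))
            prev = cut

    def forward(self, tokens):                # tokens: (batch, time) ids,
        dim = self.projs[0].out_features      # sorted so low id = frequent
        out = tokens.new_zeros(*tokens.shape, dim, dtype=torch.float)
        prev = 0
        for cut, emb, proj in zip(self.cutoffs, self.embeds, self.projs):
            mask = (tokens >= prev) & (tokens < cut)
            if mask.any():
                # Embed within the cluster, then project to the model dim.
                out[mask] = proj(emb(tokens[mask] - prev))
            prev = cut
        return out
```

This mirrors the adaptive softmax on the output side: capacity tracks frequency, cutting parameters and compute for the long tail of rare words.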