no code implementations • 15 May 2022 • Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu
This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs.
no code implementations • 25 Apr 2022 • Apoorv Vyas, Wei-Ning Hsu, Michael Auli, Alexei Baevski
Our results for models pre-trained on the 960-hour LibriSpeech dataset and fine-tuned on 10 hours of transcribed data show that, using the same stochastic model, we obtain a smooth trade-off between word error rate (WER) and inference time, with only marginal WER degradation compared to W2V2 and SEW models trained for a specific setting.
no code implementations • ACL 2022 • Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino
Two pre-training configurations for speech translation and recognition, respectively, are presented to alleviate subtask interference.
no code implementations • 6 Apr 2022 • Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis.
no code implementations • 6 Apr 2022 • Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.
1 code implementation • 5 Apr 2022 • Alexander H. Liu, Wei-Ning Hsu, Michael Auli, Alexei Baevski
Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language.
Automatic Speech Recognition • Unsupervised Speech Recognition
no code implementations • 30 Mar 2022 • Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues.
no code implementations • 1 Mar 2022 • Ramon Sanabria, Wei-Ning Hsu, Alexei Baevski, Michael Auli
In this paper, we present a controlled study to better understand the effect of such factors on the performance of pre-trained representations.
1 code implementation • 15 Feb 2022 • Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi
Textless spoken language processing research aims to extend the applicability of the standard NLP toolset to spoken language and to languages with few or no textual resources.
4 code implementations • Preprint 2022 • Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
Ranked #1 on Paraphrase Identification on Quora Question Pairs (Accuracy metric)
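The shared recipe behind this cross-modal work is a student network that regresses onto contextual targets produced by an exponential-moving-average (EMA) teacher at masked positions. A minimal, hedged sketch of that idea follows; the encoder, mask handling, and hyper-parameters are illustrative assumptions, not the released data2vec code.

```python
# Hedged sketch of a data2vec-style objective: a student network regresses the
# masked time steps onto contextual targets produced by an EMA teacher.
# Encoder architecture and hyper-parameters are illustrative, not the paper's.
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        return self.encoder(x)

student = TinyEncoder()
teacher = copy.deepcopy(student)          # EMA teacher, never trained directly
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(student, teacher, decay=0.999):
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

x = torch.randn(8, 100, 64)               # (batch, time, feature) dummy input
mask = torch.rand(8, 100) < 0.5           # random time-step mask

with torch.no_grad():
    targets = teacher(x)                  # contextual targets from unmasked input

x_masked = x.clone()
x_masked[mask] = 0.0                      # simple mask token: zeros (assumption)
preds = student(x_masked)

loss = nn.functional.mse_loss(preds[mask], targets[mask])
loss.backward()
ema_update(student, teacher)
```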
1 code implementation • 5 Jan 2022 • Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.
Ranked #1 on Audio-Visual Speech Recognition on LRS3-TED
Audio-Visual Speech Recognition • Automatic Speech Recognition (+3)
1 code implementation • ICLR 2022 • Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed
The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
Ranked #1 on Lipreading on LRS3-TED (using extra training data)
no code implementations • 15 Dec 2021 • Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu
To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.
no code implementations • 14 Nov 2021 • Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi
We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.
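A hedged, interface-level sketch of the decomposition described above; the container and function names are hypothetical, not the paper's API.

```python
# Hedged sketch of the decomposition described above: speech is represented by
# four separate streams, so emotion can be changed while content is preserved.
# The container and function names are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import List

@dataclass
class DecomposedSpeech:
    content_units: List[int]   # discrete phonetic-content units (e.g. learned codes)
    f0: List[float]            # frame-level prosody (fundamental frequency)
    speaker_id: int            # speaker representation / identity
    emotion_id: int            # emotion label or token

def convert_emotion(rep: DecomposedSpeech, target_emotion: int) -> DecomposedSpeech:
    # Only the emotion stream changes here; content, prosody, and speaker are kept,
    # although in practice prosody/duration would also be re-predicted for the target emotion.
    return DecomposedSpeech(rep.content_units, rep.f0, rep.speaker_id, target_emotion)

neutral = DecomposedSpeech(content_units=[12, 12, 87, 3],
                           f0=[110.0, 112.5, 115.0, 118.0],
                           speaker_id=7, emotion_id=0)
happy = convert_emotion(neutral, target_emotion=3)
```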
no code implementations • 15 Oct 2021 • Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Phillip Koehn, Juan Pino
We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of the translation is independent of intermediate text representations.
1 code implementation • 14 Sep 2021 • Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino
This paper presents fairseq S^2, a fairseq extension for speech synthesis.
1 code implementation • ACL 2022 • Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu
Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training; it replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences.
no code implementations • ACL 2022 • Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu
When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.
no code implementations • 14 Jun 2021 • Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR).
4 code implementations • 14 Jun 2021 • Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.
Ranked #3 on Speech Recognition on LibriSpeech test-other (using extra training data)
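The approach implied by these three problems is offline clustering to obtain frame-level pseudo-labels plus a masked-prediction loss applied only over masked regions. A minimal, hedged sketch under those assumptions (feature type, model size, and masking strategy are illustrative):

```python
# Hedged sketch of a HuBERT-style training step: offline clustering provides
# frame-level pseudo-labels, and a masked-prediction loss is applied over
# masked regions only. Sizes and modules are illustrative, not the paper's.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

n_clusters, dim, T = 100, 39, 500
feats = torch.randn(T, dim)                       # stand-in for MFCC features
kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(feats.numpy())
targets = torch.tensor(kmeans.labels_, dtype=torch.long)   # pseudo-labels (problem 2)

class MaskedPredictor(nn.Module):
    def __init__(self, dim, n_clusters, hidden=128):
        super().__init__()
        self.proj = nn.Linear(dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, n_clusters)

    def forward(self, x):
        return self.head(self.encoder(self.proj(x)))

model = MaskedPredictor(dim, n_clusters)
mask = torch.rand(T) < 0.08                       # mask a fraction of frames
x = feats.clone()
x[mask] = 0.0                                     # zeros as a stand-in for a learned mask embedding
logits = model(x.unsqueeze(0)).squeeze(0)
loss = nn.functional.cross_entropy(logits[mask], targets[mask])  # loss on masked frames only
loss.backward()
```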
3 code implementations • NeurIPS 2021 • Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data, which limits this technology to a small fraction of the languages spoken around the globe.
2 code implementations • 2 Apr 2021 • Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli
On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%.
2 code implementations • 1 Apr 2021 • Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux
We propose using self-supervised discrete representations for the task of speech resynthesis.
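A hedged, toy-scale sketch of resynthesis from self-supervised discrete units: content codes, a quantized pitch track, and a speaker embedding condition a vocoder-like decoder. Names and shapes are assumptions; the actual system uses pretrained unit extractors and a neural vocoder rather than the toy modules below.

```python
# Hedged, interface-level sketch of resynthesis from self-supervised discrete
# units: content codes, quantized pitch, and a speaker embedding are fed to a
# vocoder-like decoder that generates the waveform. Names/shapes are assumptions.
import torch
import torch.nn as nn

class UnitVocoder(nn.Module):
    """Toy stand-in for a unit-based neural vocoder."""
    def __init__(self, n_units=100, n_f0_bins=32, n_speakers=10, dim=64, hop=160):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.f0_emb = nn.Embedding(n_f0_bins, dim)
        self.spk_emb = nn.Embedding(n_speakers, dim)
        self.upsample = nn.Linear(dim, hop)     # toy: one frame -> `hop` samples

    def forward(self, units, f0_bins, speaker):
        h = self.unit_emb(units) + self.f0_emb(f0_bins) + self.spk_emb(speaker).unsqueeze(1)
        return self.upsample(h).flatten(1)      # (batch, time * hop) waveform

voc = UnitVocoder()
units = torch.randint(0, 100, (1, 50))          # discrete content units (e.g. learned codes)
f0 = torch.randint(0, 32, (1, 50))              # quantized F0 track
spk = torch.tensor([3])                         # target speaker id
wav = voc(units, f0, spk)                       # resynthesized (toy) waveform
```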
2 code implementations • 1 Feb 2021 • Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation.
Ranked #1 on Resynthesis on LJSpeech
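A hedged sketch of the three-stage pipeline this task implies (speech-to-unit encoding, unit language modeling, unit-to-speech synthesis); the function names and toy models are placeholders, not the released components.

```python
# Hedged sketch of a generative spoken language modeling pipeline:
# speech -> discrete pseudo-text units -> unit language model -> speech.
# All models here are toys standing in for pretrained components.
import torch
import torch.nn as nn

def speech_to_units(wav: torch.Tensor) -> torch.Tensor:
    """Stand-in for a pretrained encoder + quantizer producing discrete units."""
    n_frames = wav.shape[-1] // 320                # toy 20 ms framing at 16 kHz
    return torch.randint(0, 100, (n_frames,))      # random units as a placeholder

unit_lm = nn.LSTM(input_size=16, hidden_size=32)   # toy "unit language model"
unit_embed = nn.Embedding(100, 16)
unit_head = nn.Linear(32, 100)

def continue_units(prompt_units: torch.Tensor, n_new: int = 20) -> torch.Tensor:
    """Autoregressively sample a continuation of the unit sequence."""
    units = prompt_units.tolist()
    hidden = None
    for _ in range(n_new):
        x = unit_embed(torch.tensor(units[-1:])).unsqueeze(1)   # (1, 1, 16)
        out, hidden = unit_lm(x, hidden)
        probs = unit_head(out[-1, 0]).softmax(-1)
        units.append(torch.multinomial(probs, 1).item())
    return torch.tensor(units)

def units_to_speech(units: torch.Tensor) -> torch.Tensor:
    """Stand-in for a unit-conditioned speech synthesizer."""
    return torch.zeros(units.shape[0] * 320)        # silent placeholder waveform

wav_prompt = torch.randn(16000)                      # 1 s of (fake) audio
generated = units_to_speech(continue_units(speech_to_units(wav_prompt)))
```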
no code implementations • ACL 2021 • Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision.
1 code implementation • 2 Oct 2020 • Awni Hannun, Vineel Pratap, Jacob Kahn, Wei-Ning Hsu
We introduce a framework for automatic differentiation with weighted finite-state transducers (WFSTs) allowing them to be used dynamically at training time.
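A hedged illustration of the core idea, written with plain tensor ops rather than the released library API: the forward (log-sum-exp over all paths) score of a small weighted graph is an ordinary differentiable computation, so gradients flow back to the arc weights.

```python
# Hedged illustration of differentiating through a WFST-style forward score
# (not the released GTN API): arc weights are a tensor with requires_grad, the
# forward score is computed by dynamic programming, and backward() gives
# gradients with respect to every arc weight.
import torch

# A tiny 3-state acceptor: arcs as (source_state, dest_state, weight_index).
arcs = [(0, 1, 0), (0, 1, 1), (1, 2, 2), (0, 2, 3)]
weights = torch.randn(4, requires_grad=True)        # learnable arc weights
start, accept, n_states = 0, 2, 3

# Dynamic-programming forward pass in topological state order (0 -> 1 -> 2).
alpha = [torch.tensor(float("-inf"))] * n_states
alpha[start] = torch.tensor(0.0)
for s in range(n_states):
    incoming = [alpha[src] + weights[w] for (src, dst, w) in arcs if dst == s]
    if incoming:
        alpha[s] = torch.logsumexp(torch.stack([alpha[s]] + incoming), dim=0)

forward_score = alpha[accept]                        # log-sum of all path weights
forward_score.backward()                             # d(score)/d(arc weights)
print(weights.grad)                                  # soft path occupancies
```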
no code implementations • 3 Jun 2020 • Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass
Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech.
1 code implementation • 24 Feb 2020 • Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun
For sequence transduction tasks like speech recognition, a strong structured prior model encodes rich information about the target space, implicitly ruling out invalid sequences by assigning them low probability.
Ranked #40 on Speech Recognition on LibriSpeech test-other
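A hedged, simplified illustration of the idea (not the exact local prior matching loss): hypotheses sampled from the recognizer on unlabeled audio are weighted by how plausible a pretrained language-model prior finds them, so the prior's knowledge of the target space supervises the acoustic model.

```python
# Hedged illustration of supervising an ASR model with a structured prior
# (not the exact local prior matching objective): sampled hypotheses are
# reweighted by a language-model prior, and the model is pushed toward the
# hypotheses the prior favors.
import torch

def lm_log_prior(hypothesis: str) -> torch.Tensor:
    """Stand-in for log p_prior(y) from a pretrained language model."""
    return torch.tensor(-float(len(hypothesis)))       # toy: shorter = more probable

# Sampled hypotheses and their (toy) log-probabilities under the ASR model.
hypotheses = ["the cat sat", "the cat sad", "thee cat sat"]
asr_log_probs = torch.randn(3, requires_grad=True)     # placeholder for model scores

prior = torch.stack([lm_log_prior(h) for h in hypotheses])
weights = torch.softmax(prior, dim=0)                   # renormalized prior over samples
loss = -(weights * asr_log_probs).sum()                 # push mass toward prior-favored hypotheses
loss.backward()
```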
1 code implementation • ICLR 2020 • David Harwath, Wei-Ning Hsu, James Glass
What differentiates this paper from prior work on speech unit learning is the choice of training objective.
no code implementations • 25 Sep 2019 • Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun
We propose local prior matching (LPM), a self-supervised objective for speech recognition.
no code implementations • 9 Jul 2019 • Wei-Ning Hsu, David Harwath, James Glass
Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks.
5 code implementations • 5 Apr 2019 • Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass
This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations.
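A hedged sketch of an autoregressive predictive objective of this kind: an RNN reads past frames and predicts a frame several steps ahead with an L1 loss, and the learned hidden states then serve as the speech representation. Layer sizes and the prediction shift are illustrative.

```python
# Hedged sketch of an autoregressive predictive objective: an RNN encodes past
# frames and predicts a frame n steps ahead, trained with an L1 loss.
# Layer sizes and the shift are illustrative assumptions.
import torch
import torch.nn as nn

n_shift, dim, hidden = 3, 80, 128                   # predict 3 frames ahead
rnn = nn.GRU(input_size=dim, hidden_size=hidden, num_layers=2, batch_first=True)
proj = nn.Linear(hidden, dim)

x = torch.randn(4, 200, dim)                        # (batch, frames, log-mel dims)
h, _ = rnn(x[:, :-n_shift])                         # encode all but the last n frames
pred = proj(h)                                      # predicted future frames
loss = nn.functional.l1_loss(pred, x[:, n_shift:])  # compare with frames n steps ahead
loss.backward()

# After training, the RNN hidden states `h` serve as the speech representation.
```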
3 code implementations • 21 Feb 2019 • Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models.
2 code implementations • ICLR 2019 • Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
no code implementations • 12 Sep 2018 • Suwon Shon, Wei-Ning Hsu, James Glass
In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID).
no code implementations • 30 Aug 2018 • Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan
We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.
no code implementations • 13 Jun 2018 • Hao Tang, Wei-Ning Hsu, Francois Grondin, James Glass
Speech recognizers trained on close-talking speech do not generalize to distant speech and the word error rate degradation can be as large as 40% absolute.
no code implementations • 13 Jun 2018 • Wei-Ning Hsu, Hao Tang, James Glass
However, it is relatively inexpensive to collect large amounts of unlabeled data from domains that we want the models to generalize to.
no code implementations • 29 May 2018 • Wei-Ning Hsu, James Glass
In this paper, we present a partitioned variational autoencoder (PVAE) and several training objectives to learn disentangled representations, which encode not only the shared factors, but also modality-dependent ones, into separate latent variables.
2 code implementations • 9 Apr 2018 • Wei-Ning Hsu, James Glass
Deep generative models have achieved great success in unsupervised learning with the ability to capture complex nonlinear relationships between latent generating factors and observations.
no code implementations • 7 Mar 2018 • Wei-Ning Hsu, James Glass
The performance of automatic speech recognition (ASR) systems can be significantly compromised by previously unseen conditions, which is typically due to a mismatch between training and testing distributions.
3 code implementations • NeurIPS 2017 • Wei-Ning Hsu, Yu Zhang, James Glass
We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision.
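A hedged structural sketch of the factorized setup: each segment is encoded into two latent variables, one intended to capture factors that vary within an utterance and one intended to capture factors shared across it. The sequence-level prior and discriminative objective of the actual model are omitted; all sizes are illustrative.

```python
# Hedged structural sketch of a two-latent-variable VAE for speech segments:
# z1 is meant to vary within an utterance (e.g. content), z2 to stay constant
# across it (e.g. speaker/channel). Not the full FHVAE objective.
import torch
import torch.nn as nn

class TwoLatentVAE(nn.Module):
    def __init__(self, x_dim=200, z1_dim=16, z2_dim=16, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.to_z1 = nn.Linear(hidden, 2 * z1_dim)    # mean and log-variance
        self.to_z2 = nn.Linear(hidden, 2 * z2_dim)
        self.dec = nn.Sequential(nn.Linear(z1_dim + z2_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, x_dim))

    @staticmethod
    def reparameterize(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

    def forward(self, x):
        h = self.enc(x)
        z1, mu1, lv1 = self.reparameterize(self.to_z1(h))
        z2, mu2, lv2 = self.reparameterize(self.to_z2(h))
        recon = self.dec(torch.cat([z1, z2], dim=-1))
        kl = -0.5 * (1 + lv1 - mu1.pow(2) - lv1.exp()).sum(-1) \
             -0.5 * (1 + lv2 - mu2.pow(2) - lv2.exp()).sum(-1)
        return nn.functional.mse_loss(recon, x, reduction="none").sum(-1) + kl

model = TwoLatentVAE()
segments = torch.randn(32, 200)        # flattened 20-frame x 10-dim segments (toy)
loss = model(segments).mean()
loss.backward()
```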
no code implementations • 19 Jul 2017 • Wei-Ning Hsu, Yu Zhang, James Glass
Research on robust speech recognition can be regarded as trying to overcome this domain mismatch issue.
no code implementations • 13 Apr 2017 • Wei-Ning Hsu, Yu Zhang, James Glass
In this paper, we apply a convolutional VAE to model the generative process of natural speech.
no code implementations • COLING 2016 • Salvatore Romeo, Giovanni Da San Martino, Alberto Barrón-Cedeño, Alessandro Moschitti, Yonatan Belinkov, Wei-Ning Hsu, Yu Zhang, Mitra Mohtarami, James Glass
In real-world data, e.g., from Web forums, text is often contaminated with redundant or irrelevant content, which introduces noise into machine learning algorithms.
no code implementations • 23 Mar 2016 • Wei-Ning Hsu, Yu Zhang, James Glass
We apply a general recurrent neural network (RNN) encoder framework to community question answering (cQA) tasks.
no code implementations • 7 Sep 2015 • Cheng-Tao Chung, Wei-Ning Hsu, Cheng-Yi Lee, Lin-shan Lee
This paper presents a novel approach for enhancing the multiple sets of acoustic patterns automatically discovered from a given corpus.