Search Results for author: Wei-Ning Hsu

Found 50 papers, 21 papers with code

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

no code implementations 15 May 2022 Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu

This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs.

Representation Learning Speaker Verification

On-demand compute reduction with stochastic wav2vec 2.0

no code implementations 25 Apr 2022 Apoorv Vyas, Wei-Ning Hsu, Michael Auli, Alexei Baevski

Our results for models pre-trained on the 960h LibriSpeech dataset and fine-tuned on 10h of transcribed data show that, using the same stochastic model, we get a smooth trade-off between word error rate (WER) and inference time with only marginal WER degradation compared to the W2V2 and SEW models trained for a specific setting.

Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

no code implementations 6 Apr 2022 Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee

Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis.

Automatic Speech Recognition Data Augmentation +4

Towards End-to-end Unsupervised Speech Recognition

1 code implementation 5 Apr 2022 Alexander H. Liu, Wei-Ning Hsu, Michael Auli, Alexei Baevski

Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language.

Automatic Speech Recognition Unsupervised Speech Recognition

Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training

no code implementations 1 Mar 2022 Ramon Sanabria, Wei-Ning Hsu, Alexei Baevski, Michael Auli

In this paper, we present a controlled study to better understand the effect of individual domain factors on the performance of pre-trained representations.

Automatic Speech Recognition

textless-lib: a Library for Textless Spoken Language Processing

1 code implementation 15 Feb 2022 Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

Textless spoken language processing research aims to extend the applicability of the standard NLP toolset to spoken language and to languages with few or no textual resources.

Resynthesis

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

4 code implementations Preprint 2022 Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.

Image Classification Linguistic Acceptability +5

Robust Self-Supervised Audio-Visual Speech Recognition

1 code implementation 5 Jan 2022 Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.

Audio-Visual Speech Recognition Automatic Speech Recognition +3

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

1 code implementation ICLR 2022 Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed

The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.

Ranked #1 on Lipreading on LRS3-TED (using extra training data)

Automatic Speech Recognition Lipreading +2

Textless Speech-to-Speech Translation on Real Data

no code implementations 15 Dec 2021 Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu

To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.

Speech-to-Speech Translation Translation

Textless Speech Emotion Conversion using Discrete and Decomposed Representations

no code implementations 14 Nov 2021 Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.

Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention

no code implementations 15 Oct 2021 Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Phillip Koehn, Juan Pino

We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of translation is independent from intermediate text representations.

Speech Synthesis Speech-to-Speech Translation +1

Text-Free Prosody-Aware Generative Spoken Language Modeling

1 code implementation ACL 2022 Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu

Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences.

Language Modelling

Direct speech-to-speech translation with discrete units

no code implementations ACL 2022 Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu

When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.

Speech-to-Speech Translation Text Generation +1

Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

no code implementations 14 Jun 2021 Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised automatic speech recognition (ASR).

Frame Speech Recognition

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

4 code implementations 14 Jun 2021 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.

Ranked #3 on Speech Recognition on LibriSpeech test-other (using extra training data)

Representation Learning Speech Recognition

Unsupervised Speech Recognition

3 code implementations NeurIPS 2021 Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe.

Speech Recognition Unsupervised Speech Recognition

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

2 code implementations 2 Apr 2021 Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%.

Self-Supervised Learning

Generative Spoken Language Modeling from Raw Audio

2 code implementations 1 Feb 2021 Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation.

Language Modelling Resynthesis

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

no code implementations ACL 2021 Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision.

Image Captioning Speech Synthesis +1

Differentiable Weighted Finite-State Transducers

1 code implementation 2 Oct 2020 Awni Hannun, Vineel Pratap, Jacob Kahn, Wei-Ning Hsu

We introduce a framework for automatic differentiation with weighted finite-state transducers (WFSTs) allowing them to be used dynamically at training time.

Handwriting Recognition Speech Recognition

A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

no code implementations 3 Jun 2020 Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech.

Representation Learning Self-Supervised Learning +1

Semi-Supervised Speech Recognition via Local Prior Matching

1 code implementation 24 Feb 2020 Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun

For sequence transduction tasks like speech recognition, a strong structured prior model encodes rich information about the target space, implicitly ruling out invalid sequences by assigning them low probability.

Knowledge Distillation Speech Recognition

Self-Supervised Speech Recognition via Local Prior Matching

no code implementations 25 Sep 2019 Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun

We propose local prior matching (LPM), a self-supervised objective for speech recognition.

Speech Recognition

Transfer Learning from Audio-Visual Grounding to Speech Recognition

no code implementations 9 Jul 2019 Wei-Ning Hsu, David Harwath, James Glass

Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks.

Speech Recognition Transfer Learning +1

An Unsupervised Autoregressive Model for Speech Representation Learning

5 code implementations 5 Apr 2019 Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass

This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations.

General Classification Representation Learning +1

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

3 code implementations 21 Feb 2019 Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

Hierarchical Generative Modeling for Controllable Speech Synthesis

2 code implementations ICLR 2019 Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Speech Synthesis

Unsupervised Representation Learning of Speech for Dialect Identification

no code implementations 12 Sep 2018 Suwon Shon, Wei-Ning Hsu, James Glass

In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID).

Dialect Identification Disentanglement

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

no code implementations 30 Aug 2018 Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.

Speech Synthesis

A Study of Enhancement, Augmentation, and Autoencoder Methods for Domain Adaptation in Distant Speech Recognition

no code implementations 13 Jun 2018 Hao Tang, Wei-Ning Hsu, Francois Grondin, James Glass

Speech recognizers trained on close-talking speech do not generalize to distant speech and the word error rate degradation can be as large as 40% absolute.

Data Augmentation Distant Speech Recognition +2

Unsupervised Adaptation with Interpretable Disentangled Representations for Distant Conversational Speech Recognition

no code implementations 13 Jun 2018 Wei-Ning Hsu, Hao Tang, James Glass

However, it is relatively inexpensive to collect large amounts of unlabeled data from domains that we want the models to generalize to.

Automatic Speech Recognition

Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data

no code implementations 29 May 2018 Wei-Ning Hsu, James Glass

In this paper, we present a partitioned variational autoencoder (PVAE) and several training objectives to learn disentangled representations, which encode not only the shared factors, but also modality-dependent ones, into separate latent variables.

Representation Learning Variational Inference

Scalable Factorized Hierarchical Variational Autoencoder Training

2 code implementations 9 Apr 2018 Wei-Ning Hsu, James Glass

Deep generative models have achieved great success in unsupervised learning with the ability to capture complex nonlinear relationships between latent generating factors and observations.

Disentanglement Hyperparameter Optimization +4

Extracting Domain Invariant Features by Unsupervised Learning for Robust Automatic Speech Recognition

no code implementations 7 Mar 2018 Wei-Ning Hsu, James Glass

The performance of automatic speech recognition (ASR) systems can be significantly compromised by previously unseen conditions, which is typically due to a mismatch between training and testing distributions.

Automatic Speech Recognition

Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data

3 code implementations NeurIPS 2017 Wei-Ning Hsu, Yu Zhang, James Glass

We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision.

Automatic Speech Recognition Speaker Verification

Learning Latent Representations for Speech Generation and Transformation

no code implementations 13 Apr 2017 Wei-Ning Hsu, Yu Zhang, James Glass

In this paper, we apply a convolutional VAE to model the generative process of natural speech.

Enhancing Automatically Discovered Multi-level Acoustic Patterns Considering Context Consistency With Applications in Spoken Term Detection

no code implementations 7 Sep 2015 Cheng-Tao Chung, Wei-Ning Hsu, Cheng-Yi Lee, Lin-shan Lee

This paper presents a novel approach for enhancing the multiple sets of acoustic patterns automatically discovered from a given corpus.
