Search Results for author: Wei-Ning Hsu

Found 73 papers, 30 papers with code

fairseq Sˆ2: A Scalable and Integrable Speech Synthesis Toolkit

1 code implementation • EMNLP (ACL) 2021 • Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino

This paper presents fairseq Sˆ2, a fairseq extension for speech synthesis.

Speech Synthesis

29,183

Paper
Code

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

1 code implementation • 15 Apr 2024 • Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt.

Audio Generation

896

Paper
Code

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

no code implementations • 21 Mar 2024 • Hyojung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes.

Audio-Visual Speech Recognition Representation Learning +4

Paper
Add Code

Audiobox: Unified Audio Generation with Natural Language Prompts

no code implementations • 25 Dec 2023 • Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data.

Ranked #1 on Audio Generation on AudioCaps

AudioCaps Audio Generation +1

Paper
Add Code

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

no code implementations • 5 Nov 2023 • Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders.

Quantization

Paper
Add Code

Generative Pre-training for Speech with Flow Matching

no code implementations • 25 Oct 2023 • Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data.

Speech Enhancement Speech Synthesis +1

Paper
Add Code

Toward Joint Language Modeling for Speech Units and Text

no code implementations • 12 Oct 2023 • Ju-chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

However, in the field of language modeling, very little effort has been made to model them jointly.

Language Modelling Spoken Language Understanding

Paper
Add Code

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

no code implementations • 29 Sep 2023 • Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-Yi Lee, Abdelrahman Mohamed

Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks.

Self-Supervised Learning

Paper
Add Code

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

no code implementations • 10 Aug 2023 • Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization).

Resynthesis Speech Synthesis

Paper
Add Code

Scaling Speech Technology to 1,000+ Languages

3 code implementations • arXiv 2023 • Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

Expanding the language coverage of speech technology has the potential to improve access to information for many more people.

Automatic Speech Recognition Language Identification +4

29,185

Paper
Code

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

1 code implementation • NeurIPS 2023 • Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering.

Clustering Language Modelling +3

Paper
Code

Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech

no code implementations • 20 Mar 2023 • Maryam Fazel-Zarandi, Wei-Ning Hsu

Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data.

Self-Supervised Learning

Paper
Add Code

MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

1 code implementation • 1 Mar 2023 • Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang

We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages.

Audio-Visual Speech Recognition Robust Speech Recognition +4

335

Paper
Code

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

no code implementations • 10 Feb 2023 • Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems.

Audio-Visual Speech Recognition Self-Supervised Learning +2

Paper
Add Code

Scaling Laws for Generative Mixed-Modal Language Models

no code implementations • 10 Jan 2023 • Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens.

Paper
Add Code

ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration

no code implementations • CVPR 2023 • Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi

Moreover, we utilize self-supervised audio-visual speech model to initialize P-AVSR.

Audio-Visual Speech Recognition Resynthesis +5

Paper
Add Code

ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement

no code implementations • 21 Dec 2022 • Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi

Moreover, we utilize self-supervised audio-visual speech model to initialize P-AVSR.

Ranked #1 on Speech Recognition on EasyCom

Audio-Visual Speech Recognition Resynthesis +6

Paper
Add Code

Efficient Speech Representation Learning with Low-Bit Quantization

no code implementations • 14 Dec 2022 • Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Abdelrahman Mohamed

With the development of hardware for machine learning, newer models often come at the cost of both increased sizes and computational complexity.

Model Compression Quantization +1

Paper
Add Code

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

3 code implementations • 14 Dec 2022 • Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources.

Ranked #91 on Image Classification on ImageNet

Image Classification Natural Language Understanding +3

29,186

Paper
Code

Continual Learning for On-Device Speech Recognition using Disentangled Conformers

no code implementations • 2 Dec 2022 • Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed

Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Speech-to-Speech Translation For A Real-world Unwritten Language

no code implementations • arXiv 2022 • Peng-Jen Chen, Kevin Tran, Yilin Yang, Jingfei Du, Justine Kao, Yu-An Chung, Paden Tomasello, Paul-Ambroise Duquenne, Holger Schwenk, Hongyu Gong, Hirofumi Inaguma, Sravya Popuri, Changhan Wang, Juan Pino, Wei-Ning Hsu, Ann Lee

We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.

Speech-to-Speech Translation Translation

Paper
Add Code

Simple and Effective Unsupervised Speech Translation

no code implementations • 18 Oct 2022 • Changhan Wang, Hirofumi Inaguma, Peng-Jen Chen, Ilia Kulikov, Yun Tang, Wei-Ning Hsu, Michael Auli, Juan Pino

The amount of labeled data to train models for speech tasks is limited for most languages, however, the data scarcity is exacerbated for speech translation which requires labeled data covering two different languages.

Machine Translation speech-recognition +6

Paper
Add Code

u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

1 code implementation • 14 Jul 2022 • Wei-Ning Hsu, Bowen Shi

By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par or better than the state-of-the-art modality-specific models.

Speaker Verification speech-recognition +1

776

Paper
Code

STOP: A dataset for Spoken Task Oriented Semantic Parsing

1 code implementation • 29 Jun 2022 • Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Ahn Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed

Furthermore, in addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

29,185

Paper
Code

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

1 code implementation • 15 May 2022 • Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu

This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs.

Representation Learning Speaker Verification

776

Paper
Code

On-demand compute reduction with stochastic wav2vec 2.0

no code implementations • 25 Apr 2022 • Apoorv Vyas, Wei-Ning Hsu, Michael Auli, Alexei Baevski

Our results for models pre-trained on 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that using the same stochastic model, we get a smooth trade-off between word error rate (WER) and inference time with only marginal WER degradation compared to the W2V2 and SEW models trained for a specific setting.

Paper
Add Code

Unified Speech-Text Pre-training for Speech Translation and Recognition

no code implementations • ACL 2022 • Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino

Two pre-training configurations for speech translation and recognition, respectively, are presented to alleviate subtask interference.

speech-recognition Speech Recognition +1

Paper
Add Code

Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

no code implementations • 6 Apr 2022 • Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee

Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +6

Paper
Add Code

Simple and Effective Unsupervised Speech Synthesis

no code implementations • 6 Apr 2022 • Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.

speech-recognition Speech Recognition +2

Paper
Add Code

Towards End-to-end Unsupervised Speech Recognition

1 code implementation • 5 Apr 2022 • Alexander H. Liu, Wei-Ning Hsu, Michael Auli, Alexei Baevski

Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

29,183

Paper
Code

Generative Spoken Dialogue Language Modeling

no code implementations • 30 Mar 2022 • Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues.

Language Modelling

Paper
Add Code

Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training

no code implementations • 1 Mar 2022 • Ramon Sanabria, Wei-Ning Hsu, Alexei Baevski, Michael Auli

In this paper, we present a controlled study to better understand the effect of such factors on the performance of pre-trained representations on automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

textless-lib: a Library for Textless Spoken Language Processing

1 code implementation • NAACL (ACL) 2022 • Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources.

Resynthesis

496

Paper
Code

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

9 code implementations • Preprint 2022 • Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.

Ranked #1 on Paraphrase Identification on Quora Question Pairs (Accuracy metric)

Image Classification Linguistic Acceptability +5

124,489

Paper
Code

Robust Self-Supervised Audio-Visual Speech Recognition

1 code implementation • 5 Jan 2022 • Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.

Ranked #2 on Audio-Visual Speech Recognition on LRS3-TED (using extra training data)

Audio-Visual Speech Recognition Automatic Speech Recognition +5

776

Paper
Code

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

2 code implementations • ICLR 2022 • Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed

The lip-reading WER is further reduced to 26. 9% when using all 433 hours of labeled data from LRS3 and combined with self-training.

Ranked #1 on Speech Recognition on LRS3-TED (using extra training data)

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

776

Paper
Code

Textless Speech-to-Speech Translation on Real Data

no code implementations • NAACL 2022 • Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu

To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.

Speech-to-Speech Translation Translation

Paper
Add Code

Textless Speech Emotion Conversion using Decomposed & Discrete Representations

no code implementations • arXiv 2021 • Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.

Paper
Add Code

Textless Speech Emotion Conversion using Discrete and Decomposed Representations

no code implementations • 14 Nov 2021 • Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.

Paper
Add Code

Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention

no code implementations • 15 Oct 2021 • Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Phillip Koehn, Juan Pino

We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model, Furthermore, the generation of translation is independent from intermediate text representations.

Speech Synthesis Speech-to-Speech Translation +1

Paper
Add Code

fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit

4 code implementations • 14 Sep 2021 • Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino

This paper presents fairseq S^2, a fairseq extension for speech synthesis.

Speech Synthesis

29,183

Paper
Code

Text-Free Prosody-Aware Generative Spoken Language Modeling

1 code implementation • ACL 2022 • Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu

Generative Spoken Language Modeling (GSLM) \cite{Lakhotia2021} is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences.

Language Modelling

29,183

Paper
Code

Direct speech-to-speech translation with discrete units

1 code implementation • ACL 2022 • Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu

When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.

Speech-to-Speech Translation Text Generation +1

157

Paper
Code

Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

no code implementations • 14 Jun 2021 • Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR).

speech-recognition Speech Recognition

Paper
Add Code

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

8 code implementations • 14 Jun 2021 • Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation.

Ranked #4 on Speech Recognition on LibriSpeech test-other

Clustering Language Modelling +2

124,457

Paper
Code

Unsupervised Speech Recognition

4 code implementations • NeurIPS 2021 • Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe.

speech-recognition Speech Recognition +1

29,185

Paper
Code

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

3 code implementations • 2 Apr 2021 • Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%.

Self-Supervised Learning

29,183

Paper
Code

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

2 code implementations • 1 Apr 2021 • Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

We propose using self-supervised discrete representations for the task of speech resynthesis.

Disentanglement Resynthesis +2

353

Paper
Code

Generative Spoken Language Modeling from Raw Audio

2 code implementations • 1 Feb 2021 • Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Adelrahman Mohamed, Emmanuel Dupoux

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation.

Ranked #1 on Resynthesis on LibriSpeech

Language Modelling Resynthesis

29,188

Paper
Code

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

no code implementations • ACL 2021 • Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision.

Image Captioning Speech Synthesis +1

Paper
Add Code

Differentiable Weighted Finite-State Transducers

1 code implementation • 2 Oct 2020 • Awni Hannun, Vineel Pratap, Jacob Kahn, Wei-Ning Hsu

We introduce a framework for automatic differentiation with weighted finite-state transducers (WFSTs) allowing them to be used dynamically at training time.

Handwriting Recognition speech-recognition +1

112

Paper
Code

A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

no code implementations • 3 Jun 2020 • Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech.

Representation Learning Self-Supervised Learning +1

Paper
Add Code

Semi-Supervised Speech Recognition via Local Prior Matching

1 code implementation • 24 Feb 2020 • Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun

For sequence transduction tasks like speech recognition, a strong structured prior model encodes rich information about the target space, implicitly ruling out invalid sequences by assigning them low probability.

Ranked #45 on Speech Recognition on LibriSpeech test-other

Knowledge Distillation Language Modelling +2

6,331

Paper
Code

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

1 code implementation • ICLR 2020 • David Harwath, Wei-Ning Hsu, James Glass

What differentiates this paper from prior work on speech unit learning is the choice of training objective.

Image Retrieval Quantization +1

Paper
Code

Self-Supervised Speech Recognition via Local Prior Matching

no code implementations • 25 Sep 2019 • Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Hannun

We propose local prior matching (LPM), a self-supervised objective for speech recognition.

Language Modelling speech-recognition +1

Paper
Add Code

Transfer Learning from Audio-Visual Grounding to Speech Recognition

no code implementations • 9 Jul 2019 • Wei-Ning Hsu, David Harwath, James Glass

Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks.

speech-recognition Speech Recognition +2

Paper
Add Code

An Unsupervised Autoregressive Model for Speech Representation Learning

5 code implementations • 5 Apr 2019 • Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass

This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations.

General Classification Representation Learning +1

184

Paper
Code

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

2 code implementations • 21 Feb 2019 • Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

2,781

Paper
Code

Hierarchical Generative Modeling for Controllable Speech Synthesis

2 code implementations • ICLR 2019 • Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Attribute Speech Synthesis

Paper
Code

Unsupervised Representation Learning of Speech for Dialect Identification

no code implementations • 12 Sep 2018 • Suwon Shon, Wei-Ning Hsu, James Glass

In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID).

Dialect Identification Disentanglement

Paper
Add Code

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

no code implementations • 30 Aug 2018 • Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.

Speech Synthesis

Paper
Add Code

A Study of Enhancement, Augmentation, and Autoencoder Methods for Domain Adaptation in Distant Speech Recognition

no code implementations • 13 Jun 2018 • Hao Tang, Wei-Ning Hsu, Francois Grondin, James Glass

Speech recognizers trained on close-talking speech do not generalize to distant speech and the word error rate degradation can be as large as 40% absolute.

Data Augmentation Distant Speech Recognition +3

Paper
Add Code

Unsupervised Adaptation with Interpretable Disentangled Representations for Distant Conversational Speech Recognition

no code implementations • 13 Jun 2018 • Wei-Ning Hsu, Hao Tang, James Glass

However, it is relatively inexpensive to collect large amounts of unlabeled data from domains that we want the models to generalize to.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data

no code implementations • 29 May 2018 • Wei-Ning Hsu, James Glass

In this paper, we present a partitioned variational autoencoder (PVAE) and several training objectives to learn disentangled representations, which encode not only the shared factors, but also modality-dependent ones, into separate latent variables.

Representation Learning Variational Inference

Paper
Add Code

Scalable Factorized Hierarchical Variational Autoencoder Training

2 code implementations • 9 Apr 2018 • Wei-Ning Hsu, James Glass

Deep generative models have achieved great success in unsupervised learning with the ability to capture complex nonlinear relationships between latent generating factors and observations.

Disentanglement Hyperparameter Optimization +5

Paper
Code

Extracting Domain Invariant Features by Unsupervised Learning for Robust Automatic Speech Recognition

no code implementations • 7 Mar 2018 • Wei-Ning Hsu, James Glass

The performance of automatic speech recognition (ASR) systems can be significantly compromised by previously unseen conditions, which is typically due to a mismatch between training and testing distributions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data

3 code implementations • NeurIPS 2017 • Wei-Ning Hsu, Yu Zhang, James Glass

We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

149

Paper
Code

Unsupervised Domain Adaptation for Robust Speech Recognition via Variational Autoencoder-Based Data Augmentation

no code implementations • 19 Jul 2017 • Wei-Ning Hsu, Yu Zhang, James Glass

Research on robust speech recognition can be regarded as trying to overcome this domain mismatch issue.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Learning Latent Representations for Speech Generation and Transformation

no code implementations • 13 Apr 2017 • Wei-Ning Hsu, Yu Zhang, James Glass

In this paper, we apply a convolutional VAE to model the generative process of natural speech.

Paper
Add Code

Neural Attention for Learning to Rank Questions in Community Question Answering

no code implementations • COLING 2016 • Salvatore Romeo, Giovanni Da San Martino, Alberto Barr{\'o}n-Cede{\~n}o, Aless Moschitti, ro, Yonatan Belinkov, Wei-Ning Hsu, Yu Zhang, Mitra Mohtarami, James Glass

In real-world data, e. g., from Web forums, text is often contaminated with redundant or irrelevant content, which leads to introducing noise in machine learning algorithms.

Community Question Answering Learning-To-Rank +3