Search Results for author: Ron J. Weiss

Found 28 papers, 17 papers with code

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

11 code implementations • NeurIPS 2018 • Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Speaker Verification Speech Synthesis +3

50,652

Paper
Code

Tacotron: Towards End-to-End Speech Synthesis

29 code implementations • 29 Mar 2017 • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Ranked #5 on Speech Synthesis on North American English

Audio Synthesis Speech Synthesis +1

50,652

Paper
Code

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

30 code implementations • 16 Dec 2017 • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.

Ranked #2 on Speech Synthesis on North American English

Speech Synthesis

28,936

Paper
Code

WaveGrad: Estimating Gradients for Waveform Generation

7 code implementations • ICLR 2021 • Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan

This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density.

Speech Synthesis Text-To-Speech Synthesis

28,936

Paper
Code

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

4 code implementations • 9 Jul 2019 • Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages.

Speech Synthesis Voice Cloning

10,084

Paper
Code

CNN Architectures for Large-Scale Audio Classification

16 code implementations • 29 Sep 2016 • Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio.

Audio Classification Event Detection +1

2,968

Paper
Code

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

2 code implementations • 21 Feb 2019 • Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, Bowen Liang, HyoukJoong Lee, Ciprian Chelba, Sébastien Jean, Bo Li, Melvin Johnson, Rohan Anil, Rajat Tibrewal, Xiaobing Liu, Akiko Eriguchi, Navdeep Jaitly, Naveen Ari, Colin Cherry, Parisa Haghani, Otavio Good, Youlong Cheng, Raziel Alvarez, Isaac Caswell, Wei-Ning Hsu, Zongheng Yang, Kuan-Chieh Wang, Ekaterina Gonina, Katrin Tomanek, Ben Vanik, Zelin Wu, Llion Jones, Mike Schuster, Yanping Huang, Dehao Chen, Kazuki Irie, George Foster, John Richardson, Klaus Macherey, Antoine Bruguier, Heiga Zen, Colin Raffel, Shankar Kumar, Kanishka Rao, David Rybach, Matthew Murray, Vijayaditya Peddinti, Maxim Krikun, Michiel A. U. Bacchiani, Thomas B. Jablin, Rob Suderman, Ian Williams, Benjamin Lee, Deepti Bhatia, Justin Carlson, Semih Yavuz, Yu Zhang, Ian McGraw, Max Galkin, Qi Ge, Golan Pundak, Chad Whipkey, Todd Wang, Uri Alon, Dmitry Lepikhin, Ye Tian, Sara Sabour, William Chan, Shubham Toshniwal, Baohua Liao, Michael Nirschl, Pat Rondon

Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models.

Sequence-To-Sequence Speech Recognition

2,781

Paper
Code

State-of-the-art Speech Recognition With Sequence-to-Sequence Models

4 code implementations • 5 Dec 2017 • Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

574

Paper
Code

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

2 code implementations • ICML 2018 • RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.

Expressive Speech Synthesis

368

Paper
Code

Unsupervised speech representation learning using WaveNet autoencoders

5 code implementations • 25 Jan 2019 • Jan Chorowski, Ron J. Weiss, Samy Bengio, Aäron van den Oord

We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms.

Acoustic Unit Discovery Dimensionality Reduction +1

308

Paper
Code

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

5 code implementations • 5 Apr 2019 • Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu

This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use.

Sound Audio and Speech Processing

222

Paper
Code

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

4 code implementations • 11 Oct 2018 • Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker.

Speaker Recognition Speaker Separation +3

196

Paper
Code

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

3 code implementations • 17 Jun 2021 • Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform.

Speech Synthesis Text-To-Speech Synthesis

110

Paper
Code

Online and Linear-Time Attention by Enforcing Monotonic Alignments

2 code implementations • ICML 2017 • Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglas Eck

Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems.

Ranked #20 on Speech Recognition on TIMIT

Machine Translation Sentence +4

Paper
Code

Direct speech-to-speech translation with a sequence-to-sequence model

1 code implementation • 12 Apr 2019 • Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu

We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.

Speech Synthesis Speech-to-Speech Translation +3

Paper
Code

Hierarchical Generative Modeling for Controllable Speech Synthesis

2 code implementations • ICLR 2019 • Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Attribute Speech Synthesis

Paper
Code

Sequence-to-Sequence Models Can Directly Translate Foreign Speech

1 code implementation • 24 Mar 2017 • Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, Zhifeng Chen

We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another.

Machine Translation Sequence-To-Sequence Speech Recognition +2

Paper
Code

On Using Backpropagation for Speech Texture Generation and Voice Conversion

no code implementations • 22 Dec 2017 • Jan Chorowski, Ron J. Weiss, Rif A. Saurous, Samy Bengio

Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and target utterances.

Image Generation speech-recognition +4

Paper
Add Code

Multilingual Speech Recognition With A Single End-To-End Model

no code implementations • 6 Nov 2017 • Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, Kanishka Rao

Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

no code implementations • 5 Nov 2018 • Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, Yonghui Wu

In this paper, we demonstrate that using pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly supervised data into speech-to-translation pairs for ST training can be more effective than multi-task learning.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

A spelling correction model for end-to-end speech recognition

no code implementations • 19 Feb 2019 • Jinxi Guo, Tara N. Sainath, Ron J. Weiss

Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs.

Language Modelling speech-recognition +2

Paper
Add Code

Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

no code implementations • 6 Feb 2020 • Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model.

Disentanglement Speech Synthesis

Paper
Add Code

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

no code implementations • 6 Feb 2020 • Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Unsupervised Sound Separation Using Mixture Invariant Training

no code implementations • NeurIPS 2020 • Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey

In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources.

Speech Enhancement Speech Separation +1

Paper
Add Code

Multitask Training with Text Data for End-to-End Speech Recognition

no code implementations • 27 Oct 2020 • Peidong Wang, Tara N. Sainath, Ron J. Weiss

We propose a multitask training method for attention-based end-to-end speech recognition models.

Language Modelling speech-recognition +1

Paper
Add Code

Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

no code implementations • 6 Nov 2020 • Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.

Speech Synthesis Text-To-Speech Synthesis

Paper
Add Code

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

no code implementations • 1 Jun 2021 • Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation.

Paper
Add Code

G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

no code implementations • 19 Oct 2022 • Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training.

Ranked #1 on Speech Recognition on CHiME-6 eval

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.