TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Speech Synthesis	Mandarin Chinese	WaveNet (L+F)	Mean Opinion Score	4.08	# 1
Speech Synthesis	Mandarin Chinese	HMM-driven concatenative	Mean Opinion Score	3.47	# 3
Speech Synthesis	Mandarin Chinese	LSTM-RNN parametric	Mean Opinion Score	3.79	# 2
Speech Synthesis	North American English	HMM-driven concatenative	Mean Opinion Score	3.86	# 5
Speech Synthesis	North American English	LSTM-RNN parametric	Mean Opinion Score	3.67	# 6
Speech Synthesis	North American English	WaveNet (L+F)	Mean Opinion Score	4.21	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/wavenet-a-generative-model-for-raw-audio/speech-synthesis-on-mandarin-chinese)](https://paperswithcode.com/sota/speech-synthesis-on-mandarin-chinese?p=wavenet-a-generative-model-for-raw-audio)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/wavenet-a-generative-model-for-raw-audio/speech-synthesis-on-north-american-english)](https://paperswithcode.com/sota/speech-synthesis-on-north-american-english?p=wavenet-a-generative-model-for-raw-audio)`

WaveNet: A Generative Model for Raw Audio

12 Sep 2016 · Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
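
The autoregressive structure described in the abstract, where each audio sample is conditioned on all previous ones, corresponds to the standard chain-rule factorization of a joint distribution. The notation below ($x_t$ for the sample at time step $t$, $T$ for the number of samples in the waveform) is introduced here for illustration:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Generation under such a factorization is sequential: each $x_t$ is drawn from its conditional distribution and then fed back as context for the next step.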

Code

Repository	Stars
ibab/tensorflow-wavenet	5,400
awslabs/gluon-ts	4,268
basveeling/wavenet	1,059
vincentherrmann/pytorch-wavenet	945
mindspore-ai/models	334

See all 60 implementations

Tasks

Audio Generation

Speech Synthesis

Results from the Paper

Ranked #1 on Speech Synthesis on Mandarin Chinese

Methods

Causal Convolution • Dilated Causal Convolution • Mixture of Logistic Distributions • WaveNet
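
As a rough illustration of the first two methods listed above, the following is a minimal pure-Python sketch of a 1-D dilated causal convolution. It is not the authors' implementation: the function names `dilated_causal_conv1d` and `receptive_field` are hypothetical, the input is implicitly zero-padded on the left, and everything beyond the convolution itself (nonlinearities, any other layers, the output distribution) is omitted.

```python
def dilated_causal_conv1d(signal, kernel, dilation=1):
    """Causal convolution: out[t] depends only on signal[t], signal[t - d], ...

    out[t] = sum_k kernel[k] * signal[t - k * dilation],
    treating samples before the start of the signal as zero.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # never look at future samples or before the start
                acc += weight * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of a stack of such layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    impulse = [1, 0, 0, 0, 0]
    # With dilation 2, the impulse reappears two steps later, never earlier.
    print(dilated_causal_conv1d(impulse, [1, 1], dilation=2))
    # Doubling dilations 1, 2, 4, ..., 512 with kernel size 2.
    print(receptive_field(2, [2 ** i for i in range(10)]))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of weights grows only linearly, which is the usual motivation for stacking dilated layers.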
