Audio Generation

60 papers with code • 3 benchmarks • 8 datasets

Audio generation (synthesis) is the task of generating raw audio such as speech.


Most implemented papers

WaveNet: A Generative Model for Raw Audio

ibab/tensorflow-wavenet 12 Sep 2016

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.
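The WaveNet paper models each waveform sample as one of 256 discrete values, obtained with mu-law companding followed by quantization. The sketch below illustrates that encoding step only, not the network. It assumes samples are floats in [-1, 1], and the function names are hypothetical.

```python
import math

def mu_law_encode(x: float, mu: int = 255) -> int:
    """Compress a sample in [-1, 1] with mu-law, then quantize it to one of mu + 1 integer bins."""
    x = max(-1.0, min(1.0, x))  # clip to the assumed valid range
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Map the compressed value from [-1, 1] onto the integers {0, ..., mu}.
    return int(round((compressed + 1) / 2 * mu))

def mu_law_decode(q: int, mu: int = 255) -> float:
    """Undo the quantization and the mu-law compression (lossy: returns the bin's representative value)."""
    compressed = 2 * q / mu - 1
    return math.copysign(((1 + mu) ** abs(compressed) - 1) / mu, compressed)

print(mu_law_encode(0.5), mu_law_decode(mu_law_encode(0.5)))
```

The logarithmic compression allocates more bins to quiet amplitudes than uniform quantization would. This keeps 256 classes usable for the per-sample prediction.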

Adversarial Audio Synthesis

chrisdonahue/wavegan ICLR 2019

Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a range of timescales.

GANSynth: Adversarial Neural Audio Synthesis

tensorflow/magenta ICLR 2019

Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence.

It's Raw! Audio Generation with State-Space Models

hazyresearch/state-spaces 20 Feb 2022

SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting.

MelNet: A Generative Model for Audio in the Frequency Domain

fatchord/MelNet 4 Jun 2019

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps.
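The "tens of thousands of timesteps" figure follows directly from the sample rate. The arithmetic below shows it for a few common rates. It also shows how many frames a time-frequency representation would need for the same audio, assuming a hop of 256 samples. The hop size and the sample rates are illustrative assumptions, not settings taken from the MelNet paper.

```python
# Illustrative sample rates in Hz (assumed for this example, not from the MelNet paper).
SAMPLE_RATES_HZ = {"telephone speech": 8_000, "wideband speech": 16_000, "CD audio": 44_100}

def timesteps(duration_s: float, sample_rate_hz: int) -> int:
    """Number of waveform samples a model must generate for a clip of the given length."""
    return round(duration_s * sample_rate_hz)

def frames(duration_s: float, sample_rate_hz: int, hop: int = 256) -> int:
    """Approximate number of spectrogram frames for the same clip, given a hypothetical hop size."""
    return timesteps(duration_s, sample_rate_hz) // hop

for name, sr in SAMPLE_RATES_HZ.items():
    print(f"{name}: {timesteps(1.0, sr):,} samples/s vs about {frames(1.0, sr)} frames/s")
```

With these assumptions, one second at 16 kHz has 16,000 waveform timesteps but only 62 frames at a 256-sample hop. This gap is why modelling in the frequency domain shortens the sequence along the time axis.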

AudioLM: a Language Modeling Approach to Audio Generation

suno-ai/bark 7 Sep 2022

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

soroushmehr/sampleRNN_ICLR2017 22 Dec 2016

In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time.
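Generating "one audio sample at a time" means autoregressive sampling. Each new quantized sample is drawn from a predicted distribution conditioned on every sample generated so far. The loop below sketches that idea only. `toy_predictor` is a hypothetical stand-in for a trained network, and it is not SampleRNN's multi-tier recurrent architecture.

```python
import random
from typing import Callable, List, Sequence

# A predictor maps the history of generated samples to weights over the next sample's values.
Predictor = Callable[[Sequence[int]], List[float]]

def generate(predict: Predictor, n_samples: int, n_levels: int = 256, seed: int = 0) -> List[int]:
    """Autoregressively draw one quantized sample at a time, each conditioned on all previous ones."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    history: List[int] = []
    for _ in range(n_samples):
        weights = predict(history)  # unnormalized distribution over n_levels values
        if len(weights) != n_levels:
            raise ValueError("predictor must return one weight per quantization level")
        next_sample = rng.choices(range(n_levels), weights=weights, k=1)[0]
        history.append(next_sample)  # the new sample becomes context for the next step
    return history

def toy_predictor(history: Sequence[int], n_levels: int = 256) -> List[float]:
    """Hypothetical stand-in for a trained model: favors values close to the previous sample."""
    prev = history[-1] if history else n_levels // 2
    return [1.0 / (1 + abs(v - prev)) for v in range(n_levels)]

samples = generate(toy_predictor, n_samples=100)
print(samples[:10])
```

The loop runs once per output sample. This is why sample-level models are slow at generation time: one second of 16 kHz audio requires 16,000 sequential predictor calls.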

Audio Super Resolution using Neural Networks

kuleshov/audio-super-res 2 Aug 2017

We introduce a new audio processing technique that increases the sampling rate of signals such as speech or music using deep convolutional neural networks.
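Increasing the sampling rate by an integer factor can be done naively by interpolating between existing samples. The sketch below shows that naive baseline, using linear interpolation. It is not the convolutional network from the paper. Interpolation cannot restore the high-frequency detail lost when the signal was downsampled, and that gap is the one a learned model tries to close. The function name is hypothetical.

```python
from typing import List, Sequence

def upsample_linear(signal: Sequence[float], factor: int) -> List[float]:
    """Raise the sampling rate by an integer factor by interpolating linearly between neighbors."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out: List[float] = []
    for i in range(len(signal) - 1):
        a, b = signal[i], signal[i + 1]
        # Emit the original sample plus (factor - 1) evenly spaced points before the next one.
        out.extend(a + (b - a) * k / factor for k in range(factor))
    if signal:
        out.append(signal[-1])  # keep the final original sample
    return out

print(upsample_linear([0.0, 1.0, 0.0], 4))
```

An input of length n produces (n - 1) * factor + 1 samples, so the endpoints of the original signal are preserved exactly.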

Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders

acids-ircam/Expressive_WAE_FADER 12 Apr 2019

The model's training data subsets can be visualized directly in its 3D latent representation.

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

liusongxiang/StarGAN-Voice-Conversion NeurIPS 2019

End-to-end models for raw audio generation are challenging to build, especially when they must work with non-parallel data, which is a desirable setup in many situations.