Audio Generation

63 papers with code • 3 benchmarks • 8 datasets

Audio generation (synthesis) is the task of generating raw audio such as speech.

(Image credit: MelNet)

Benchmarks

These leaderboards are used to track progress in Audio Generation.

Dataset	Best Model
AudioCaps	Audiobox
Classical music, 5 seconds at 12 kHz	Sparse Transformer 152M (strided)
Symphony music	SymphonyNet

Most implemented papers

DDSP: Differentiable Digital Signal Processing

magenta/ddsp • ICLR 2020

In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods.
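
As a rough illustration of what a "classic signal processing element" can look like when written as plain array operations, here is a minimal sketch of a harmonic (additive) oscillator in NumPy. It is not the DDSP library's API: the function name `harmonic_synth`, its parameters, and the 16 kHz sample rate are assumptions made for this example. Because the synthesis uses only multiplications, a cumulative sum, and sines, the same computation could in principle be expressed in an autodiff framework and trained end to end.

```python
import numpy as np

def harmonic_synth(f0_hz, harmonic_amps, sample_rate=16000):
    """Additive synthesis: sum sinusoids at integer multiples of f0.

    Hypothetical helper for illustration only (not part of magenta/ddsp).
    f0_hz:         shape (n_samples,), fundamental frequency per sample in Hz.
    harmonic_amps: shape (n_samples, n_harmonics), amplitude per harmonic.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    harmonic_amps = np.asarray(harmonic_amps, dtype=np.float64)
    n_harmonics = harmonic_amps.shape[1]
    harmonic_numbers = np.arange(1, n_harmonics + 1)

    # Frequency of every harmonic at every sample: (n_samples, n_harmonics).
    freqs = f0_hz[:, None] * harmonic_numbers[None, :]

    # Silence harmonics at or above the Nyquist frequency to avoid aliasing.
    amps = np.where(freqs < sample_rate / 2, harmonic_amps, 0.0)

    # Integrate instantaneous frequency over time to obtain phase.
    phases = 2.0 * np.pi * np.cumsum(freqs / sample_rate, axis=0)

    # Sum the harmonics into a single raw waveform.
    return np.sum(amps * np.sin(phases), axis=1)

# Usage: one second of a 220 Hz tone with three decaying harmonics.
sr = 16000
f0 = np.full(sr, 220.0)
amps = np.tile(np.array([0.5, 0.25, 0.125]), (sr, 1))
audio = harmonic_synth(f0, amps, sr)
```

In a learned setting, a neural network would predict `f0_hz` and `harmonic_amps` per frame, and the oscillator would turn those controls into audio.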

Taming Visually Guided Sound Generation

v-iashin/SpecVQGAN • 17 Oct 2021

In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU.

Differentiable Time-Frequency Scattering on GPU

cyrusvahidi/kymatio-wavespin • 18 Apr 2022

Joint time-frequency scattering (JTFS) is a convolutional operator in the time-frequency domain which extracts spectrotemporal modulations at various rates and scales.

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

nvidia/bigvgan • 9 Jun 2022

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments.

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

haoheliu/AudioLDM • 29 Jan 2023

By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency.

GACELA -- A generative adversarial context encoder for long audio inpainting

andimarafioti/GACELA • 11 May 2020

We introduce GACELA, a generative adversarial network (GAN) designed to restore missing musical audio data with a duration ranging between hundreds of milliseconds to a few seconds, i.e., to perform long-gap audio inpainting.

HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement

andreevp/wvmos • 24 Mar 2022

Generative adversarial networks have recently demonstrated outstanding performance in neural vocoding outperforming best autoregressive and flow-based models.

Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert

serp-ai/bark-with-voice-clone • Social Science Research Network (SSRN) 2023

Keywords: Bark, ai voice cloning, Suno, text-to-speech, artificial intelligence, audio generation, Meta's encodec, audio codebooks, semantic tokens, HuBert, transformer-based model, multilingual speech, wav2vec, linear projection head, embedding space, generative capabilities, pretrained model checkpoints

Invisible Watermarking for Audio Generation Diffusion Models

xirongc/watermark-audio-diffusion • 22 Sep 2023

Diffusion models have gained prominence in the image domain for their capabilities in data generation and transformation, achieving state-of-the-art performance in various tasks in both image and audio domains.

Fast Timing-Conditioned Latent Audio Diffusion

stability-ai/stable-audio-tools • 7 Feb 2024

Generating long-form 44.1kHz stereo audio from text prompts can be computationally demanding.
