Audio Generation

64 papers with code • 3 benchmarks • 8 datasets

Audio generation (synthesis) is the task of generating raw audio such as speech.

(Image credit: MelNet)

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

heng-hw/V2A-Mapper 18 Aug 2023

In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM.

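The core of V2A-Mapper is a small translator network that maps embeddings from a frozen vision foundation model (CLIP) into the conditioning space of a frozen audio generator (CLAP/AudioLDM), so only the mapper needs training. A minimal PyTorch sketch of such a mapper is below; the MLP layout, dimensions, and names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class V2AMapper(nn.Module):
    """Hypothetical lightweight mapper: translates a CLIP image embedding into a
    CLAP-style audio embedding that a pretrained audio generator (e.g. AudioLDM)
    can consume as conditioning. Dimensions and layout are assumptions."""
    def __init__(self, clip_dim=512, clap_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, clap_dim),
        )

    def forward(self, clip_embedding):
        return self.net(clip_embedding)

# Usage sketch: only the mapper is trained; CLIP, CLAP and AudioLDM stay frozen.
mapper = V2AMapper()
fake_clip_embeddings = torch.randn(4, 512)   # stand-in for frozen CLIP outputs
clap_like = mapper(fake_clip_embeddings)     # would condition the audio generator
```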

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

haoheliu/AudioLDM2 10 Aug 2023

Any audio can be translated into LOA (the "language of audio") based on AudioMAE, a self-supervised pre-trained representation learning model.

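For reference, AudioLDM 2 has an inference pipeline in Hugging Face diffusers; a minimal text-to-audio call looks roughly like the following. The checkpoint id, step count, and clip length are one plausible configuration rather than the repository's exact example.

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

# Assumed checkpoint id; see the haoheliu/AudioLDM2 repo for the official options.
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A dog barking in the distance while birds are chirping"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# The pipeline returns a 16 kHz waveform as a NumPy array.
scipy.io.wavfile.write("generated.wav", rate=16000, data=audio)
```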

MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

retrocirce/musicldm 3 Aug 2023

Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation.

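The beat-synchronous mixup strategies boil down to interpolating two training clips only after they have been tempo- and beat-aligned. A bare-bones sketch of the mixup arithmetic (assuming alignment already happened upstream) is shown below; it is an illustration, not the MusicLDM training code.

```python
import numpy as np

def beat_synchronous_mixup(wave_a, wave_b, alpha=0.4):
    """Mix two beat-aligned, equal-length waveforms with a Beta-sampled weight.
    Tempo estimation and time-stretching (the 'beat-synchronous' part) are
    assumed to have happened upstream; this shows only the mixup arithmetic."""
    assert wave_a.shape == wave_b.shape
    lam = np.random.beta(alpha, alpha)
    return lam * wave_a + (1.0 - lam) * wave_b, lam

# Toy usage with random signals standing in for two beat-aligned music clips.
a = np.random.randn(16000 * 10)
b = np.random.randn(16000 * 10)
mixed, lam = beat_synchronous_mixup(a, b)
```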

WavJourney: Compositional Audio Creation with Large Language Models

audio-agi/wavjourney 26 Jul 2023

Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text.

High-Fidelity Audio Compression with Improved RVQGAN

descriptinc/descript-audio-codec 11 Jun 2023 (NeurIPS 2023)

Language models have been successfully used to model natural signals, such as images, speech, and music.

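The codec behind this paper quantizes latents with a residual vector quantizer (RVQ): each codebook encodes whatever residual the previous codebooks left behind. A minimal NumPy sketch of RVQ encoding follows; the Improved RVQGAN itself adds factorized, normalized codes and adversarial training on top of this basic idea.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual vector quantization of one latent vector z: every stage picks
    the nearest code to the *remaining residual*, so later codebooks refine
    what earlier ones could not capture."""
    quantized = np.zeros_like(z)
    codes = []
    for cb in codebooks:                                  # cb: (codebook_size, dim)
        residual = z - quantized
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
    return codes, quantized

# Toy usage: 4 codebooks of 1024 entries over a 64-dim latent.
rng = np.random.default_rng(0)
books = [rng.standard_normal((1024, 64)) for _ in range(4)]
codes, z_hat = rvq_encode(rng.standard_normal(64), books)
```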

MuseCoco: Generating Symbolic Music from Text

microsoft/muzic 31 May 2023

In contrast, symbolic music offers ease of editing, making it more accessible for users to manipulate specific musical elements.

An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization

kong13661/pia 26 May 2023

Therefore, we also explore the robustness of diffusion models to membership inference attacks (MIA) in the text-to-speech (TTS) task, which is an audio generation task.

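For intuition, a common baseline membership inference attack on diffusion models thresholds the model's noise-prediction error on a candidate sample; roughly speaking, the paper's proximal-initialization attack builds on this kind of error signal. The sketch below shows only the generic error-based score, with a stand-in model, and is not the paper's method.

```python
import torch

@torch.no_grad()
def mia_score(eps_model, x0, alpha_bar_t):
    """Generic error-threshold membership signal for a diffusion model:
    noise the sample to step t, ask the model for the noise, and score by the
    prediction error (training members tend to score lower)."""
    noise = torch.randn_like(x0)
    x_t = alpha_bar_t ** 0.5 * x0 + (1.0 - alpha_bar_t) ** 0.5 * noise
    pred = eps_model(x_t)                  # assumed to predict the added noise
    return torch.mean((pred - noise) ** 2).item()

# Toy usage with a stand-in "model": real attacks calibrate a threshold tau on
# known members/non-members and flag samples with score < tau as members.
dummy_eps_model = lambda x_t: torch.zeros_like(x_t)
score = mia_score(dummy_eps_model, torch.randn(1, 80, 100), alpha_bar_t=0.5)
```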

Any-to-Any Generation via Composable Diffusion

microsoft/i-Code 19 May 2023 (NeurIPS 2023)

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities.

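CoDi composes heterogeneous inputs by projecting each modality into a shared, aligned embedding space and mixing the embeddings before they condition the generators. A toy sketch of that composition step is below; the weighted-sum interpolation and the 768-dim size are illustrative assumptions, not the released implementation.

```python
import torch

def compose_conditions(embeddings, weights=None):
    """Toy composition of prompts from different modalities that have already
    been projected into one shared embedding space (CoDi aligns its prompt
    encoders so that this kind of interpolation is meaningful)."""
    if weights is None:
        weights = [1.0 / len(embeddings)] * len(embeddings)
    stacked = torch.stack(embeddings)              # (n_conditions, dim)
    w = torch.tensor(weights).unsqueeze(-1)        # (n_conditions, 1)
    return (w * stacked).sum(dim=0)                # (dim,)

# E.g. a text embedding plus an image embedding, both assumed to be 768-d and
# already aligned; the result conditions the target-modality diffuser.
text_emb, image_emb = torch.randn(768), torch.randn(768)
joint_condition = compose_conditions([text_emb, image_emb], weights=[0.5, 0.5])
```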

SoundStorm: Efficient Parallel Audio Generation

lucidrains/soundstorm-pytorch 16 May 2023

We present SoundStorm, a model for efficient, non-autoregressive audio generation.

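SoundStorm's efficiency comes from MaskGIT-style confidence-based parallel decoding over audio-codec tokens: start from a fully masked sequence and, over a few steps, commit the most confident predictions while re-masking the rest. A toy single-codebook version of that loop is sketched below with a random stand-in predictor; it illustrates the decoding schedule, not the released model.

```python
import torch

@torch.no_grad()
def parallel_decode(predict_logits, seq_len, steps=8, mask_id=-1):
    """MaskGIT-style confidence-based decoding (the scheme SoundStorm adapts to
    audio-codec tokens): begin fully masked and, at every step, keep the
    highest-confidence predictions while re-masking the remainder."""
    tokens = torch.full((seq_len,), mask_id)
    for step in range(steps):
        logits = predict_logits(tokens)            # (seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)
        conf[tokens != mask_id] = float("inf")     # already-committed tokens stay
        n_keep = max(1, int(seq_len * (step + 1) / steps))
        keep = conf.topk(n_keep).indices
        new_tokens = torch.full_like(tokens, mask_id)
        new_tokens[keep] = torch.where(tokens[keep] != mask_id,
                                       tokens[keep], pred[keep])
        tokens = new_tokens
    return tokens

# Toy usage: a random "predictor" over a 1024-entry codebook. A real system
# conditions this on semantic tokens derived from the prompt.
dummy_predictor = lambda toks: torch.randn(toks.shape[0], 1024)
codes = parallel_decode(dummy_predictor, seq_len=50)
```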

LooPy: A Research-Friendly Mix Framework for Music Information Retrieval on Electronic Dance Music

gariscat/loopy 1 May 2023

Music information retrieval (MIR) has gone through an explosive development with the advancement of deep learning in recent years.
