Audio Generation
65 papers with code • 3 benchmarks • 9 datasets
Audio generation (synthesis) is the task of generating raw audio such as speech.
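At its simplest, "raw audio" means a sequence of waveform samples. A minimal sketch (filename, sample rate, and tone frequency are illustrative choices, not from any paper above) of producing and saving such a waveform with only the Python standard library:

```python
import math
import struct
import wave

# Illustrative sketch: the most basic form of audio generation is producing
# a raw waveform sample-by-sample. Here we synthesize one second of a 440 Hz
# sine tone and save it as a 16-bit mono WAV file.
SAMPLE_RATE = 16000   # samples per second (common rate for speech models)
DURATION = 1.0        # seconds
FREQ = 440.0          # Hz

samples = [
    int(32767 * math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE))
    for n in range(int(SAMPLE_RATE * DURATION))
]

with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)             # mono
    wav.setsampwidth(2)             # 16-bit samples
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

Neural audio generators produce the same kind of sample sequence, but predict it with a learned model instead of a closed-form function.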
(Image credit: MelNet)
Latest papers
LooPy: A Research-Friendly Mix Framework for Music Information Retrieval on Electronic Dance Music
Music information retrieval (MIR) has undergone explosive development with the advancement of deep learning in recent years.
Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, which have significantly improved zero- and few-shot performance on many natural language processing (NLP) tasks.
Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert
Keywords: Bark, AI voice cloning, Suno, text-to-speech, artificial intelligence, audio generation, Meta's Encodec, audio codebooks, semantic tokens, HuBert, transformer-based model, multilingual speech, wav2vec, linear projection head, embedding space, generative capabilities, pretrained model checkpoints
Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
In this work, we concentrate on a rarely investigated problem of text guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals.
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions?
ArchiSound: Audio Generation with Diffusion
The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media generation.
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text–audio pairs, and the complexity of modeling long, continuous audio data.
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency.
AudioGen: Textually Guided Audio Generation
Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally.
AudioLM: a Language Modeling Approach to Audio Generation
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
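The language-modeling framing above treats audio as a sequence of discrete tokens sampled autoregressively. A toy sketch of that sampling loop (the "model" here is a random-logits stand-in, and the codebook size is an assumption, not AudioLM's actual architecture):

```python
import numpy as np

# Illustrative sketch: language-model-style audio generation samples discrete
# audio tokens one at a time, each conditioned on the tokens so far.
VOCAB_SIZE = 1024   # size of the discrete audio-token codebook (assumed)
rng = np.random.default_rng(0)

def next_token_logits(context):
    # Stand-in for a trained Transformer over audio tokens.
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt, length):
    tokens = list(prompt)
    for _ in range(length):
        logits = next_token_logits(tokens)
        probs = np.exp(logits - logits.max())   # softmax over the codebook
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

audio_tokens = generate(prompt=[1, 2, 3], length=20)
# In a real system, a neural codec decoder would map these tokens back to a
# waveform; that step is omitted here.
```

The long-term consistency claimed above comes from conditioning each token on the full preceding sequence, which this loop mirrors in miniature.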