Stable Audio Open

19 Jul 2024 · Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.

Results from the Paper


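Task                      Dataset    Model              Metric Name  Metric Value  Global Rank
Audio Generation          AudioCaps  Stable Audio Open  CLAP_MS      0.34          # 5
Audio Generation          AudioCaps  Stable Audio Open  FD_openl3    78.24         # 3
Audio Generation          AudioCaps  Stable Audio Open  CLAP_LAION   0.35          # 9
Audio Generation          AudioCaps  Stable Audio Open  KL_passt     2.14          # 7
Text-to-Music Generation  MusicCaps  Stable Audio Open  FAD          3.51          # 11
Text-to-Music Generation  MusicCaps  Stable Audio Open  FD_openl3    127.20        # 3
Text-to-Music Generation  MusicCaps  Stable Audio Open  FD           36.42         # 5
Text-to-Music Generation  MusicCaps  Stable Audio Open  KL_passt     1.32          # 13
Text-to-Music Generation  MusicCaps  Stable Audio Open  IS           2.93          # 3
Text-to-Music Generation  MusicCaps  Stable Audio Open  CLAP_LAION   0.48          # 3
Text-to-Music Generation  MusicCaps  Stable Audio Open  CLAP_MS      0.49          # 2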
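
The FD_openl3 and FD rows report Fréchet-distance-style metrics, where lower values indicate generated audio that is statistically closer to the reference set. As a rough illustration only, and not the paper's evaluation code, the sketch below computes the Fréchet distance between Gaussians fitted to two sets of embeddings. It assumes the embeddings (for example, OpenL3 features) have already been extracted into arrays of shape (n_samples, dim); the function name `frechet_distance` is hypothetical.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    x, y: array-likes of shape (n_samples, dim), assumed to hold
    precomputed embeddings (hypothetical inputs for this sketch).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each set.
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))

    # Tr((cov_x cov_y)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)
```

Comparing a set with itself should give a distance near zero, and shifting every embedding by a constant changes only the mean term. Published scores additionally depend on the embedding model, the reference set, and preprocessing, none of which this sketch covers.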
