Generative Audio Models

Jukebox

Introduced by Dhariwal et al. in Jukebox: A Generative Model for Music

Jukebox is a model that generates music with singing in the raw audio domain. It tackles the long context of raw audio by using a multi-scale VQ-VAE to compress it to discrete codes, and then models those codes with autoregressive Transformers. It can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.
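The Transformer prior factorizes the code sequence autoregressively, i.e. $p(\mathbf{z}) = \prod_t p(z_t \mid z_{<t}, \text{conditioning})$, and music is generated by sampling codes one at a time. A minimal sketch of that sampling loop, where `next_token_probs` is a toy stand-in for the trained Transformer (its name, the dummy logits, and the small sizes are all illustrative assumptions, not Jukebox's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook_size, length = 16, 8  # illustrative; Jukebox's codebook is far larger

# Stand-in for the trained Transformer prior: any function mapping the code
# prefix plus conditioning (artist, genre) to next-token probabilities fits here.
def next_token_probs(prefix, artist_id, genre_id):
    logits = rng.normal(size=codebook_size) + 0.01 * (artist_id + genre_id)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Sample codes left to right: p(z) = prod_t p(z_t | z_<t, conditioning).
codes = []
for t in range(length):
    p = next_token_probs(codes, artist_id=3, genre_id=7)
    codes.append(int(rng.choice(codebook_size, p=p)))

print(codes)  # a discrete code sequence handed to the VQ-VAE decoder
```

In the real model, conditioning also includes lyrics tokens, and the sampled codes are decoded back to raw audio by the VQ-VAE decoder at the corresponding level.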

Three separate VQ-VAE models are trained at different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors $\mathbf{h}_{t}$, which are then quantized to the closest codebook vectors $\mathbf{e}_{z_{t}}$. The code $z_{t}$ is a discrete representation of the audio, and it is this sequence of codes that the prior is later trained on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level learns the highest degree of abstraction, since each of its tokens encodes a longer span of audio while the codebook size stays the same. Audio can be reconstructed from the codes at any single abstraction level; the least abstract bottom-level codes yield the highest-quality audio.
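The quantization step above can be sketched in a few lines of NumPy: each encoder output $\mathbf{h}_t$ is replaced by the index $z_t$ of its nearest codebook vector. The sizes below are illustrative only (Jukebox's actual codebook is much larger), and the random latents stand in for a trained encoder's outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook_size, latent_dim, seq_len = 16, 8, 10  # illustrative sizes

codebook = rng.normal(size=(codebook_size, latent_dim))  # e_k vectors
h = rng.normal(size=(seq_len, latent_dim))               # encoder outputs h_t

# Quantize: map each h_t to the index z_t of its nearest codebook vector
# (squared Euclidean distance), then look up the quantized vector e_{z_t}.
dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
z = dists.argmin(axis=1)   # discrete codes z_t: the prior's training targets
e_z = codebook[z]          # quantized vectors e_{z_t} passed to the decoder

print(z.shape, e_z.shape)  # (10,) (10, 8)
```

In the multi-scale setup, the same procedure runs at each level; only the temporal resolution of the `h_t` sequence differs.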

Source: Jukebox: A Generative Model for Music
