We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation.
Language models have been successfully used to model natural signals, such as images, speech, and music.
We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression.
In this paper, we propose NU-GAN, a new method for resampling audio from lower to higher sampling rates (upsampling).
In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques.
Unsupervised learning is about capturing dependencies between variables, driven by the contrast between probable and improbable configurations of those variables. This is often done either with a generative model, which samples only probable configurations, or with an energy function (an unnormalized log-density) that is low for probable configurations and high for improbable ones.
Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient.
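The intractability referred to here can be made concrete with the standard maximum-likelihood gradient for an energy-based model. Writing the model as $p_\theta(x) = \exp(-E_\theta(x))/Z(\theta)$ (a standard formulation, not specific to this paper), the gradient decomposes as:

\[
\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta E_\theta(x')\right],
\]

where the second term, an expectation under the model distribution itself, is the intractable part: computing it exactly requires the normalizing constant $Z(\theta)$ or samples from $p_\theta$, which typically must be approximated (e.g., by MCMC).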
We demonstrate a conditional autoregressive pipeline for efficient music recomposition, based on methods presented in van den Oord et al. (2017).
We present ObamaNet, the first architecture that generates both audio and synchronized photo-realistic lip-sync videos from any new text.
In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time.
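The generation scheme described here, producing one audio sample at a time with each new sample conditioned on all previously generated ones, can be illustrated with a minimal sketch. The model below (`predict_next`) is a toy placeholder standing in for a learned predictor, not the paper's architecture:

```python
# Minimal sketch of sample-by-sample autoregressive generation.
# `predict_next` is a hypothetical stand-in for a trained model that maps
# the history of generated samples to the next sample value.

def predict_next(history):
    # Toy rule: each sample is half the previous one (illustrative only).
    return 0.5 * history[-1]

def generate(n_samples, seed=1.0):
    samples = [seed]
    for _ in range(n_samples - 1):
        # Each step conditions on the full sequence generated so far.
        samples.append(predict_next(samples))
    return samples

waveform = generate(4)  # -> [1.0, 0.5, 0.25, 0.125]
```

In a real system the predictor would output a distribution over the next sample and one would draw from it, but the sequential conditioning loop is the same.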