Audio generation (synthesis) is the task of producing raw audio waveforms, such as speech, directly.
In this paper, we propose a novel model for unconditional audio generation that produces audio one sample at a time.
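To make the sample-at-a-time formulation concrete, the sketch below shows one way such an autoregressive generator can be wired up; the single-layer GRU, the 256-level quantization, and every name in the code (`TinySampleModel`, `generate`) are illustrative assumptions of ours, not the model proposed here.

```python
# Minimal sketch of sample-at-a-time autoregressive audio generation.
# Assumptions (not from this paper): a single-layer GRU over 256-level
# quantized samples; all class and function names are hypothetical.
import torch
import torch.nn as nn

class TinySampleModel(nn.Module):
    def __init__(self, n_levels=256, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_levels, hidden)   # embed previous sample
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_levels)       # distribution over next sample

    def forward(self, x, h=None):
        # x: (batch, time) integer sample indices
        z, h = self.rnn(self.embed(x), h)
        return self.head(z), h                        # logits: (batch, time, n_levels)

@torch.no_grad()
def generate(model, n_samples, n_levels=256):
    """Draw one quantized audio sample at a time, feeding each back in."""
    model.eval()
    x = torch.full((1, 1), n_levels // 2, dtype=torch.long)  # start at mid-level
    h, out = None, []
    for _ in range(n_samples):
        logits, h = model(x, h)
        probs = logits[:, -1].softmax(dim=-1)
        x = torch.multinomial(probs, 1)               # sample the next value
        out.append(x.item())
    return out  # integer indices; dequantize to obtain a waveform

samples = generate(TinySampleModel(), n_samples=1600)
```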
However, dilated convolutions suffer from gridding artifacts, which hamper the performance of DCNNs that employ them.
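As a quick illustration of where gridding comes from (our own numerical check, not taken from any cited work), the sketch below stacks kernel-size-3 convolutions that all share dilation rate 2 and enumerates which input offsets a single output position can see: only even offsets are ever reached, leaving a checkerboard of inputs that never contribute.

```python
# Sketch illustrating gridding in stacked dilated convolutions (our own
# illustration). With the same dilation rate r at every layer, an output
# position only depends on inputs spaced r apart, so (r-1)/r of the
# nearby inputs are never used: a "gridding" checkerboard.
import numpy as np

def receptive_indices(n_layers, kernel_size=3, dilation=2):
    """Input offsets (relative to one output) reached by the stacked convs."""
    offsets = {0}
    for _ in range(n_layers):
        taps = [dilation * k for k in range(-(kernel_size // 2), kernel_size // 2 + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

idx = receptive_indices(n_layers=3)
print(idx)                           # [-6, -4, -2, 0, 2, 4, 6]: even offsets only
print(all(i % 2 == 0 for i in idx))  # True: odd-offset inputs are never touched
```

Varying the dilation rate across layers (e.g., 1, 2, 4) restores coverage of all offsets, which is the usual remedy for gridding.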
Building on this, we introduce a method for descriptor-based synthesis and show that we can control an instrument's descriptors while preserving its timbral structure.
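One plausible way to realize such control (a minimal sketch under our own assumptions; the actual architecture may differ) is to condition a decoder on both a timbre latent `z` and a target descriptor vector `d`, then hold `z` fixed while sweeping `d` at synthesis time.

```python
# Minimal sketch of descriptor-conditioned decoding (our own assumption
# about how such control could be wired, not this paper's architecture):
# a decoder maps a timbre latent z plus a target descriptor d (e.g.
# spectral centroid, loudness) to an audio frame. Holding z fixed and
# sweeping d changes the descriptor while the timbre code stays unchanged.
import torch
import torch.nn as nn

class DescriptorDecoder(nn.Module):
    def __init__(self, z_dim=16, d_dim=2, frame=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + d_dim, 256), nn.ReLU(),
            nn.Linear(256, frame), nn.Tanh(),          # audio frame in [-1, 1]
        )

    def forward(self, z, d):
        return self.net(torch.cat([z, d], dim=-1))

dec = DescriptorDecoder()
z = torch.randn(1, 16)                  # fixed timbre latent
for centroid in (0.2, 0.5, 0.8):        # sweep one normalized descriptor
    frame = dec(z, torch.tensor([[centroid, 0.5]]))
```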
Subsets of the model's training data can be directly visualized in its 3D latent representation.
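Such a visualization amounts to a 3D scatter plot of encoded examples, colored by subset; in the sketch below, the random `latents` array stands in for actual encoder outputs and the subset labels are hypothetical.

```python
# Sketch of visualizing training-data subsets in a 3D latent space (our
# illustration; the latent codes and subset labels here are synthetic).
import numpy as np
import matplotlib.pyplot as plt

latents = np.random.randn(300, 3)            # stand-in for encoded examples
labels = np.repeat([0, 1, 2], 100)           # three training subsets

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for subset, color in zip((0, 1, 2), ("C0", "C1", "C2")):
    pts = latents[labels == subset]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=color, s=8,
               label=f"subset {subset}")
ax.set(xlabel="z1", ylabel="z2", zlabel="z3")
ax.legend()
plt.show()
```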