Audio generation (synthesis) is the task of generating raw audio such as speech.
Transformers are powerful sequence models, but they require time and memory that grow quadratically with the sequence length.
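To make the quadratic cost concrete, here is a minimal NumPy sketch (illustrative, not from any of the papers listed): the attention score matrix alone has shape (n, n), so its memory roughly quadruples every time the sequence length doubles.

```python
import numpy as np

def attention_scores(x: np.ndarray) -> np.ndarray:
    """x: (n, d) sequence of n token embeddings."""
    scores = x @ x.T / np.sqrt(x.shape[1])        # (n, n): quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

for n in (1_000, 2_000, 4_000):
    x = np.random.randn(n, 64).astype(np.float32)
    print(n, attention_scores(x).nbytes / 1e6, "MB")  # ~4x per doubling
```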
In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time.
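A hedged sketch of what sample-at-a-time generation looks like in practice, assuming a model that maps the recent context to a distribution over 256 mu-law quantization levels (the callable `model` and its interface are assumptions for illustration, not the paper's exact setup):

```python
import numpy as np

def generate(model, n_samples: int, context: int = 1024) -> np.ndarray:
    audio = np.zeros(context, dtype=np.int64)     # silence-padded history
    for _ in range(n_samples):
        probs = model(audio[-context:])           # (256,) next-sample distribution
        nxt = np.random.choice(256, p=probs)      # draw one quantized amplitude
        audio = np.append(audio, nxt)             # feed it back as context
    return audio[context:]                        # drop the padding

uniform = lambda ctx: np.full(256, 1 / 256)       # stand-in model: white noise
samples = generate(uniform, n_samples=100)
```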
Unlike existing models, which approach the problem through a block of cascaded dilated convolutional layers, our method addresses gridding artifacts by smoothing the dilated convolution itself.
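As a sketch of the idea (the smoothing here is a learned depthwise convolution spanning the gaps the dilation skips; the paper's exact smoothing operation and the layer sizes below may differ):

```python
import torch
import torch.nn as nn

class SmoothedDilatedConv1d(nn.Module):
    """Dilated convolution preceded by a smoothing filter over its gaps."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # Depthwise smoothing filter covering the inputs the dilation skips,
        # so neighbouring outputs share information and gridding is reduced.
        self.smooth = nn.Conv1d(channels, channels, 2 * dilation - 1,
                                padding=dilation - 1, groups=channels, bias=False)
        self.dilated = nn.Conv1d(channels, channels, kernel_size,
                                 padding=dilation * (kernel_size - 1) // 2,
                                 dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dilated(self.smooth(x))

x = torch.randn(1, 32, 16000)                     # (batch, channels, samples)
y = SmoothedDilatedConv1d(32, kernel_size=3, dilation=4)(x)
print(y.shape)                                    # length preserved: (1, 32, 16000)
```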
End-to-end models for raw audio generation remain a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.
Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while preserving its timbre structure.
Subsets of its training data can be visualized directly in the 3D latent representation.
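Purely as an illustration (every name and size below is an assumption, not the paper's architecture), a decoder conditioned jointly on a 3-D latent code and a target descriptor value shows how descriptor control and a directly plottable latent space can coexist:

```python
import torch
import torch.nn as nn

class DescriptorDecoder(nn.Module):
    """Decode audio from a 3-D latent plus one scalar descriptor target."""

    def __init__(self, latent_dim: int = 3, n_samples: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, 256),       # +1 input for the descriptor
            nn.ReLU(),
            nn.Linear(256, n_samples),
            nn.Tanh(),                            # raw audio in [-1, 1]
        )

    def forward(self, z: torch.Tensor, descriptor: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, descriptor], dim=-1))

z = torch.randn(8, 3)                             # 3-D latents: plot them directly
centroid = torch.linspace(0, 1, 8).unsqueeze(-1)  # swept descriptor target
audio = DescriptorDecoder()(z, centroid)          # (8, 1024) audio snippets
```

Sweeping the descriptor while holding z fixed varies the controlled attribute; moving z with the descriptor fixed explores the timbre structure.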