Audio generation (synthesis) is the task of generating raw audio such as speech.
Transformers are powerful sequence models, but they require time and memory that grow quadratically with the sequence length.
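The quadratic cost comes from dense self-attention: the attention scores form an N×N matrix over a length-N sequence, so doubling the sequence roughly quadruples the memory. A minimal NumPy sketch of this scaling (illustrative only; not taken from any of the papers listed here):

```python
import numpy as np

def dense_attention(seq_len: int, d_model: int = 64):
    """Toy dense self-attention; the score matrix is seq_len x seq_len."""
    x = np.random.randn(seq_len, d_model)
    q, k, v = x, x, x                                  # toy projections (identity)
    scores = q @ k.T / np.sqrt(d_model)                # (seq_len, seq_len) -> O(N^2) memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, scores.nbytes

# Doubling the sequence length quadruples the score-matrix memory:
_, m1 = dense_attention(1024)
_, m2 = dense_attention(2048)
print(m1, m2, m2 / m1)                                 # ratio ~4
```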
In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time.
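Generating one sample at a time means the model is autoregressive: each new sample is drawn from a distribution conditioned on the samples generated so far. A schematic generation loop, where `model` is a hypothetical network that maps a waveform history to a categorical distribution over quantized sample values (a sketch of the general idea, not the paper's actual architecture):

```python
import numpy as np

def generate(model, n_samples: int, receptive_field: int = 1024, n_classes: int = 256):
    """Autoregressive generation of quantized audio, one sample per step.

    `model(history)` is assumed to return a probability vector over the
    n_classes quantization bins for the next sample (hypothetical interface).
    """
    audio = np.zeros(receptive_field, dtype=np.int64)   # silence-padded history
    for _ in range(n_samples):
        probs = model(audio[-receptive_field:])          # condition on recent history
        next_sample = np.random.choice(n_classes, p=probs)
        audio = np.append(audio, next_sample)
    return audio[receptive_field:]                       # drop the initial padding
```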
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps: at a standard 16 kHz sampling rate one second is 16,000 samples, and at CD quality (44.1 kHz) it is 44,100.
End-to-end models for raw audio generation are a challenge, especially when they must work with non-parallel data, which is a desirable setup in many situations.
Unlike existing models, which tackle the problem with blocks of cascaded dilated convolutional layers, our methods address the gridding artifacts by smoothing the dilated convolution itself.
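One way to smooth a dilated convolution is to apply a small shared depthwise filter before the dilated kernel, so neighbouring inputs interact instead of falling onto disjoint sub-grids. A PyTorch sketch under that assumption (illustrative only; the `SmoothedDilatedConv1d` module and its filter sizes are my own, not necessarily the exact operator the paper proposes):

```python
import torch
import torch.nn as nn

class SmoothedDilatedConv1d(nn.Module):
    """Dilated 1-D convolution preceded by a shared depthwise smoothing filter.

    The smoothing filter lets neighbouring inputs interact before dilation,
    which is one way to avoid the gridding artifacts of a bare dilated kernel.
    """

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 8):
        super().__init__()
        smooth_k = 2 * dilation - 1              # covers the gaps between dilated taps
        self.smooth = nn.Conv1d(channels, channels, smooth_k,
                                padding=smooth_k // 2, groups=channels, bias=False)
        nn.init.constant_(self.smooth.weight, 1.0 / smooth_k)   # start as an average
        self.dilated = nn.Conv1d(channels, channels, kernel_size,
                                 dilation=dilation,
                                 padding=dilation * (kernel_size - 1) // 2)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.dilated(self.smooth(x))

x = torch.randn(1, 16, 16000)
print(SmoothedDilatedConv1d(16)(x).shape)        # torch.Size([1, 16, 16000])
```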
Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while keeping its timbre structure.