Audio generation (synthesis) is the task of generating raw audio such as speech.
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps.
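The scale of the problem follows directly from the sample rate; a minimal sketch of the arithmetic (the rates shown are common choices, not ones prescribed by the text):

```python
# Raw waveform samples ("timesteps") per second at common audio sample rates.
for sr in (16_000, 24_000, 44_100):
    print(f"{sr} Hz -> {sr:,} timesteps per second, {sr * 60:,} per minute")
```

Even at the low end, modeling one second of audio means modeling a sequence of 16,000 values, which is why long-range structure is hard to capture.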
End-to-end models for raw audio generation remain challenging, especially when they must work with non-parallel data, a desirable setup in many situations.
The model's training-data subsets can be visualized directly in its 3D latent representation.
Unlike existing models, which explore solutions by focusing on a block of cascaded dilated convolutional layers, our methods address the gridding artifacts by smoothing the dilated convolution itself.
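To see where gridding artifacts come from, consider a naive 1-D dilated convolution (a hypothetical NumPy sketch, not the paper's implementation): with dilation 2, even-indexed outputs mix only even-indexed inputs and odd-indexed outputs only odd-indexed ones, so adjacent outputs are computed from disjoint subgrids of the signal.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Naive 1-D dilated convolution with valid padding."""
    taps = len(w)
    span = (taps - 1) * dilation + 1          # receptive field of one output
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        # Each output samples the input on a sparse grid of stride `dilation`.
        out[i] = sum(w[k] * x[i + k * dilation] for k in range(taps))
    return out

x = np.arange(8, dtype=float)
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, dilation=2))  # -> [ 6.  9. 12. 15.]
```

Here output 0 sees x[0], x[2], x[4] while output 1 sees x[1], x[3], x[5]: two interleaved grids that never exchange information within the layer, which is the structure the smoothing in the quoted work is meant to repair.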
Based on this, we introduce a method for descriptor-based synthesis and show that we can control the descriptors of an instrument while keeping its timbre structure.
Transformers are powerful sequence models, but they require time and memory that grow quadratically with the sequence length.
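The quadratic cost is visible in plain scaled dot-product attention; a minimal NumPy sketch (an illustration of standard attention, not any specific paper's variant) makes the (n, n) score matrix explicit:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention. The (n, n) score matrix is what makes
    time and memory quadratic in the sequence length n."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)              # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# The score matrix alone grows with n**2 entries:
for n in (256, 512, 1024):
    print(f"n={n}: score matrix holds {n * n:,} entries")
```

For raw audio, where n can be tens of thousands of timesteps per second, this scaling is the central obstacle that sparse and low-rank attention variants try to remove.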