Audio generation (synthesis) is the task of generating raw audio such as speech.
(Image credit: MelNet)
Transformers are powerful sequence models, but they require time and memory that grow quadratically with the sequence length.
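The quadratic cost comes from the attention score matrix, which compares every position with every other. A minimal NumPy sketch (not any particular paper's model) makes the scaling visible: the score matrix for a length-`n` sequence is `n x n`, so doubling the sequence length quadruples its memory.

```python
import numpy as np

def attention(q, k, v):
    """Minimal single-head scaled dot-product attention (illustrative only)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # shape (seq_len, seq_len): quadratic in seq_len
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, scores.shape

rng = np.random.default_rng(0)
for n in (64, 128, 256):
    x = rng.standard_normal((n, 16))
    _, score_shape = attention(x, x, x)
    print(n, score_shape)                      # score matrix is n x n
```

For raw audio, where a few seconds of waveform already contain tens of thousands of samples, this `n x n` matrix is what makes vanilla attention impractical at the sample level.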
In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time.
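Sample-at-a-time generation means the model emits one waveform value per step, conditioned on everything generated so far. The loop below is a hypothetical toy sketch of that autoregressive pattern (the predictor here is a made-up damped average, not the paper's architecture):

```python
import numpy as np

def generate(model_step, n_samples, context_len=4, seed=0):
    """Autoregressive sample-at-a-time generation loop (toy sketch).

    model_step: callable mapping a context array to the next-sample mean.
    """
    rng = np.random.default_rng(seed)
    audio = []
    for _ in range(n_samples):
        # Condition on the most recent samples (zero context at the start).
        context = np.array(audio[-context_len:] or [0.0])
        mean = model_step(context)
        # Draw the next sample around the predicted mean.
        audio.append(mean + 0.01 * rng.standard_normal())
    return np.array(audio)

# Toy predictor: next sample is a damped average of the recent context.
wave = generate(lambda ctx: 0.9 * ctx.mean() + 0.1, n_samples=100)
print(wave.shape)
```

The loop is inherently sequential, which is why sample-level autoregressive models trade fast parallel training for slow step-by-step generation.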
End-to-end models for raw audio generation are challenging, especially when they must work with non-parallel data, which is a desirable setup in many situations.
Unlike existing models, which explore solutions focused on blocks of cascaded dilated convolutional layers, our method addresses gridding artifacts by smoothing the dilated convolution itself.
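Gridding artifacts arise because a dilated convolution samples its input on a sparse, fixed grid, so neighbouring input values never interact within one layer. A rough sketch of the idea, under the assumption that "smoothing" means averaging neighbouring inputs before the dilated taps are applied (the actual smoothing operator in the paper may differ):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Naive 1-D dilated convolution with valid padding."""
    k = len(w)
    span = (k - 1) * dilation + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        # Taps land on a sparse grid spaced `dilation` apart: the gridding source.
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

def smoothed_dilated_conv1d(x, w, dilation):
    """Hypothetical smoothed variant: average a `dilation`-wide neighbourhood
    first, so each tap mixes in the values the sparse grid would skip."""
    smooth = np.convolve(x, np.ones(dilation) / dilation, mode="same")
    return dilated_conv1d(smooth, w, dilation)

x = np.arange(10.0)
w = np.array([0.25, 0.5, 0.25])
print(dilated_conv1d(x, w, 2).shape, smoothed_dilated_conv1d(x, w, 2).shape)
```

With the smoothing applied inside the operator, every input sample contributes to every output, rather than only the samples on one dilation grid.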