Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristics of an unseen speaker.
With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise.
Two modules are proposed and added to the end-to-end TTS framework: an intonation predictor and an intonation encoder.
The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization.
Flow-based generative models are composed of invertible transformations between two random variables of the same dimension.
Ranked #1 on Point Cloud Generation on ShapeNet Airplane
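As a rough illustration of the flow-based idea mentioned above (an invertible transformation between two random variables of the same dimension), the sketch below uses a single elementwise affine map with a standard normal base distribution. The names `forward`, `inverse`, `log_prob`, `scale`, and `shift` are hypothetical and chosen only for this example; they are not taken from any of the listed papers, and practical flows stack many learned invertible layers rather than one fixed map.

```python
import math
import random


def forward(z, scale, shift):
    """Map a base sample z to a data sample x of the same dimension."""
    return [s * zi + b for zi, s, b in zip(z, scale, shift)]


def inverse(x, scale, shift):
    """Exact inverse of forward(); requires every scale entry to be nonzero."""
    return [(xi - b) / s for xi, s, b in zip(x, scale, shift)]


def standard_normal_logpdf(z):
    """Log-density of an isotropic standard normal at z."""
    return sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)


def log_prob(x, scale, shift):
    """Change of variables: log p_x(x) = log p_z(f^{-1}(x)) + log|det J_{f^{-1}}|."""
    z = inverse(x, scale, shift)
    # The Jacobian of the inverse map is diagonal with entries 1 / scale_i.
    log_det_inverse = -sum(math.log(abs(s)) for s in scale)
    return standard_normal_logpdf(z) + log_det_inverse


# Illustrative parameters (assumed, not learned).
random.seed(0)
scale = [2.0, 0.5, 1.5]
shift = [1.0, -1.0, 0.0]

z = [random.gauss(0.0, 1.0) for _ in scale]  # sample from the base distribution
x = forward(z, scale, shift)                 # transformed sample, same dimension
z_back = inverse(x, scale, shift)            # recovers z up to rounding error
```

Because the map is invertible and its Jacobian determinant is cheap to compute, `log_prob` gives an exact likelihood for `x`, which is the property that lets flow-based models be trained by maximum likelihood.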