In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS.
First, we propose a duration model conditioned on phrasing that improves duration prediction and provides better modelling of pauses.
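For illustration only, the sketch below shows one way a phrasing-conditioned duration predictor could be wired on top of a Transformer encoder; the module, its sizes, and the per-phone phrase-break features are assumptions made for this example, not the implementation described here.

```python
import torch
import torch.nn as nn


class PhrasingConditionedDurationPredictor(nn.Module):
    """Toy duration model: phone encodings plus phrasing features -> per-phone log-durations."""

    def __init__(self, enc_dim: int = 256, phrasing_dim: int = 8, hidden: int = 256):
        super().__init__()
        self.proj = nn.Linear(enc_dim + phrasing_dim, hidden)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(hidden, 1)

    def forward(self, phone_enc: torch.Tensor, phrasing: torch.Tensor) -> torch.Tensor:
        # phone_enc: (B, T, enc_dim) encoder outputs; phrasing: (B, T, phrasing_dim)
        # per-phone phrasing features, e.g. a phrase-break strength encoding (an assumption).
        x = torch.relu(self.proj(torch.cat([phone_enc, phrasing], dim=-1)))
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # local context over phones
        return self.out(x).squeeze(-1)  # (B, T) predicted log-durations in frames


# Standalone usage with random tensors standing in for a Transformer encoder's output.
phone_enc = torch.randn(2, 37, 256)
phrasing = torch.randn(2, 37, 8)
print(PhrasingConditionedDurationPredictor()(phone_enc, phrasing).shape)  # torch.Size([2, 37])
```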
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers.
We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with improved coarse and fine-grained prosody.
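As an illustrative sketch of the multi-scale idea only, the toy decoder below predicts a time-downsampled (coarse) mel-spectrogram and then refines it at the full frame rate; the layer choices, dimensions, and average-pooling downsampling are assumptions for this example and do not reproduce the CC2/MSS architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoScaleSpectrogramDecoder(nn.Module):
    """Toy two-scale decoder: coarse, time-downsampled mel first, then a full-rate refinement."""

    def __init__(self, cond_dim: int = 256, n_mels: int = 80, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        self.coarse = nn.GRU(cond_dim, n_mels, batch_first=True)          # coarse-scale decoder
        self.fine = nn.GRU(cond_dim + n_mels, n_mels, batch_first=True)   # fine-scale refinement

    def forward(self, frame_cond: torch.Tensor):
        # frame_cond: (B, T, cond_dim) frame-rate conditioning (e.g. upsampled phone encodings).
        T = frame_cond.size(1)
        coarse_cond = F.avg_pool1d(frame_cond.transpose(1, 2), self.downsample).transpose(1, 2)
        coarse_mel, _ = self.coarse(coarse_cond)                          # (B, T // downsample, n_mels)
        upsampled = F.interpolate(coarse_mel.transpose(1, 2), size=T).transpose(1, 2)
        fine_mel, _ = self.fine(torch.cat([frame_cond, upsampled], dim=-1))  # (B, T, n_mels)
        return coarse_mel, fine_mel


coarse, fine = TwoScaleSpectrogramDecoder()(torch.randn(2, 120, 256))
print(coarse.shape, fine.shape)  # torch.Size([2, 30, 80]) torch.Size([2, 120, 80])
```

The design intent in this sketch is simply that the coarse scale captures slow-moving, sentence-level prosodic structure while the fine scale fills in frame-level detail.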
Many factors influence speech, yielding different renditions of a given sentence.
In Stage II, we propose a novel method to sample from the learnt prosodic distribution using the contextual information available in the text.
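Purely as a sketch of what such sampling could look like, the example below assumes prosody is represented by per-phone latent vectors and lets a context encoder output the parameters of a Gaussian from which latents are drawn; the context features, dimensions, and temperature parameter are illustrative assumptions rather than the method used here.

```python
import torch
import torch.nn as nn


class ContextualProsodySampler(nn.Module):
    """Maps textual context features to a Gaussian over prosody latents and samples from it."""

    def __init__(self, ctx_dim: int = 768, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, ctx: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        # ctx: (B, T, ctx_dim) context features aligned to phones (an assumption),
        # e.g. embeddings of the surrounding text.
        h = self.net(ctx)
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        return mu + temperature * std * torch.randn_like(std)  # sampled prosody latents


# Usage with random context features; lower temperature gives more conservative prosody.
ctx = torch.randn(2, 37, 768)
print(ContextualProsodySampler()(ctx, temperature=0.7).shape)  # torch.Size([2, 37, 32])
```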