Text-to-Speech Models


Introduced by Łańcucki in FastPitch: Parallel Text-to-speech with Pitch Prediction

FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the Figure. It is composed mainly of two feed-forward Transformer (FFTr) stacks: the first operates at the resolution of the input tokens, the second at the resolution of the output frames. Let $\mathbf{x}=\left(x_{1}, \ldots, x_{n}\right)$ be the sequence of input lexical units, and $\mathbf{y}=\left(y_{1}, \ldots, y_{t}\right)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $\mathbf{h}=\operatorname{FFTr}(\mathbf{x})$. The hidden representation $\mathbf{h}$ is used to predict the duration and average pitch of every character with a 1-D CNN:

$$ \hat{\mathbf{d}}=\operatorname{DurationPredictor}(\mathbf{h}), \quad \hat{\mathbf{p}}=\operatorname{PitchPredictor}(\mathbf{h}) $$

where $\hat{\mathbf{d}} \in \mathbb{N}^{n}$ and $\hat{\mathbf{p}} \in \mathbb{R}^{n}$. Next, the pitch is projected to match the dimensionality of the hidden representation $\mathbf{h} \in \mathbb{R}^{n \times d}$ and added to $\mathbf{h}$. The resulting sum $\mathbf{g}$ is discretely upsampled and passed to the output FFTr, which produces the output mel-spectrogram sequence:

$$ \mathbf{g}=\mathbf{h}+\operatorname{PitchEmbedding}(\mathbf{p}) $$

$$ \hat{\mathbf{y}}=\operatorname{FFTr}\left([\underbrace{g_{1}, \ldots, g_{1}}_{d_{1}}, \ldots, \underbrace{g_{n}, \ldots, g_{n}}_{d_{n}}]\right) $$
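
The two equations above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `add_pitch_embedding` and `upsample` are hypothetical, and the pitch embedding is assumed here to be a simple linear projection of each scalar pitch value to $d$ dimensions, which the text does not specify. The output FFTr stack is omitted.

```python
from typing import List

Vector = List[float]


def add_pitch_embedding(
    h: List[Vector], pitch: List[float], weight: Vector, bias: Vector
) -> List[Vector]:
    """Compute g_i = h_i + PitchEmbedding(p_i) for every input token.

    Assumption: PitchEmbedding is modeled as a linear map from the scalar
    pitch p_i to a d-dimensional vector (p_i * weight + bias).
    """
    if len(h) != len(pitch):
        raise ValueError("h and pitch must have one entry per input token")
    return [
        [h_ij + p_i * w_j + b_j for h_ij, w_j, b_j in zip(h_i, weight, bias)]
        for h_i, p_i in zip(h, pitch)
    ]


def upsample(g: List[Vector], durations: List[int]) -> List[Vector]:
    """Discretely upsample: repeat each g_i exactly durations[i] times."""
    if len(g) != len(durations):
        raise ValueError("g and durations must have one entry per input token")
    frames: List[Vector] = []
    for g_i, d_i in zip(g, durations):
        # Copy the vector for each output frame so frames do not share storage.
        frames.extend(list(g_i) for _ in range(d_i))
    return frames


if __name__ == "__main__":
    h = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # n = 3 tokens, d = 2
    g = add_pitch_embedding(h, [1.0, 0.0, -1.0], weight=[0.5, -0.5], bias=[0.0, 0.0])
    frames = upsample(g, durations=[2, 0, 3])
    print(len(frames))  # 5 frames = 2 + 0 + 3
```

The number of output frames equals the sum of the durations, so a token with duration 0 contributes no frames at all.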

Ground-truth $\mathbf{p}$ and $\mathbf{d}$ are used during training, and predicted $\hat{\mathbf{p}}$ and $\hat{\mathbf{d}}$ are used during inference. The model minimizes the mean-squared error (MSE) between the predicted and ground-truth modalities:

$$ \mathcal{L}=\|\hat{\mathbf{y}}-\mathbf{y}\|_{2}^{2}+\alpha\|\hat{\mathbf{p}}-\mathbf{p}\|_{2}^{2}+\gamma\|\hat{\mathbf{d}}-\mathbf{d}\|_{2}^{2} $$
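
As a rough illustration of how the three terms combine, the sketch below evaluates the loss on plain Python lists. The names `squared_l2` and `fastpitch_loss` are hypothetical, and the weights `alpha` and `gamma` are left as required arguments because their values are not given here. The code follows the formula as written (squared L2 norms, summed rather than averaged); an actual implementation may normalize each term by the number of elements.

```python
from typing import List, Sequence


def squared_l2(pred: Sequence[float], target: Sequence[float]) -> float:
    """Squared L2 norm of the difference between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((a - b) ** 2 for a, b in zip(pred, target))


def fastpitch_loss(
    mel_pred: List[List[float]],
    mel_true: List[List[float]],
    pitch_pred: List[float],
    pitch_true: List[float],
    dur_pred: List[float],
    dur_true: List[float],
    alpha: float,
    gamma: float,
) -> float:
    """L = ||y_hat - y||^2 + alpha * ||p_hat - p||^2 + gamma * ||d_hat - d||^2."""
    if len(mel_pred) != len(mel_true):
        raise ValueError("mel_pred and mel_true must have the same number of frames")
    # Sum the squared error over every mel frame.
    mel_term = sum(squared_l2(f_pred, f_true) for f_pred, f_true in zip(mel_pred, mel_true))
    pitch_term = squared_l2(pitch_pred, pitch_true)
    dur_term = squared_l2(dur_pred, dur_true)
    return mel_term + alpha * pitch_term + gamma * dur_term


if __name__ == "__main__":
    loss = fastpitch_loss(
        mel_pred=[[1.0, 2.0], [3.0, 4.0]],
        mel_true=[[1.0, 1.0], [3.0, 2.0]],
        pitch_pred=[0.0, 2.0],
        pitch_true=[1.0, 2.0],
        dur_pred=[2.0, 3.0],
        dur_true=[2.0, 5.0],
        alpha=0.5,  # illustrative weight, not a value from the source
        gamma=0.25,  # illustrative weight, not a value from the source
    )
    print(loss)  # 5.0 + 0.5 * 1.0 + 0.25 * 4.0 = 6.5
```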

Source: FastPitch: Parallel Text-to-speech with Pitch Prediction

