While many speech synthesis systems based on deep neural networks are thoroughly evaluated and released for free use in English, models for languages with far fewer active speakers, such as German, are rarely trained and most often not published for common use. This work covers specific challenges in training text-to-speech models for the German language, including dataset selection and data preprocessing, and presents the training process for multiple models of an end-to-end text-to-speech system based on a combination of Tacotron 2 and Multi-Band MelGAN. All model compositions were evaluated using the mean opinion score, which yielded results comparable to those of models in the literature that are trained and evaluated on English datasets. In addition, empirical analyses based on subjective user experience identified distinct aspects that influence the quality of such systems. All trained models are released for public use.

Task                      Dataset                       Model       Metric Name         Metric Value  Global Rank
Text-To-Speech Synthesis  HUI speech corpus             Tacotron 2  Mean Opinion Score  3.74          # 1
Text-To-Speech Synthesis  Thorsten voice 21.02 neutral  Tacotron 2  Mean Opinion Score  3.49          # 1
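
The mean opinion score reported above is the arithmetic mean of listener ratings. As a minimal sketch of that calculation, assuming ratings on the common 1 to 5 scale and a normal-approximation 95% confidence interval, the following may help. The function name `mean_opinion_score` and the ratings are hypothetical illustrations and are not taken from the paper, whose exact rating procedure is not described here.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, CI half-width) for a list of listener ratings.

    Assumes ratings on a 1-5 scale and uses a normal approximation
    for the confidence interval (z=1.96 corresponds to roughly 95%).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1..5")
    mos = statistics.mean(ratings)
    # Standard error of the mean, scaled by z for the interval half-width.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one synthesized utterance (not data from the paper).
ratings = [4, 3, 4, 5, 3, 4, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of ratings the interval is wide, which is one reason listening tests usually collect many ratings per system before comparing scores.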

