Speech synthesis is the task of generating speech from text.

Please note that the state-of-the-art tables here are not directly comparable between studies, as they use mean opinion score (MOS) as a metric and each study collects ratings on different samples from Amazon Mechanical Turk.
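To make the comparability caveat concrete, here is a minimal sketch of how a mean opinion score and its uncertainty might be computed from listener ratings. It assumes a 1-5 rating scale and a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and the ratings are hypothetical, and this is not the evaluation protocol of any paper listed here.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the half-width of an approximate 95% confidence
    interval (normal approximation, an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, z * sem


# Hypothetical ratings from two separate studies (made-up numbers).
study_a = [4, 5, 4, 4, 3, 5, 4, 4]
study_b = [5, 4, 5, 4, 5, 4, 4, 5]

for name, ratings in (("study A", study_a), ("study B", study_b)):
    mos, half_width = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

Even when two intervals do not overlap, the scores come from different listeners rating different utterances, so a gap between studies may reflect the evaluation setup rather than the systems themselves.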

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

9 Sep 2019

We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of speech, and presenting a selection of speech or text as context and evaluating the subsequent speech.

DurIAN: Duration Informed Attention Network For Multimodal Synthesis

4 Sep 2019

In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously.

Maximizing Mutual Information for Tacotron

30 Aug 2019

What is more, we provide an indicator to detect errors in the predicted acoustic features as a byproduct.

Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck

19 Aug 2019

We propose a multimodal information bottleneck approach that learns the correspondence between modalities from unpaired data (image and speech) by leveraging the shared modality (text).

DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis

19 Jul 2019

Although conventional DNN-based speaker embedding such as a $d$-vector can be applied to multi-speaker modeling in speech synthesis, it does not correlate with the subjective inter-speaker similarity and is not necessarily an appropriate speaker representation for open speakers whose speech utterances are not included in the training data.

Multi-Speaker End-to-End Speech Synthesis

9 Jul 2019

In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers.

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

9 Jul 2019

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages.

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

26 Jun 2019

We present RUSLAN -- a new open Russian spoken language corpus for the text-to-speech task.

A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

18 Jun 2019

In this study, we propose a novel speech synthesis model, which can be adapted to unseen speakers by fine-tuning part of or all of the network using either transcribed or untranscribed speech.

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

17 Jun 2019

An input text is passed simultaneously into BERT and the Tacotron-2 encoder.