This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023.
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions.
This paper describes the design of NNSVS, an open-source software toolkit for neural network-based singing voice synthesis research.
Neural audio super-resolution models are typically trained on low- and high-resolution audio signal pairs.
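Such pairs are commonly obtained by synthesizing the low-resolution signal from the high-resolution recording. A minimal numpy sketch of one way to do this (FFT-domain truncation, acting as an ideal low-pass filter plus decimation; the function name is illustrative, not from the paper):

```python
import numpy as np

def downsample_fft(x, factor=2):
    """Create a low-resolution copy of x by FFT-domain truncation.

    Keeping only the bins below the new Nyquist frequency acts as an
    ideal low-pass filter followed by decimation, so the (low, high)
    pair shares identical content in the retained band.
    """
    n_low = len(x) // factor
    spec = np.fft.rfft(x)[: n_low // 2 + 1] / factor
    return np.fft.irfft(spec, n_low)
```

A tone below the new Nyquist frequency passes through unchanged apart from the sample rate, which is exactly the property a super-resolution training pair needs.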
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform.
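The speed advantage of such models comes from replacing the final upsampling layers of a neural vocoder with a deterministic inverse short-time Fourier transform. A generic numpy sketch of the overlap-add iSTFT step (not the model's actual implementation; window and hop choices are illustrative):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Windowed short-time Fourier transform (analysis side)."""
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(win * x[i:i + n_fft])
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def istft(spec, n_fft=512, hop=128):
    """Overlap-add inverse STFT with squared-window normalization."""
    win = np.hanning(n_fft)
    n = hop * (len(spec) - 1) + n_fft
    out = np.zeros(n)
    norm = np.zeros(n)
    for k, frame in enumerate(spec):
        s = k * hop
        out[s:s + n_fft] += win * np.fft.irfft(frame, n_fft)
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

In an iSTFT-based TTS model, the network only has to predict the complex spectrogram (or magnitude and phase) frames; the waveform itself falls out of this cheap, fixed transform.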
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
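A sample-level sinusoidal source of this kind is typically built by integrating a per-sample f0 contour into an instantaneous phase. A minimal numpy sketch under that assumption (names and the unvoiced convention are illustrative, not the paper's exact formulation):

```python
import numpy as np

def sine_source(f0, sr=24000):
    """Generate a sample-level sinusoidal excitation from an f0 contour.

    f0: per-sample fundamental frequency in Hz (0 marks unvoiced samples).
    The instantaneous phase is the cumulative sum of per-sample phase
    increments, so the sine tracks the pitch contour exactly.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    phase = 2 * np.pi * np.cumsum(f0 / sr)
    source = np.sin(phase)
    source[f0 <= 0] = 0.0  # silence unvoiced regions
    return source
```

Because the source already oscillates at the target pitch, the waveform decoder only needs to shape its timbre rather than infer periodicity from scratch.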
In the proposed method, we first adopt a variational autoencoder whose posterior distribution is utilized to extract latent features representing acoustic similarity between the recorded and synthetic corpora.
Because pitch-shift data augmentation covers a wide variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models, even when only 1,000 utterances of the target speaker's neutral data are available.
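The core of pitch-shift augmentation is the semitone-to-frequency-ratio mapping applied consistently to the audio and its pitch labels. A small sketch of the label side (function names are illustrative):

```python
import numpy as np

def semitone_ratio(n_steps):
    """Frequency ratio for a shift of n_steps semitones (12 = one octave)."""
    return 2.0 ** (n_steps / 12.0)

def shift_f0(f0, n_steps):
    """Shift an f0 contour by n_steps semitones, keeping unvoiced (0) frames at 0."""
    f0 = np.asarray(f0, dtype=np.float64)
    return np.where(f0 > 0, f0 * semitone_ratio(n_steps), 0.0)
```

Shifting each utterance by, say, ±1 and ±2 semitones multiplies the pitch range seen in training without recording any new data, which is what makes a small neutral corpus viable.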
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.
We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model, a.k.a. BERT, and explicit features extracted from a BiLSTM with linguistic features.
This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems.
In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN).
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.
1 code implementation • 13 Sep 2019 • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).
As this process encourages the student to model the distribution of realistic speech waveforms, the perceptual quality of the synthesized speech becomes much more natural.