no code implementations • NAACL (ACL) 2022 • Hwa-Yeon Kim, Jong-Hwan Kim, Jae-Min Kim
Autoregressive transformer (ART)-based grapheme-to-phoneme (G2P) models have been proposed for bi/multilingual text-to-speech systems.
no code implementations • 28 Oct 2022 • Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
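The sample-level sinusoidal source described above can be sketched in a few lines: integrate a per-sample F0 contour to get an instantaneous phase and take its sine. This is a minimal numpy illustration of the general technique, not the paper's periodicity generator; the function name, amplitude, and the zero-F0 convention for unvoiced samples are assumptions.

```python
import numpy as np

def sinusoidal_source(f0, sample_rate=24000, amplitude=0.1):
    """Generate a sample-level sinusoidal excitation from an F0 contour.

    `f0` is a per-sample fundamental-frequency array (Hz). Unvoiced
    samples are marked with f0 == 0 and produce silence here; real
    systems typically substitute noise in those regions instead.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    # Integrate instantaneous frequency to obtain the running phase.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    source = amplitude * np.sin(phase)
    source[f0 == 0.0] = 0.0  # silence in unvoiced regions
    return source
```

For example, `sinusoidal_source(np.full(24000, 220.0))` yields one second of a 220 Hz tone at a 24 kHz sampling rate.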
no code implementations • 30 Jun 2022 • Eunwoo Song, Ryuichi Yamamoto, Ohsung Kwon, Chan-Ho Song, Min-Jae Hwang, Suhyeon Oh, Hyun-Wook Yoon, Jin-Seob Kim, Jae-Min Kim
In the proposed method, we first adopt a variational autoencoder whose posterior distribution is utilized to extract latent features representing acoustic similarity between the recorded and synthetic corpora.
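Extracting latent features from a VAE posterior typically relies on the standard reparameterization trick, z = mu + sigma * eps with eps drawn from a unit Gaussian. A minimal numpy sketch of that sampling step is below; the function name is hypothetical, and in the actual system `mu` and `log_var` would come from an encoder network applied to acoustic features.

```python
import numpy as np

def sample_posterior(mu, log_var, rng=None):
    """Draw a latent feature z ~ N(mu, exp(log_var)) via the
    reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(np.shape(mu))
    # sigma = exp(log_var / 2); keeping log-variance is numerically safer.
    return np.asarray(mu) + np.exp(0.5 * np.asarray(log_var)) * eps
```

Parameterizing the encoder output as a log-variance (rather than a raw standard deviation) keeps the scale strictly positive without clamping, which is the usual design choice in VAE implementations.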
no code implementations • 21 Apr 2022 • Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana
Because pitch-shift data augmentation covers a wide range of pitch dynamics, it greatly stabilizes training for both VC and TTS models, even when only 1,000 utterances of the target speaker's neutral data are available.
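A pitch shift by `n` semitones scales every frequency by 2**(n/12). The naive numpy sketch below does this by resampling; note that this also rescales the duration, so practical augmentation pipelines pair it with a duration-preserving method (e.g. WSOLA or a phase vocoder). This is an illustrative assumption, not the paper's augmentation recipe.

```python
import numpy as np

def pitch_shift_resample(wave, semitones):
    """Naive pitch shift by linear-interpolation resampling.

    Reading the input `rate` times faster multiplies all frequencies
    by `rate` but shortens the signal by the same factor.
    """
    wave = np.asarray(wave, dtype=np.float64)
    rate = 2.0 ** (semitones / 12.0)
    n_out = int(round(len(wave) / rate))
    positions = np.arange(n_out) * rate  # fractional read positions
    return np.interp(positions, np.arange(len(wave)), wave)
```

Shifting by +12 semitones (one octave) halves the length and doubles the perceived pitch; shifting by 0 returns the input unchanged.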
no code implementations • 27 Oct 2020 • Ryuichi Yamamoto, Eunwoo Song, Min-Jae Hwang, Jae-Min Kim
This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems.
11 code implementations • 25 Oct 2019 • Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
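Parallel WaveGAN is trained with an adversarial loss plus a multi-resolution STFT auxiliary loss. The numpy sketch below illustrates the two standard terms of that auxiliary loss, spectral convergence and log-magnitude L1, averaged over several analysis resolutions; the resolution set and windowing here are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def stft_mag(x, fft_size, hop):
    """Magnitude STFT with a Hann window over non-padded frames."""
    window = np.hanning(fft_size)
    frames = [x[i:i + fft_size] * window
              for i in range(0, len(x) - fft_size + 1, hop)]
    # Small floor avoids log(0) in the magnitude loss below.
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1)) + 1e-7

def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((1024, 256), (512, 128), (256, 64))):
    """Spectral-convergence + log-magnitude L1 loss, averaged over
    several (fft_size, hop) settings."""
    total = 0.0
    for fft_size, hop in resolutions:
        s_hat = stft_mag(y_hat, fft_size, hop)
        s = stft_mag(y, fft_size, hop)
        sc = np.linalg.norm(s - s_hat) / np.linalg.norm(s)
        mag = np.mean(np.abs(np.log(s) - np.log(s_hat)))
        total += sc + mag
    return total / len(resolutions)
```

Combining multiple resolutions prevents the generator from overfitting to a single time-frequency trade-off, which helps it capture both transient and harmonic structure.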
1 code implementation • 21 May 2019 • Ohsung Kwon, Eunwoo Song, Jae-Min Kim, Hong-Goo Kang
In this paper, we propose a high-quality generative text-to-speech (TTS) system using an effective spectrum and excitation estimation method.
1 code implementation • 9 Apr 2019 • Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
As this process encourages the student to model the distribution of realistic speech waveforms, the perceptual quality of the synthesized speech becomes much more natural.