no code implementations • 8 Oct 2023 • Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda
This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023.
no code implementations • 15 Sep 2023 • Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions.
2 code implementations • 28 Oct 2022 • Ryuichi Yamamoto, Reo Yoneyama, Tomoki Toda
This paper describes the design of NNSVS, an open-source software toolkit for neural network-based singing voice synthesis research.
1 code implementation • 28 Oct 2022 • Reo Yoneyama, Ryuichi Yamamoto, Kentaro Tachibana
Neural audio super-resolution models are typically trained on low- and high-resolution audio signal pairs.
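As a rough illustration of the conventional paired setup this sentence refers to (not necessarily this paper's own method), here is a minimal sketch that fabricates a training pair by band-limiting a high-resolution signal; the sample rates and the scipy-based resampling are assumptions chosen for the example:

```python
import numpy as np
from scipy.signal import resample_poly

def make_lr_hr_pair(wav_hr: np.ndarray, sr_hr: int = 48000, sr_lr: int = 16000):
    """Simulate a low-resolution input by band-limiting the high-resolution target."""
    down = resample_poly(wav_hr, sr_lr, sr_hr)   # 48 kHz -> 16 kHz (anti-aliased)
    wav_lr = resample_poly(down, sr_hr, sr_lr)   # back to 48 kHz, high band removed
    n = min(len(wav_lr), len(wav_hr))
    return wav_lr[:n], wav_hr[:n]

# Example: one second of a 48 kHz test tone
t = np.arange(48000) / 48000
wav_lr, wav_hr = make_lr_hr_pair(np.sin(2 * np.pi * 440 * t))
```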
1 code implementation • 28 Oct 2022 • Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform.
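A minimal sketch of the inverse-STFT generation idea: the network predicts magnitude and phase frames, and a single iSTFT turns them into a waveform instead of a deep stack of upsampling convolutions. The tiny FFT size and shapes below are illustrative only, and the actual model additionally splits the signal into sub-bands that a synthesis filterbank recombines:

```python
import torch

def istft_head(mag: torch.Tensor, phase: torch.Tensor,
               n_fft: int = 16, hop: int = 4) -> torch.Tensor:
    """Convert predicted magnitude/phase frames into a waveform with one iSTFT."""
    spec = torch.polar(mag, phase)  # complex spectrogram from magnitude and phase
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))

# Example: 100 frames of (n_fft/2 + 1) bins for one sub-band signal
mag = torch.rand(1, 9, 100)
phase = torch.rand(1, 9, 100) * 2 * torch.pi
wav = istft_head(mag, phase)
```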
no code implementations • 28 Oct 2022 • Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
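A minimal numpy sketch of how a sample-level sinusoidal source can be derived from a frame-level F0 contour, by upsampling F0 to the sample rate and integrating it into a phase; the hop size and sample rate are assumptions for the example:

```python
import numpy as np

def sinusoidal_source(f0: np.ndarray, hop: int = 240, sr: int = 24000) -> np.ndarray:
    """Upsample frame-level F0 to sample level and integrate it into a phase,
    yielding a sinusoid that follows the pitch contour (zero where unvoiced)."""
    f0_up = np.repeat(f0, hop)                 # frame -> sample level
    phase = 2 * np.pi * np.cumsum(f0_up) / sr  # instantaneous phase
    return np.where(f0_up > 0, np.sin(phase), 0.0)

# Example: 100 frames gliding from 120 Hz to 240 Hz
src = sinusoidal_source(np.linspace(120, 240, 100))
```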
no code implementations • 30 Jun 2022 • Eunwoo Song, Ryuichi Yamamoto, Ohsung Kwon, Chan-Ho Song, Min-Jae Hwang, Suhyeon Oh, Hyun-Wook Yoon, Jin-Seob Kim, Jae-Min Kim
In the proposed method, we first adopt a variational autoencoder whose posterior distribution is utilized to extract latent features representing acoustic similarity between the recorded and synthetic corpora.
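A toy sketch of the general mechanism described here: a VAE encoder maps an acoustic representation to a Gaussian posterior, and the posterior (e.g., its mean) serves as a latent similarity feature. All dimensions and the architecture below are invented for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LatentExtractor(nn.Module):
    """Toy VAE encoder: maps an utterance-level acoustic embedding to a
    Gaussian posterior whose samples serve as similarity features."""
    def __init__(self, in_dim: int = 80, z_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, z_dim)
        self.logvar = nn.Linear(64, z_dim)

    def forward(self, x: torch.Tensor):
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar

# Recorded and synthetic utterances can then be compared in the latent space,
# e.g. by the distance between their posterior means.
z, mu, _ = LatentExtractor()(torch.randn(4, 80))
```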
no code implementations • 21 Apr 2022 • Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana
Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models, even when only 1,000 utterances of the target speaker's neutral data are available.
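A minimal sketch of pitch-shift augmentation using librosa as an off-the-shelf stand-in; the paper's own pipeline and shift range may differ, so the semitone steps below are assumptions:

```python
import numpy as np
import librosa

def augment_pitch(wav: np.ndarray, sr: int = 24000,
                  steps: tuple = (-2, -1, 1, 2)) -> list:
    """Create pitch-shifted copies of an utterance to widen pitch coverage."""
    return [librosa.effects.pitch_shift(wav, sr=sr, n_steps=s) for s in steps]

# Example: four shifted variants of one second of audio
variants = augment_pitch(np.random.randn(24000).astype(np.float32))
```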
1 code implementation • 15 Oct 2021 • Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.
1 code implementation • 26 Apr 2021 • Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana
We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model, a.k.a. BERT, and explicit features extracted from a BiLSTM with linguistic features.
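A toy sketch of this fusion idea: per-token BERT embeddings (implicit features) are concatenated with BiLSTM-encoded linguistic features (explicit features) and fed to a break/no-break classifier. Dimensions and the classifier head are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BreakPredictor(nn.Module):
    """Concatenate BERT token embeddings with BiLSTM-encoded linguistic
    features, then classify each token as break / no-break."""
    def __init__(self, bert_dim: int = 768, ling_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(ling_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(bert_dim + 2 * hidden, 2)

    def forward(self, bert_feats: torch.Tensor, ling_feats: torch.Tensor):
        explicit, _ = self.bilstm(ling_feats)
        return self.out(torch.cat([bert_feats, explicit], dim=-1))

# Example: batch of 2 sentences, 10 tokens each
logits = BreakPredictor()(torch.randn(2, 10, 768), torch.randn(2, 10, 32))
```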
no code implementations • 27 Oct 2020 • Ryuichi Yamamoto, Eunwoo Song, Min-Jae Hwang, Jae-Min Kim
This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems.
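A toy sketch of the routing idea, assuming a sample-level voiced/unvoiced mask is available: separate discriminators judge the voiced and unvoiced regions of a waveform. The actual paper's discriminator architectures and conditioning differ in detail:

```python
import torch
import torch.nn as nn

class VoicingAwareD(nn.Module):
    """Two tiny 1-D conv discriminators, one for voiced and one for
    unvoiced samples, selected by a sample-level V/UV mask."""
    def __init__(self):
        super().__init__()
        def d():
            return nn.Sequential(nn.Conv1d(1, 16, 15, padding=7), nn.LeakyReLU(0.2),
                                 nn.Conv1d(16, 1, 15, padding=7))
        self.d_voiced, self.d_unvoiced = d(), d()

    def forward(self, wav: torch.Tensor, vuv: torch.Tensor):
        return (self.d_voiced(wav * vuv),          # scores on voiced samples
                self.d_unvoiced(wav * (1 - vuv)))  # scores on unvoiced samples

wav = torch.randn(1, 1, 2400)
vuv = (torch.rand(1, 1, 2400) > 0.5).float()
sv, su = VoicingAwareD()(wav, vuv)
```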
1 code implementation • 31 Jan 2020 • Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, Hong-Goo Kang
In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN).
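A toy sketch of the LP-structured output idea: the network predicts a Gaussian mixture over the excitation, a sample is drawn from it, and the linear-prediction term is added back. This glosses over LPCNet's actual signal details; the shapes and mixture size are assumptions:

```python
import torch
import torch.nn as nn

class ResidualMDN(nn.Module):
    """Toy LP-structured MDN head: predict a Gaussian mixture over the
    excitation e_t; the output sample is x_t = lp_pred + e_t."""
    def __init__(self, hidden: int = 64, n_mix: int = 4):
        super().__init__()
        self.proj = nn.Linear(hidden, 3 * n_mix)  # weight, mean, log-scale per mixture

    def forward(self, h: torch.Tensor, lp_pred: torch.Tensor):
        w, mu, log_s = self.proj(h).chunk(3, dim=-1)
        k = torch.distributions.Categorical(logits=w).sample()
        mu_k = mu.gather(-1, k.unsqueeze(-1)).squeeze(-1)
        s_k = log_s.gather(-1, k.unsqueeze(-1)).squeeze(-1).exp()
        e = mu_k + s_k * torch.randn_like(mu_k)   # sample the excitation
        return lp_pred + e                        # add the linear-prediction part

x = ResidualMDN()(torch.randn(8, 64), torch.zeros(8))
```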
12 code implementations • 25 Oct 2019 • Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
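Parallel WaveGAN trains the generator with a multi-resolution STFT auxiliary loss alongside the adversarial loss. Below is a compact sketch of that auxiliary loss (spectral convergence plus log-magnitude distance, averaged over several analysis resolutions); the three (FFT, hop, window) settings follow the commonly cited configuration, but treat them as illustrative here:

```python
import torch

def stft_loss(x, y, fft, hop, win):
    """Spectral convergence + log-magnitude distance at one resolution."""
    w = torch.hann_window(win)
    X = torch.stft(x, fft, hop, win, window=w, return_complex=True).abs()
    Y = torch.stft(y, fft, hop, win, window=w, return_complex=True).abs()
    sc = torch.norm(Y - X, p="fro") / torch.norm(Y, p="fro")
    mag = torch.nn.functional.l1_loss(torch.log(X + 1e-7), torch.log(Y + 1e-7))
    return sc + mag

def multi_resolution_stft_loss(x, y):
    """Average the STFT loss over several analysis resolutions."""
    cfgs = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]
    return sum(stft_loss(x, y, *c) for c in cfgs) / len(cfgs)

loss = multi_resolution_stft_loss(torch.randn(4, 8000), torch.randn(4, 8000))
```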
3 code implementations • 24 Oct 2019 • Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan
Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.
1 code implementation • 13 Sep 2019 • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).
Ranked #11 on Speech Recognition on AISHELL-1
1 code implementation • 9 Apr 2019 • Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
As this process encourages the student to model the distribution of realistic speech waveforms, the perceptual quality of the synthesized speech becomes much more natural.
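A toy sketch of how such a combined objective can look: the student's distillation terms are summed with a least-squares adversarial term that pushes its samples toward realistic waveforms. The term names and the weight are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def student_loss(d_fake: torch.Tensor, kl: torch.Tensor,
                 frame_loss: torch.Tensor, lambda_adv: float = 4.0) -> torch.Tensor:
    """Combine distillation terms (kl, frame_loss) with a least-squares
    GAN generator term computed from discriminator scores on student output."""
    adv = torch.mean((1.0 - d_fake) ** 2)
    return kl + frame_loss + lambda_adv * adv

loss = student_loss(torch.rand(4), torch.tensor(0.3), torch.tensor(0.1))
```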