no code implementations • 10 Mar 2024 • Yusuke Yasuda, Tomoki Toda
A preference-based subjective evaluation is a key method for evaluating generative media reliably.
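Pairwise preferences are commonly aggregated into scalar quality scores with the Bradley–Terry model; the sketch below is a generic illustration of that idea (not this paper's specific method), using a simple MM-style update:

```python
def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times system i was preferred over system j.
    Returns strengths normalized to sum to len(wins).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            if denom > 0:
                p[i] = total_wins / denom
        # renormalize so strengths stay on a fixed scale
        s = sum(p)
        p = [x * n / s for x in p]
    return p
```

For example, if system A is preferred over system B in 3 of 4 trials, A receives the larger estimated strength.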
no code implementations • 29 Aug 2023 • Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda
We propose a training framework for speech quality assessment (SQA) models that can be trained with only preference scores derived from pairs of mean opinion scores (MOS) to improve ranking prediction.
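Training from preference labels derived from MOS pairs can be illustrated with a standard margin ranking loss; this is a minimal sketch under generic assumptions, not the paper's actual loss or model, and the names below are hypothetical:

```python
def mos_pair_to_preference(mos_a, mos_b):
    """Derive a preference label from a pair of mean opinion scores:
    +1 if sample A was rated higher than sample B, else -1."""
    return 1 if mos_a > mos_b else -1

def margin_ranking_loss(pred_a, pred_b, preference, margin=0.1):
    """Standard margin ranking loss on two predicted quality scores:
    zero when the predictions agree with the human preference by at
    least `margin`, positive otherwise."""
    return max(0.0, -preference * (pred_a - pred_b) + margin)
```

Because only the sign of the MOS difference is used, such a loss trains the model to rank samples correctly rather than to regress absolute MOS values.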
no code implementations • 16 Dec 2022 • Yusuke Yasuda, Tomoki Toda
We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and a variational autoencoder (VAE).
no code implementations • 16 Dec 2022 • Yusuke Yasuda, Tomoki Toda
To tackle the challenge of rendering correct pitch accent in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain for TTS.
1 code implementation • 15 Oct 2021 • Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.
no code implementations • 10 Nov 2020 • Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi
We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis.
no code implementations • 19 Oct 2020 • Yusuke Yasuda, Xin Wang, Junichi Yamagishi
Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS).
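Explicit duration modeling is often realized with a "length regulator" that repeats each phoneme-level encoding for its predicted number of frames; the following is an illustrative sketch of that general mechanism, not this paper's implementation:

```python
def length_regulate(encodings, durations):
    """Expand phoneme-level encodings to frame level by repeating each
    encoding `durations[i]` times (durations are integer frame counts)."""
    expanded = []
    for enc, dur in zip(encodings, durations):
        expanded.extend([enc] * dur)
    return expanded
```

Because the alignment is given explicitly by the durations, the decoder never has to search for it at synthesis time, which is what makes duration-based alignment robust and efficient.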
no code implementations • 20 May 2020 • Yusuke Yasuda, Xin Wang, Junichi Yamagishi
Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.
1 code implementation • 4 May 2020 • Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi
This is followed by an analysis on synthesis quality, speaker and dialect similarity, and a remark on the effectiveness of our speaker augmentation approach.
no code implementations • 28 Oct 2019 • Yusuke Yasuda, Xin Wang, Junichi Yamagishi
Sequence-to-sequence text-to-speech (TTS) is dominated by soft-attention-based methods.
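Soft attention computes a normalized weight over all encoder positions at every decoder step, rather than a hard one-to-one alignment; a minimal dot-product sketch of this general mechanism (illustrative only):

```python
import math

def soft_attention_weights(query, keys):
    """Return softmax-normalized alignment weights of one decoder
    query vector against a sequence of encoder key vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The weights always sum to one and spread probability mass across positions, which is precisely why soft attention can produce skipped or repeated words when the alignment drifts.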
3 code implementations • 23 Oct 2019 • Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers.
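Speaker similarity between a synthesized utterance's embedding and a reference speaker embedding is commonly measured with cosine similarity; this is a generic sketch of that metric, not the paper's evaluation code:

```python
import math

def cosine_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors:
    1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)
```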
no code implementations • 30 Aug 2019 • Yusuke Yasuda, Xin Wang, Junichi Yamagishi
The advantages of our approach are that it simplifies many modules required for soft attention and that the end-to-end TTS model can be trained with a single likelihood function.
1 code implementation • 29 Oct 2018 • Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents and compare their audio quality with classical pipeline systems under various conditions to show their pros and cons.