no code implementations • 12 Apr 2017 • Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari
To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models that make it possible to randomly sample speech parameters.
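A minimal sketch of one way such a sampling-capable acoustic model could be structured, assuming (hypothetically) a feed-forward network that takes frame-level linguistic features concatenated with a Gaussian noise vector, so that different noise draws yield different parameter trajectories for the same text; dimensions are placeholders and the paper's moment-matching-style training objective is not reproduced here:

```python
import torch
import torch.nn as nn

class SamplingAcousticModel(nn.Module):
    """Hypothetical acoustic model conditioned on a noise vector.

    Linguistic features plus Gaussian noise in, speech parameters
    (e.g., mel-cepstra) out; each noise draw gives a different but
    plausible parameter trajectory.
    """

    def __init__(self, ling_dim=300, noise_dim=16, out_dim=60, hidden=256):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(ling_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, ling_feats, noise=None):
        # ling_feats: (n_frames, ling_dim); fresh noise if none is given.
        if noise is None:
            noise = torch.randn(ling_feats.size(0), self.noise_dim)
        return self.net(torch.cat([ling_feats, noise], dim=-1))

model = SamplingAcousticModel()
ling = torch.randn(100, 300)   # stand-in for frame-level linguistic features
sample_a = model(ling)         # two calls with fresh noise give two different
sample_b = model(ling)         # "utterances" of the same text
```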
no code implementations • 9 Feb 2019 • Hiroki Tamaru, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari
To address this problem, we use a GMMN to model the variation of the modulation spectrum of the pitch contour of natural singing voices and add a randomized inter-utterance variation to the pitch contour generated by conventional DNN-based singing voice synthesis.
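The GMMN ingredient here is the maximum mean discrepancy (MMD) criterion, which trains a generator by matching the statistics of generated and natural samples. Below is a generic, hedged PyTorch sketch of a multi-kernel MMD loss between batches of modulation-spectrum vectors; the kernel bandwidths and batch shapes are illustrative assumptions, not the authors' settings:

```python
import torch

def gaussian_kernel(x, y, sigma):
    # x: (n, d), y: (m, d) -> (n, m) Gram matrix of an RBF kernel
    dist2 = torch.cdist(x, y).pow(2)
    return torch.exp(-dist2 / (2.0 * sigma ** 2))

def mmd_loss(generated, natural, sigmas=(1.0, 5.0, 10.0)):
    """Biased estimate of squared MMD with a mixture of RBF kernels."""
    loss = 0.0
    for s in sigmas:
        k_gg = gaussian_kernel(generated, generated, s).mean()
        k_nn = gaussian_kernel(natural, natural, s).mean()
        k_gn = gaussian_kernel(generated, natural, s).mean()
        loss = loss + k_gg + k_nn - 2.0 * k_gn
    return loss

gen = torch.randn(64, 32)   # toy batch of generated modulation-spectrum vectors
nat = torch.randn(64, 32)   # toy batch computed from natural singing voices
print(mmd_loss(gen, nat))
```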
no code implementations • 22 Apr 2020 • Tomoki Koriyama, Hiroshi Saruwatari
This paper presents a deep Gaussian process (DGP) model with a recurrent architecture for speech sequence modeling.
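For intuition only, here is a toy sketch of the "stack of kernel regressions" idea behind a DGP, using exact GP posterior means with made-up latent targets; a real DGP infers its intermediate layers variationally, and the recurrent architecture described above would additionally feed previous-frame latents into each layer's kernel inputs, which this sketch omits:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_layer(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean of GP regression: one 'Bayesian kernel regression' layer."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train)
    return K_s @ np.linalg.solve(K, y_train)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))     # toy frame-level input features
H = rng.normal(size=(50, 4))     # toy latent targets for layer 1 (a real DGP infers these)
Y = rng.normal(size=(50, 2))     # toy output targets for layer 2
X_new = rng.normal(size=(10, 8))

# Layer 1 maps inputs to a latent space; layer 2 maps that latent space to outputs.
h_new = gp_layer(X, H, X_new)
y_new = gp_layer(H, Y, h_new)
```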
no code implementations • LREC 2020 • Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari
In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis.
no code implementations • 7 Aug 2020 • Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari
We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian kernel regressions and is thus robust to overfitting.
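One common way to make such a model multi-speaker, assumed here purely for illustration (the paper's exact speaker representation may differ), is to append a speaker code to every frame's input features before the first regression layer:

```python
import numpy as np

def add_speaker_code(frame_feats, speaker_id, n_speakers):
    """Append a one-hot speaker code to every frame of linguistic features."""
    code = np.zeros(n_speakers)
    code[speaker_id] = 1.0
    return np.hstack([frame_feats, np.tile(code, (len(frame_feats), 1))])

feats = np.random.randn(120, 300)                      # toy frame-level linguistic features
x = add_speaker_code(feats, speaker_id=3, n_speakers=10)
# x has shape (120, 310) and can be fed to the first kernel-regression layer.
```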
no code implementations • 31 Oct 2022 • Koichi Miyazaki, Masato Murata, Tomoki Koriyama
Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks.
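For orientation, the core component shared by Transformer and Conformer encoders is the self-attention block, in which every acoustic frame attends to every other frame. The following is a generic PyTorch sketch with placeholder dimensions, not the ASR system studied in the paper:

```python
import torch
import torch.nn as nn

class SelfAttentionEncoderBlock(nn.Module):
    """One Transformer-style encoder block over a sequence of acoustic frames."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # x: (batch, n_frames, d_model); every frame attends to every other frame.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(attn_out))
        return self.norm2(x + self.drop(self.ff(x)))

frames = torch.randn(2, 200, 256)   # a batch of 2 utterances, 200 frames each
out = SelfAttentionEncoderBlock()(frames)
```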
no code implementations • 27 Feb 2023 • Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
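A hypothetical illustration of duration-aware pause insertion: given pause durations predicted at word boundaries (by some upstream predictor, not shown), a pause token carrying its duration is inserted into the phoneme sequence whenever the prediction exceeds a threshold. Function names, the threshold, and the ARPABET toy input are assumptions for this sketch, not the paper's implementation:

```python
def insert_pauses(phonemes, pause_durations, threshold=0.05):
    """Insert a duration-tagged pause token after word boundaries with long predicted pauses.

    phonemes: list of (phoneme, is_word_final) pairs
    pause_durations: predicted pause length in seconds after each word-final phoneme
    """
    out, k = [], 0
    for ph, is_word_final in phonemes:
        out.append(ph)
        if is_word_final:
            dur = pause_durations[k]
            k += 1
            if dur > threshold:
                out.append(f"<pause:{dur:.2f}>")
    return out

# Example: "hello world" with one long predicted pause between the words.
seq = [("HH", False), ("AH", False), ("L", False), ("OW", True),
       ("W", False), ("ER", False), ("L", False), ("D", True)]
print(insert_pauses(seq, pause_durations=[0.30, 0.00]))
```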
no code implementations • 1 Feb 2024 • Dong Yang, Tomoki Koriyama, Yuki Saito
Developing Text-to-Speech (TTS) systems that can synthesize natural breath is essential for human-like voice agents but requires extensive manual annotation of breath positions in training data.