1 code implementation • 10 Apr 2023 • Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community.
1 code implementation • 22 Jan 2023 • Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji Watanabe, Wassim El-Hajj, Ahmed Ali
Several well-established, high-resource text-to-speech (TTS) systems currently produce natural, human-like speech.
Automatic Speech Recognition (ASR) +2
1 code implementation • 20 Sep 2022 • Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe
In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks.
no code implementations • 10 Jul 2022 • Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Tomoki Toda
We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC).
no code implementations • 17 Feb 2022 • Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi
In each iteration, the activity of one event is estimated and used to condition the next output via the probabilistic chain rule, forming classifier chains.
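A minimal sketch (not the authors' implementation) of the classifier-chain idea described above: each event's predicted activity is appended to the conditioning input of the subsequent classifier, mirroring the chain-rule factorization p(y_1,...,y_K | x) = prod_k p(y_k | x, y_<k). The classifier interface here is an assumption for illustration.

```python
# Iterative classifier-chain decoding for multi-label sound event detection.
# At each step, the activity predicted for one event class is appended to
# the conditioning input of the next class's classifier.
import numpy as np

def classifier_chain_predict(x, classifiers):
    """x: (D,) feature vector; classifiers: list of K callables, where the
    k-th callable takes x concatenated with the k-1 previous activities."""
    activities = []
    for clf in classifiers:                  # one binary classifier per class
        cond = np.concatenate([x, np.asarray(activities)])
        p_k = clf(cond)                      # p(y_k = 1 | x, y_<k)
        activities.append(float(p_k > 0.5))  # threshold and feed forward
    return activities
```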
no code implementations • 17 Dec 2021 • Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, converting the paradigm of speech separation/enhancement tasks from regression to classification.
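A minimal sketch of the regression-to-classification recasting, under assumed details (the paper's actual tokenizer and model differ): clean-speech frames are quantized with a learned codebook, and the model is trained with cross-entropy to predict discrete symbol indices rather than regressing continuous spectra.

```python
# Casting enhancement targets as discrete symbols via a k-means codebook
# (a stand-in quantizer; the features below are placeholders).
import numpy as np
from sklearn.cluster import KMeans

clean_feats = np.random.randn(10000, 80)          # placeholder log-mel frames
codebook = KMeans(n_clusters=512).fit(clean_feats)
targets = codebook.predict(clean_feats)           # one discrete symbol per frame
# A separation/enhancement network would now be trained with
# cross_entropy(predicted_logits, targets), and waveforms resynthesized
# from the predicted codebook entries.
```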
1 code implementation • 24 Nov 2021 • Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo, Kento Ohtani, Kazuya Takeda
Recent self-supervised models have demonstrated equal or better performance than supervised methods, opening the door for AI systems to learn visual representations from practically unlimited data.
Ranked #1 on Unsupervised Semantic Segmentation on COCO-Stuff-27
1 code implementation • 15 Oct 2021 • Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.
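For context, ESPnet2-TTS exposes a high-level inference API. A short usage example follows; the model tag is illustrative and the available tags may change (requires the `espnet` and `espnet_model_zoo` packages).

```python
# Synthesize speech with a pretrained ESPnet2-TTS model.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")  # example tag
result = tts("Hello, this is a test of ESPnet2-TTS.")
sf.write("out.wav", result["wav"].numpy(), tts.fs)
```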
1 code implementation • 12 Oct 2021 • Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda
In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting.
no code implementations • 20 Jul 2021 • Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda
In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech.
Automatic Speech Recognition (ASR) +2
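As a rough illustration of the recognition-synthesis cascade described in this entry, the pipeline reduces to two stages; `asr_transcribe` and `tts_synthesize` are hypothetical stand-ins for any pretrained ASR model and target-speaker TTS system.

```python
# Recognition-synthesis voice conversion: the ASR stage strips speaker
# identity while keeping linguistic content; the TTS stage re-renders the
# content in the target voice.
def voice_convert(source_wav, asr_transcribe, tts_synthesize):
    text = asr_transcribe(source_wav)
    return tts_synthesize(text)
```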
no code implementations • 11 Jun 2021 • Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda
Our results showed that multi-task learning combining binary classification with metric learning, which considers the distance from each class centroid in the feature space, is effective; moreover, performance can be significantly improved by using even a small amount of anomalous data during training.
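A minimal PyTorch sketch of such a multi-task objective; the loss weighting and centroid handling are simplified assumptions, not the paper's exact formulation.

```python
# Binary normal/anomalous classification plus a metric-learning term that
# pulls embeddings toward their class centroid in the feature space.
import torch
import torch.nn.functional as F

def multitask_loss(emb, logits, labels, centroids, alpha=0.5):
    """emb: (B, D) embeddings; logits: (B,) anomaly scores;
    labels: (B,) 0=normal, 1=anomalous; centroids: (2, D)."""
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    dist = (emb - centroids[labels]).pow(2).sum(dim=1).mean()
    return bce + alpha * dist
```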
no code implementations • 14 Apr 2021 • Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.
1 code implementation • 4 Mar 2021 • Kazuhiro Kobayashi, Wen-Chin Huang, Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Tomoki Toda
In this paper, we present an open-source software for developing a nonparallel voice conversion (VC) system named crank.
no code implementations • 23 Oct 2020 • Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda
Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter.
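A conceptual sketch of this any-to-one training recipe, with hypothetical helper functions standing in for the VQW2V extractor, acoustic feature extractor, and seq2seq model.

```python
# Train a seq2seq mapping from vector-quantized wav2vec (VQW2V) features
# to acoustic features using only the target speaker's recordings.
def train_a2o_vc(target_wavs, extract_vqw2v, extract_acoustic, seq2seq):
    pairs = [(extract_vqw2v(w), extract_acoustic(w)) for w in target_wavs]
    for src, tgt in pairs:
        seq2seq.update(src, tgt)   # e.g. an L1/L2 loss on predicted features
    return seq2seq
```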
2 code implementations • 6 Oct 2020 • Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
Automatic Speech Recognition (ASR) +2
1 code implementation • 7 Aug 2020 • Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
Automatic Speech Recognition (ASR) +2
1 code implementation • 25 Jul 2020 • Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda
To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information into the network by dynamically changing the network architecture in accordance with the auxiliary $F_{0}$ feature.
1 code implementation • 11 Jul 2020 • Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda
In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs).
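A minimal sketch of the pitch-dependent dilation computation behind PDCNNs: the per-frame dilation grows as F0 falls, so the receptive field tracks roughly one pitch cycle. The dense factor `a` and the rounding here are simplified assumptions based on the description above.

```python
# Compute per-sample dilation factors for pitch-dependent dilated convolution.
import numpy as np

def pitch_dependent_dilation(f0, fs=24000, a=4, base_dilation=1):
    """f0: (T,) per-sample F0 values in Hz (voiced samples > 0)."""
    f0 = np.maximum(f0, 1.0)                # guard against division by zero
    pitch_factor = np.round(fs / (f0 * a))  # samples per 1/a of a pitch cycle
    return (base_dilation * pitch_factor).astype(int)
```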
1 code implementation • 18 May 2020 • Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda
In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG.
Audio and Speech Processing • Sound
no code implementations • 12 May 2020 • Tomoki Hayashi, Shinji Watanabe
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
Automatic Speech Recognition (ASR) +5
1 code implementation • ACL 2020 • Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, Shinji Watanabe
We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework.
Automatic Speech Recognition (ASR) +4
no code implementations • 3 Feb 2020 • Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda, Shinji Watanabe
The proposed method is publicly available.
1 code implementation • 14 Dec 2019 • Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda
We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining.
3 code implementations • 24 Oct 2019 • Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan
Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.
Automatic Speech Recognition (ASR) +1
1 code implementation • 13 Sep 2019 • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).
Ranked #9 on Speech Recognition on AISHELL-1
Automatic Speech Recognition (ASR) +3
2 code implementations • 24 Jul 2019 • Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda
In this work, to overcome this problem, we propose a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system, yielding cyclic reconstructed spectra that can be directly optimized.
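A conceptual sketch of the CycleVAE recycling flow; the model calls are hypothetical stand-ins and the KL regularization terms of the VAE objective are omitted for brevity.

```python
# Converted features are passed back through the encoder/decoder so that a
# cyclic reconstruction loss on the source spectra can be optimized directly.
def cyclevae_losses(x_src, spk_src, spk_tgt, encode, decode, recon_loss):
    z = encode(x_src)
    x_rec = decode(z, spk_src)       # ordinary reconstruction
    x_conv = decode(z, spk_tgt)      # conversion branch (no direct target)
    z_cyc = encode(x_conv)           # recycle the converted features
    x_cyc = decode(z_cyc, spk_src)   # cyclic reconstruction of the source
    return recon_loss(x_rec, x_src) + recon_loss(x_cyc, x_src)
```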
1 code implementation • 21 Jul 2019 • Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda
However, because of its fixed dilated convolutions and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality.
Audio and Speech Processing • Sound
1 code implementation • 1 Jul 2019 • Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda
In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder.
1 code implementation • 2 May 2019 • Wen-Chin Huang, Yi-Chiao Wu, Chen-Chou Lo, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang
This hypothesis implies that, during the conversion phase, the latent codes and the converted features in VAE-based VC are in fact source-F0 dependent.
no code implementations • 27 Nov 2018 • Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang
Conventional WaveNet vocoders are trained with natural acoustic features but conditioned on the converted features in the conversion stage for VC, and such a mismatch often causes significant quality and similarity degradation.
no code implementations • 2 Nov 2018 • Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux
To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal.
Automatic Speech Recognition (ASR) +2
no code implementations • 28 Jul 2018 • Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda
In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals.
Automatic Speech Recognition (ASR) +4
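A conceptual sketch of the text-only augmentation idea from the entry above; all helper names (`text_to_encoder`, the ASR loss methods) are hypothetical stand-ins rather than the paper's actual interfaces.

```python
# Augment attention-based E2E-ASR training with unpaired text: a
# text-to-encoder model synthesizes encoder-like feature sequences from
# text-only data, and these synthetic pairs are mixed with genuine
# speech-text pairs in each training step.
def training_step(asr, text_to_encoder, speech_batch, text_only_batch):
    loss_real = asr.loss_from_speech(*speech_batch)      # (speech, text) pair
    h_syn = text_to_encoder(text_only_batch)             # synthetic states
    loss_syn = asr.loss_from_encoder_states(h_syn, text_only_batch)
    return loss_real + loss_syn
```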
no code implementations • 22 Apr 2018 • Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda
This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model.
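A minimal PyTorch sketch of the multi-head decoder idea: instead of concatenating attention heads into a single decoder, each head feeds its own decoder network and the per-head predictions are integrated. The attention modules are assumed callables, and averaging the logits is an assumed integration scheme for illustration.

```python
import torch
import torch.nn as nn

class MultiHeadDecoder(nn.Module):
    def __init__(self, heads, dec_dim, vocab):
        super().__init__()
        self.attentions = nn.ModuleList(heads)  # one attention per head
        self.decoders = nn.ModuleList(
            [nn.GRUCell(dec_dim, dec_dim) for _ in heads])
        self.outs = nn.ModuleList(
            [nn.Linear(dec_dim, vocab) for _ in heads])

    def step(self, enc, states):
        logits, new_states = [], []
        for att, dec, out, s in zip(self.attentions, self.decoders,
                                    self.outs, states):
            context = att(enc, s)   # head-specific attention over encoder
            s = dec(context, s)     # head-specific decoder state update
            new_states.append(s)
            logits.append(out(s))
        return torch.stack(logits).mean(0), new_states  # integrate heads
```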
no code implementations • 30 Mar 2018 • Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai
This paper introduces a new open source platform for end-to-end speech processing named ESPnet.
Automatic Speech Recognition (ASR) +1