Search Results for author: Hiroshi Saruwatari

Found 42 papers, 8 papers with code

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

no code implementations • 10 Apr 2017 • Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters.

speech-recognition Speech Recognition +2

Paper
Add Code

Sampling-based speech parameter generation using moment-matching networks

no code implementations • 12 Apr 2017 • Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models that make it possible to randomly sample speech parameters.

Speech Synthesis

Paper
Add Code

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

4 code implementations • 23 Sep 2017 • Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.

Speech Synthesis Voice Conversion

514

Paper
Code

JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

1 code implementation • 28 Oct 2017 • Ryosuke Sonobe, Shinnosuke Takamichi, Hiroshi Saruwatari

Thanks to improvements in machine learning techniques including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies has an important role.

BIG-bench Machine Learning Speech Synthesis

Paper
Code

CPJD Corpus: Crowdsourced Parallel Speech Corpus of Japanese Dialects

no code implementations • LREC 2018 • Shinnosuke Takamichi, Hiroshi Saruwatari

Machine Translation Speech Recognition +1

Paper
Add Code

Phase reconstruction from amplitude spectrograms based on von-Mises-distribution deep neural network

2 code implementations • 10 Jul 2018 • Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Daichi Kitamura, Hiroshi Saruwatari

This paper presents a deep neural network (DNN)-based phase reconstruction from amplitude spectrograms.

Sound Audio and Speech Processing

Paper
Code

Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking

no code implementations • 9 Feb 2019 • Hiroki Tamaru, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

To address this problem, we use a GMMN to model the variation of the modulation spectrum of the pitch contour of natural singing voices and add a randomized inter-utterance variation to the pitch contour generated by conventional DNN-based singing voice synthesis.

Singing Voice Synthesis

Paper
Add Code

DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis

no code implementations • 19 Jul 2019 • Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Although conventional DNN-based speaker embedding such as a $d$-vector can be applied to multi-speaker modeling in speech synthesis, it does not correlate with the subjective inter-speaker similarity and is not necessarily appropriate speaker representation for open speakers whose speech utterances are not included in the training data.

Speech Synthesis

Paper
Add Code

V2S attack: building DNN-based voice conversion from automatic speaker verification

no code implementations • 5 Aug 2019 • Taiki Nakamura, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Hiroshi Saruwatari

The experimental evaluation compares converted voices between the proposed method that does not use the targeted speaker's voice data and the standard VC that uses the data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

HumanGAN: generative adversarial network with human-based discriminator and its evaluation in speech perception modeling

no code implementations • 25 Sep 2019 • Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari

To model the human-acceptable distribution, we formulate a backpropagation-based generator training algorithm by regarding human perception as a black-boxed discriminator.

Generative Adversarial Network

Paper
Add Code

Time-Domain Audio Source Separation Based on Wave-U-Net Combined with Discrete Wavelet Transform

1 code implementation • 28 Jan 2020 • Tomohiko Nakamura, Hiroshi Saruwatari

With this belief, focusing on the fact that the DWT has an anti-aliasing filter and the perfect reconstruction property, we design the proposed layers.

Audio Source Separation Music Source Separation

Paper
Code

Utterance-level Sequential Modeling For Deep Gaussian Process Based Speech Synthesis Using Simple Recurrent Unit

no code implementations • 22 Apr 2020 • Tomoki Koriyama, Hiroshi Saruwatari

This paper presents a deep Gaussian process (DGP) model with a recurrent architecture for speech sequence modeling.

Speech Synthesis

Paper
Add Code

DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus

no code implementations • LREC 2020 • Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari

In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis.

Speech Synthesis

Paper
Add Code

SMASH Corpus: A Spontaneous Speech Corpus Recording Third-person Audio Commentaries on Gameplay

no code implementations • LREC 2020 • Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Developing a spontaneous speech corpus would be beneficial for spoken language processing and understanding.

Paper
Add Code

Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes

no code implementations • 7 Aug 2020 • Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari

We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian kernel regressions and thus robust to overfitting.

Gaussian Processes Speech Synthesis +1

Paper
Add Code

HumanACGAN: conditional generative adversarial network with human-based auxiliary classifier and its evaluation in phoneme perception

no code implementations • 8 Feb 2021 • Yota Ueda, Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari

A DNN-based generator is trained using a human-based discriminator, i. e., humans' perceptual evaluations, instead of the GAN's DNN-based discriminator.

Generative Adversarial Network

Paper
Add Code

Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method

1 code implementation • 10 May 2021 • Koichi Saito, Tomohiko Nakamura, Kohei Yatabe, Yuma Koizumi, Hiroshi Saruwatari

Audio source separation is often used as preprocessing of various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the varieties of audio signals.

Audio Source Separation Music Source Separation

Paper
Code

Binaural rendering from microphone array signals of arbitrary geometry

no code implementations • 15 Sep 2021 • Naoto Iijima, Shoichi Koyama, Hiroshi Saruwatari

To reproduce binaural signals from microphone array recordings at a remote location, a spherical microphone array is generally used for capturing a soundfield.

Position

Paper
Add Code

Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network

no code implementations • 22 Sep 2021 • Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

Although this method achieves comparable speech quality to that of a method that waits for the future context, it entails a huge amount of processing for sampling from the language model at each time step.

Knowledge Distillation Language Modelling +2

Paper
Add Code

Differentiable Digital Signal Processing Mixture Model for Synthesis Parameter Extraction from Mixture of Harmonic Sounds

no code implementations • 1 Feb 2022 • Masaya Kawamura, Tomohiko Nakamura, Daichi Kitamura, Hiroshi Saruwatari, Yu Takahashi, Kazunobu Kondo

A differentiable digital signal processing (DDSP) autoencoder is a musical sound synthesizer that combines a deep neural network (DNN) and spectral modeling synthesis.

Audio Source Separation

Paper
Add Code

Spatial active noise control based on individual kernel interpolation of primary and secondary sound fields

no code implementations • 10 Feb 2022 • Kazuyuki Arikawa, Shoichi Koyama, Hiroshi Saruwatari

A spatial active noise control (ANC) method based on the individual kernel interpolation of primary and secondary sound fields is proposed.

Paper
Add Code

STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent

no code implementations • 28 Mar 2022 • Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

We describe our methodology to construct an empathetic dialogue speech corpus and report the analysis results of the STUDIES corpus.

Paper
Add Code

Region-to-region kernel interpolation of acoustic transfer function with directional weighting

no code implementations • 5 May 2022 • Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari

A method of interpolating the acoustic transfer function (ATF) between regions that takes into account both the physical properties of the ATF and the directionality of region configurations is proposed.

Hyperparameter Optimization

Paper
Add Code

Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

no code implementations • 16 Jun 2022 • Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained with large speech corpora, 2) a style-guided training using a prosody embedding of the current utterance to be predicted by the dialogue context embedding, 3) a cross-modal attention to combine text and speech modalities, and 4) a sentence-wise embedding to achieve fine-grained prosody modeling rather than utterance-wise modeling.

Self-Supervised Learning Sentence +2

Paper
Add Code

Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

no code implementations • 21 Jun 2022 • Kenta Udagawa, Yuki Saito, Hiroshi Saruwatari

With a conventional speaker-adaptation method, a target speaker's embedding vector is extracted from his/her reference speech using a speaker encoder trained on a speaker-discriminative task.

Paper
Add Code

Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech

no code implementations • 26 Sep 2022 • Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari

A conventional generative adversarial network (GAN)-based training algorithm significantly improves the quality of synthetic speech by reducing the statistical difference between natural and synthetic speech.

Generative Adversarial Network

Paper
Add Code

Hyperbolic Timbre Embedding for Musical Instrument Sound Synthesis Based on Variational Autoencoders

no code implementations • 27 Sep 2022 • Futa Nakashima, Tomohiko Nakamura, Norihiro Takamune, Satoru Fukayama, Hiroshi Saruwatari

In this paper, we propose a musical instrument sound synthesis (MISS) method based on a variational autoencoder (VAE) that has a hierarchy-inducing latent space for timbre.

Paper
Add Code

Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

1 code implementation • 14 Oct 2022 • Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge.

Speech Synthesis Voice Cloning

Paper
Code

JaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus

1 code implementation • 29 Nov 2022 • Tomohiko Nakamura, Shinnosuke Takamichi, Naoko Tanji, Satoru Fukayama, Hiroshi Saruwatari

These songs were arranged from out-of-copyright Japanese children's songs and have six voice parts (lead vocal, soprano, alto, tenor, bass, and vocal percussion).

Ranked #1 on Vocal ensemble separation on jaCappella

Vocal ensemble separation

Paper
Code

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

1 code implementation • 30 Jan 2023 • Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data.

Language Modelling

Paper
Code

Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

no code implementations • 27 Feb 2023 • Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

We also leverage duration-aware pause insertion for more natural multi-speaker TTS.

Language Modelling

Paper
Add Code

Kernel interpolation of acoustic transfer functions with adaptive kernel for directed and residual reverberations

no code implementations • 7 Mar 2023 • Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari

An interpolation method for region-to-region acoustic transfer functions (ATFs) based on kernel ridge regression with an adaptive kernel is proposed.

regression

Paper
Add Code

Spatial Active Noise Control Method Based On Sound Field Interpolation From Reference Microphone Signals

no code implementations • 28 Mar 2023 • Kazuyuki Arikawa, Shoichi Koyama, Hiroshi Saruwatari

A spatial active noise control (ANC) method based on the interpolation of a sound field from reference microphone signals is proposed.

Paper
Add Code

Kernel-interpolation-based spatial active noise control with exterior radiation suppression

no code implementations • 29 Mar 2023 • Kazuyuki Arikawa, Shoichi Koyama, Hiroshi Saruwatari

A spatial active noise control (ANC) method based on kernel interpolation of a sound field with exterior radiation suppression is proposed.

Paper
Add Code

ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings

no code implementations • 23 May 2023 • Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari

We focus on ChatGPT's reading comprehension and introduce it to EDSS, a task of synthesizing speech that can empathize with the interlocutor's emotion.

Chatbot Reading Comprehension +2

Paper
Add Code

CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center

no code implementations • 23 May 2023 • Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

We present CALLS, a Japanese speech corpus that considers phone calls in a customer center as a new domain of empathetic spoken dialogue.

Speech Synthesis

Paper
Add Code

How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

no code implementations • 1 Jun 2023 • Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari

We examine the speech modeling potential of generative spoken language modeling (GSLM), which involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis.

Language Modelling Resynthesis

Paper
Add Code

Multichannel Active Noise Control with Exterior Radiation Suppression Based on Riemannian Optimization

no code implementations • 15 Jun 2023 • Takaaki Kojima, Kazuyuki Arikawa, Shoichi Koyama, Hiroshi Saruwatari

A multichannel active noise control (ANC) method with exterior radiation suppression is proposed.

Riemannian optimization

Paper
Add Code

Perceptual Quality Enhancement of Sound Field Synthesis Based on Combination of Pressure and Amplitude Matching

no code implementations • 26 Jul 2023 • Keisuke Kimura, Shoichi Koyama, Hiroshi Saruwatari

A sound field synthesis method enhancing perceptual quality is proposed.

Paper
Add Code

Do learned speech symbols follow Zipf's law?

no code implementations • 18 Sep 2023 • Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari

In this study, we investigate whether speech symbols, learned through deep learning, follow Zipf's law, akin to natural language symbols.

Paper
Add Code

Localizing Acoustic Energy in Sound Field Synthesis by Directionally Weighted Exterior Radiation Suppression

no code implementations • 11 Jan 2024 • Yoshihide Tomita, Shoichi Koyama, Hiroshi Saruwatari

A method for synthesizing the desired sound field while suppressing the exterior radiation power with directional weighting is proposed.

Paper
Add Code

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

no code implementations • 4 Apr 2024 • Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.

Language Modelling Speech Synthesis +1

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.