Search Results for author: Tomoki Toda

Found 54 papers, 20 papers with code

HASA-net: A non-intrusive hearing-aid speech assessment network

no code implementations10 Nov 2021 Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, Yu Tsao

Without the need of a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations.

Speech Quality

LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

1 code implementation18 Oct 2021 Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda

An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores.

Voice Conversion

Towards Identity Preserving Normal to Dysarthric Voice Conversion

no code implementations15 Oct 2021 Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda

We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity.

Data Augmentation Decision Making +2

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

1 code implementation12 Oct 2021 Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting.

Voice Conversion

Generalization Ability of MOS Prediction Networks

1 code implementation6 Oct 2021 Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test.

Fine-tuning

Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion

no code implementations8 Sep 2021 Yi-Syuan Liou, Wen-Chin Huang, Ming-Chi Yen, Shu-Wei Tsai, Yu-Huai Peng, Tomoki Toda, Yu Tsao, Hsin-Min Wang

Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device.

Dynamic Time Warping Speech Enhancement +1

On Prosody Modeling for ASR+TTS based Voice Conversion

no code implementations20 Jul 2021 Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech.

automatic-speech-recognition Speech Recognition +1

Anomalous Sound Detection Using a Binary Classification Model and Class Centroids

no code implementations11 Jun 2021 Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

Our results showed that multi-task learning using binary classification and metric learning to consider the distance from each class centroid in the feature space is effective, and performance can be significantly improved by using even a small amount of anomalous data during training.

Classification Metric Learning +1

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

no code implementations2 Jun 2021 Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda

First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into a normal speech of a reference speaker as an intermediate product, and a nonparallel, frame-wise VC model realized with a variational autoencoder then converts the speaker identity of the reference speech back to that of the patient while assumed to be capable of preserving the enhanced quality.

Voice Conversion

High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling

1 code implementation20 May 2021 Patrick Lumban Tobing, Tomoki Toda

This paper presents a novel high-fidelity and low-latency universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP).

Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction

2 code implementations20 May 2021 Patrick Lumban Tobing, Tomoki Toda

To accommodate LLRT constraint with CPU, we propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral features and is built with a sparse network architecture.

Fine-tuning Voice Conversion

Non-autoregressive sequence-to-sequence voice conversion

no code implementations14 Apr 2021 Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.

Voice Conversion

Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN

1 code implementation10 Apr 2021 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

We propose a unified approach to data-driven source-filter modeling using a single neural network for developing a neural vocoder capable of generating high-quality synthetic speech waveforms while retaining flexibility of the source-filter model to control their voice characteristics.

Speech Quality

The AS-NU System for the M2VoC Challenge

no code implementations7 Apr 2021 Cheng-Hung Hu, Yi-Chiao Wu, Wen-Chin Huang, Yu-Huai Peng, Yu-Wen Chen, Pin-Jui Ku, Tomoki Toda, Yu Tsao, Hsin-Min Wang

The first track focuses on using a small number of 100 target utterances for voice cloning, while the second track focuses on using only 5 target utterances for voice cloning.

Speech Recognition by Simply Fine-tuning BERT

no code implementations30 Jan 2021 Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda

We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations.

automatic-speech-recognition Fine-tuning +2

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

no code implementations23 Oct 2020 Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter.

Voice Conversion

Baseline System of Voice Conversion Challenge 2020 with Cyclic Variational Autoencoder and Parallel WaveGAN

1 code implementation9 Oct 2020 Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Toda

In this paper, we present a description of the baseline system of Voice Conversion Challenge (VCC) 2020 with a cyclic variational autoencoder (CycleVAE) and Parallel WaveGAN (PWG), i. e., CycleVAEPWG.

Voice Conversion

Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

1 code implementation25 Jul 2020 Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture corresponding to the auxiliary $F_{0}$ feature.

Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

1 code implementation11 Jul 2020 Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs).

Many-to-Many Voice Transformer Network

no code implementations18 May 2020 Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda

The main idea we propose is an extension of the original VTN that can simultaneously learn mappings among multiple speakers.

Voice Conversion

Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

1 code implementation18 May 2020 Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG.

Audio and Speech Processing Sound

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

no code implementations14 Dec 2019 Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining.

Voice Conversion

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

3 code implementations24 Oct 2019 Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan

Furthermore, the unified design enables the integration of ASR functions with TTS, e. g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.

automatic-speech-recognition Speech Recognition

Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

2 code implementations24 Jul 2019 Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

In this work, to overcome this problem, we propose to use CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized.

Voice Conversion

Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder

1 code implementation21 Jul 2019 Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality.

Audio and Speech Processing Sound

Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation

1 code implementation1 Jul 2019 Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder.

Refined WaveNet Vocoder for Variational Autoencoder Based Voice Conversion

no code implementations27 Nov 2018 Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang

Conventional WaveNet vocoders are trained with natural acoustic features but conditioned on the converted features in the conversion stage for VC, and such a mismatch often causes significant quality and similarity degradation.

Voice Conversion

Back-Translation-Style Data Augmentation for End-to-End ASR

no code implementations28 Jul 2018 Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda

In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals.

automatic-speech-recognition Data Augmentation +4

Frequency domain variants of velvet noise and their application to speech processing and synthesis: with appendices

1 code implementation18 Jun 2018 Hideki Kawahara, Ken-Ichi Sakakibara, Masanori Morise, Hideki Banno, Tomoki Toda, Toshio Irino

One of the proposed variants, FVN (Frequency domain Velvet Noise) applies the procedure to generate a velvet noise on the cyclic frequency domain of DFT (Discrete Fourier Transform).

Sound Audio and Speech Processing Signal Processing

A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment

no code implementations23 Apr 2018 Tomi Kinnunen, Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Zhen-Hua Ling

As a supplement to subjective results for the 2018 Voice Conversion Challenge (VCC'18) data, we configure a standard constant-Q cepstral coefficient CM to quantify the extent of processing artifacts.

Speaker Verification Voice Conversion

Multi-Head Decoder for End-to-End Speech Recognition

no code implementations22 Apr 2018 Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model.

End-To-End Speech Recognition Speech Recognition

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

no code implementations12 Apr 2018 Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhen-Hua Ling

We present the Voice Conversion Challenge 2018, designed as a follow up to the 2016 edition with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems.

Voice Conversion

Cannot find the paper you are looking for? You can Submit a new open access paper.