Search Results for author: Tomoki Toda

Found 73 papers, 26 papers with code

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

no code implementations12 Apr 2018 Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhen-Hua Ling

We present the Voice Conversion Challenge 2018, designed as a follow-up to the 2016 edition with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems.

Voice Conversion

Multi-Head Decoder for End-to-End Speech Recognition

no code implementations22 Apr 2018 Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model.

Speech Recognition

A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment

no code implementations23 Apr 2018 Tomi Kinnunen, Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Zhen-Hua Ling

As a supplement to subjective results for the 2018 Voice Conversion Challenge (VCC'18) data, we configure a standard constant-Q cepstral coefficient countermeasure (CM) to quantify the extent of processing artifacts.

Benchmarking Speaker Verification +1

Frequency domain variants of velvet noise and their application to speech processing and synthesis: with appendices

1 code implementation18 Jun 2018 Hideki Kawahara, Ken-Ichi Sakakibara, Masanori Morise, Hideki Banno, Tomoki Toda, Toshio Irino

One of the proposed variants, FVN (Frequency domain Velvet Noise) applies the procedure to generate a velvet noise on the cyclic frequency domain of DFT (Discrete Fourier Transform).
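Velvet noise itself is a sparse train of ±1 pulses, one per short grid cell; FVN transplants this construction onto the cyclic DFT frequency axis. Below is a minimal NumPy sketch of the classical time-domain construction only (the grid scheme and `density` parameter are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def velvet_noise(n_samples, density, fs=44100, seed=0):
    """Sparse +/-1 pulse train: one pulse per grid cell of fs/density samples."""
    rng = np.random.default_rng(seed)
    grid = int(fs / density)                  # samples per grid cell
    v = np.zeros(n_samples)
    for start in range(0, n_samples, grid):
        cell = min(grid, n_samples - start)
        pos = start + rng.integers(cell)      # random pulse position in the cell
        v[pos] = rng.choice([-1.0, 1.0])      # random polarity
    return v
```

FVN would apply the same sparse-pulse placement along the frequency bins of a DFT frame instead of along time.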

Sound Audio and Speech Processing Signal Processing

Back-Translation-Style Data Augmentation for End-to-End ASR

no code implementations28 Jul 2018 Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda

In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals.

Automatic Speech Recognition (ASR) +4

Refined WaveNet Vocoder for Variational Autoencoder Based Voice Conversion

no code implementations27 Nov 2018 Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang

Conventional WaveNet vocoders are trained with natural acoustic features but conditioned on the converted features in the conversion stage for VC, and such a mismatch often causes significant quality and similarity degradation.

Voice Conversion

Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation

1 code implementation1 Jul 2019 Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder.
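The core of PDCNN is a dilation size that follows the pitch period frame by frame, so the receptive field spans a similar number of pitch cycles regardless of F0. A hedged sketch, assuming the dilation is the pitch period in samples divided by a dense factor (the function and parameter names are hypothetical):

```python
import numpy as np

def pitch_dependent_dilations(f0, fs=24000, dense_factor=4):
    """Hypothetical sketch: per-frame dilation sizes that track the pitch
    period. Unvoiced frames (f0 <= 0) fall back to dilation 1."""
    f0 = np.asarray(f0, dtype=float)
    period = np.where(f0 > 0, fs / np.clip(f0, 1e-6, None), 0.0)  # samples/cycle
    d = np.round(period / dense_factor).astype(int)
    return np.maximum(d, 1)
```

Low F0 (long period) yields a large dilation and high F0 a small one, which is what gives the vocoder its pitch controllability.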

Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder

1 code implementation21 Jul 2019 Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality.

Audio and Speech Processing Sound

Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

2 code implementations24 Jul 2019 Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

In this work, to overcome this problem, we propose to use CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized.
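The recycling idea can be illustrated with toy linear stand-ins for the encoder and decoder (the shapes, speaker codes, and tanh nonlinearity are illustrative, not the paper's architecture): converted features are fed back through the model so that a cyclic reconstruction of the source can be optimized directly, without parallel target data.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z = 8, 16                                    # toy feature / latent sizes
W_enc = rng.standard_normal((Z, D)) * 0.1
W_dec = rng.standard_normal((D, Z + 2)) * 0.1   # latent + 2-dim speaker code

def encode(x):                # toy speaker-independent encoder
    return np.tanh(W_enc @ x)

def decode(z, spk):           # toy decoder conditioned on a speaker code
    return W_dec @ np.concatenate([z, spk])

src, tgt = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x = rng.standard_normal(D)                      # source spectral frame
x_conv = decode(encode(x), tgt)                 # conversion flow (no target truth)
x_cyc = decode(encode(x_conv), src)             # recycle back to source speaker
cyclic_loss = np.mean((x_cyc - x) ** 2)         # directly optimizable loss
```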

Voice Conversion

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

3 code implementations24 Oct 2019 Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan

Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.

Automatic Speech Recognition (ASR) +1

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

1 code implementation14 Dec 2019 Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining.

Voice Conversion

Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

1 code implementation18 May 2020 Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG.

Audio and Speech Processing Sound

Many-to-Many Voice Transformer Network

no code implementations18 May 2020 Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda

The main idea we propose is an extension of the original VTN that can simultaneously learn mappings among multiple speakers.

Voice Conversion

Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

1 code implementation11 Jul 2020 Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs).

Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

1 code implementation25 Jul 2020 Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture corresponding to the auxiliary $F_{0}$ feature.

Baseline System of Voice Conversion Challenge 2020 with Cyclic Variational Autoencoder and Parallel WaveGAN

1 code implementation9 Oct 2020 Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Toda

In this paper, we present a description of the baseline system of Voice Conversion Challenge (VCC) 2020 with a cyclic variational autoencoder (CycleVAE) and Parallel WaveGAN (PWG), i.e., CycleVAEPWG.

Generative Adversarial Network Task 2 +1

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

no code implementations23 Oct 2020 Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter.

Voice Conversion

Speech Recognition by Simply Fine-tuning BERT

no code implementations30 Jan 2021 Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda

We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations.

Automatic Speech Recognition (ASR) +2

The AS-NU System for the M2VoC Challenge

no code implementations7 Apr 2021 Cheng-Hung Hu, Yi-Chiao Wu, Wen-Chin Huang, Yu-Huai Peng, Yu-Wen Chen, Pin-Jui Ku, Tomoki Toda, Yu Tsao, Hsin-Min Wang

The first track focuses on voice cloning with a relatively small set of 100 target utterances, while the second track uses only 5 target utterances.

Voice Cloning

Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN

1 code implementation10 Apr 2021 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

We propose a unified approach to data-driven source-filter modeling using a single neural network for developing a neural vocoder capable of generating high-quality synthetic speech waveforms while retaining flexibility of the source-filter model to control their voice characteristics.

Non-autoregressive sequence-to-sequence voice conversion

no code implementations14 Apr 2021 Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models.

Voice Conversion

Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction

2 code implementations20 May 2021 Patrick Lumban Tobing, Tomoki Toda

To accommodate LLRT constraint with CPU, we propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral features and is built with a sparse network architecture.

Voice Conversion

High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling

1 code implementation20 May 2021 Patrick Lumban Tobing, Tomoki Toda

This paper presents a novel high-fidelity and low-latency universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP).

Low-latency processing

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

no code implementations2 Jun 2021 Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda

First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into normal speech of a reference speaker as an intermediate product; a nonparallel, frame-wise VC model realized with a variational autoencoder then converts the speaker identity of the reference speech back to that of the patient, and is assumed to preserve the enhanced quality.

Voice Conversion

Anomalous Sound Detection Using a Binary Classification Model and Class Centroids

no code implementations11 Jun 2021 Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

Our results showed that multi-task learning using binary classification and metric learning to consider the distance from each class centroid in the feature space is effective, and performance can be significantly improved by using even a small amount of anomalous data during training.
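One way to read this setup: an embedding network is trained with both a binary (normal/anomalous) objective and a metric-learning objective, and the anomaly score then blends the classifier's output with the distance to the class centroid in the feature space. A hypothetical sketch of that score combination (the weighting `alpha` and the exact form are assumptions, not the paper's definition):

```python
import numpy as np

def class_centroids(emb, labels):
    """Mean embedding per class (e.g., per machine section)."""
    return {c: emb[labels == c].mean(axis=0) for c in np.unique(labels)}

def anomaly_score(e, centroid, p_anomalous, alpha=0.5):
    """Hypothetical combination: binary-classifier probability plus the
    distance from the class centroid in the embedding space."""
    return alpha * p_anomalous + (1 - alpha) * np.linalg.norm(e - centroid)
```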

Binary Classification Classification +2

On Prosody Modeling for ASR+TTS based Voice Conversion

no code implementations20 Jul 2021 Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech.

Automatic Speech Recognition (ASR) +2

Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion

no code implementations8 Sep 2021 Yi-Syuan Liou, Wen-Chin Huang, Ming-Chi Yen, Shu-Wei Tsai, Yu-Huai Peng, Tomoki Toda, Yu Tsao, Hsin-Min Wang

Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device.

Dynamic Time Warping Speech Enhancement +1

Generalization Ability of MOS Prediction Networks

1 code implementation6 Oct 2021 Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test.

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

2 code implementations12 Oct 2021 Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting.

Benchmarking Voice Conversion

Towards Identity Preserving Normal to Dysarthric Voice Conversion

no code implementations15 Oct 2021 Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda

We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity.

Data Augmentation Decision Making +3

LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

1 code implementation18 Oct 2021 Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda

An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores.

Voice Conversion

HASA-net: A non-intrusive hearing-aid speech assessment network

no code implementations10 Nov 2021 Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, Yu Tsao

Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations.

Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation

no code implementations12 May 2022 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

To improve the source excitation modeling and generated sound quality, a new source excitation generation network separately generating periodic and aperiodic components is proposed.

A Comparative Study of Self-supervised Speech Representation Based Voice Conversion

1 code implementation10 Jul 2022 Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Tomoki Toda

We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC).

Voice Conversion

Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder

no code implementations27 Oct 2022 Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into the parallel waveform generative adversarial network to achieve high voice quality and pitch controllability.

Generative Adversarial Network

NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit

2 code implementations28 Oct 2022 Ryuichi Yamamoto, Reo Yoneyama, Tomoki Toda

This paper describes the design of NNSVS, an open-source software for neural network-based singing voice synthesis research.

Singing Voice Synthesis

Analysis of Noisy-target Training for DNN-based speech enhancement

no code implementations2 Nov 2022 Takuya Fujimura, Tomoki Toda

To relax this limitation, Noisy-target Training (NyTT) that utilizes noisy speech as a training target has been proposed.

Speech Enhancement

Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder

no code implementations16 Dec 2022 Yusuke Yasuda, Tomoki Toda

We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and the variational autoencoder (VAE).

Representation Learning Speech Synthesis +1

Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language

no code implementations16 Dec 2022 Yusuke Yasuda, Tomoki Toda

To tackle the challenge of rendering correct pitch accent in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain for TTS.

Language Modelling Speech Synthesis +1

The Singing Voice Conversion Challenge 2023

1 code implementation26 Jun 2023 Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi, Tomoki Toda

A new database was constructed for two tasks, namely in-domain and cross-domain SVC.

Voice Conversion

Preference-based training framework for automatic speech quality assessment using deep neural network

no code implementations29 Aug 2023 Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda

We propose a training framework of SQA models that can be trained with only preference scores derived from pairs of MOS to improve ranking prediction.
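The training signal here is a preference label derived from a pair of MOS values rather than the absolute scores themselves. A minimal sketch of one plausible pairwise objective, a margin ranking loss (the margin value and exact formulation are assumptions, not necessarily the paper's loss):

```python
import numpy as np

def preference_loss(pred_a, pred_b, a_preferred, margin=0.1):
    """Hypothetical margin ranking loss on pairs: the sample preferred by the
    MOS-derived label should score higher than the other by at least `margin`."""
    sign = np.where(a_preferred, 1.0, -1.0)
    return np.maximum(0.0, margin - sign * (pred_a - pred_b)).mean()
```

Correctly ordered pairs with enough separation contribute zero loss; misordered pairs are penalized in proportion to the gap.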

Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion

1 code implementation5 Sep 2023 Wen-Chin Huang, Tomoki Toda

In this work, we evaluate three recently proposed methods for ground-truth-free FAC, where all of them aim to harness the power of sequence-to-sequence (seq2seq) and non-parallel VC models to properly convert the accent and control the speaker identity.

Voice Conversion

Audio Difference Learning for Audio Captioning

no code implementations15 Sep 2023 Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

Furthermore, a unique technique is proposed that involves mixing the input audio with additional audio, and using the additional audio as a reference.

Audio captioning

The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains

no code implementations4 Oct 2023 Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech.

Speech Synthesis Text-To-Speech Synthesis

ED-CEC: Improving Rare Word Recognition Using ASR Postprocessing Based on Error Detection and Context-Aware Error Correction

no code implementations8 Oct 2023 Jiajun He, Zekun Yang, Tomoki Toda

Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization.

Automatic Speech Recognition (ASR) +4

MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction

no code implementations24 Jan 2024 Jiajun He, Xiaohan Shi, Xingfeng Li, Tomoki Toda

Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multi-modal fusion (MF) method to learn shared representations across modalities.

Automatic Speech Recognition (ASR) +2

Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

no code implementations10 Mar 2024 Yusuke Yasuda, Tomoki Toda

A preference-based subjective evaluation is a key method for evaluating generative media reliably.

Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

no code implementations18 Mar 2024 Takuya Fujimura, Keisuke Imoto, Tomoki Toda

We propose discriminative neighborhood smoothing of generative anomaly scores for anomalous sound detection.
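One simple variant of the idea: each test clip's generative anomaly score is smoothed using the scores of its nearest neighbors in a discriminative embedding space. A hypothetical NumPy sketch (the value of `k`, the Euclidean metric, and replacing rather than blending the raw score are all assumptions):

```python
import numpy as np

def smooth_anomaly_scores(test_emb, ref_emb, ref_scores, k=2):
    """Hypothetical sketch: smooth each test score by averaging the generative
    anomaly scores of its k nearest reference clips in embedding space."""
    d = np.linalg.norm(test_emb[:, None, :] - ref_emb[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest references
    return ref_scores[nn].mean(axis=1)
```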
