Search Results for author: Junichi Yamagishi

Found 126 papers, 49 papers with code

Mitigating the Diminishing Effect of Elastic Weight Consolidation

no code implementations COLING 2022 Canasai Kruengkrai, Junichi Yamagishi

Elastic weight consolidation (EWC, Kirkpatrick et al. 2017) is a promising approach to addressing catastrophic forgetting in sequential training.

Fact Checking Natural Language Inference

Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis

1 code implementation 16 Jun 2024 Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi

The outcomes of these findings, namely, the score calibration before fusion, improved linear fusion, and better non-linear fusion, were found to be effective on the SASV challenge database.

Speaker Verification

Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio

1 code implementation 12 Jun 2024 Lin Zhang, Xin Wang, Erica Cooper, Mireia Diez, Federico Landini, Nicholas Evans, Junichi Yamagishi

As a pioneering study in spoof diarization, we focus on defining the task, establishing evaluation metrics, and proposing a benchmark model, namely the Countermeasure-Condition Clustering (3C) model.


Target Speaker Extraction with Curriculum Learning

no code implementations 12 Jun 2024 Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi

This paper presents a novel approach to target speaker extraction (TSE) using Curriculum Learning (CL) techniques, addressing the challenge of distinguishing a target speaker's voice from a mixture containing interfering speakers.

Target Speaker Extraction

To what extent can ASV systems naturally defend against spoofing attacks?

no code implementations 8 Jun 2024 Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, Joon Son Chung

The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target.

Speaker Verification

Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis

no code implementations 1 May 2024 Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

This paper investigates the effectiveness of self-supervised pre-trained transformers compared to supervised pre-trained transformers and conventional neural networks (ConvNets) for detecting various types of deepfakes.

DeepFake Detection Face Swapping +3

The VoicePrivacy 2024 Challenge Evaluation Plan

1 code implementation 3 Apr 2024 Natalia Tomashenko, Xiaoxiao Miao, Pierre Champion, Sarina Meyer, Xin Wang, Emmanuel Vincent, Michele Panariello, Nicholas Evans, Junichi Yamagishi, Massimiliano Todisco

The task of the challenge is to develop a voice anonymization system for speech data which conceals the speaker's voice identity while protecting linguistic content and emotional states.

Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model

1 code implementation 26 Mar 2024 Shirin Dabbaghi Varnosfaderani, Canasai Kruengkrai, Ramin Yahyapour, Junichi Yamagishi

FEVEROUS is a benchmark and research initiative focused on fact extraction and verification tasks involving unstructured text and structured tabular data.

Fact Verification

Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction

no code implementations 25 Dec 2023 Aditya Ravuri, Erica Cooper, Junichi Yamagishi

Predicting audio quality in voice synthesis and conversion systems is a critical yet challenging task, especially when traditional methods like Mean Opinion Scores (MOS) are cumbersome to collect at scale.

Self-Supervised Learning

XFEVER: Exploring Fact Verification across Languages

1 code implementation 25 Oct 2023 Yi-Chen Chang, Canasai Kruengkrai, Junichi Yamagishi

Experimental results show that the multilingual language model can be used to build fact verification models in different languages efficiently.

Benchmarking Fact Verification +3

The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains

no code implementations 4 Oct 2023 Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech.

Speech Synthesis Text-To-Speech Synthesis

How Close are Other Computer Vision Tasks to Deepfake Detection?

no code implementations 2 Oct 2023 Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

In this paper, we challenge the conventional belief that supervised ImageNet-trained models have strong generalizability and are suitable for use as feature extractors in deepfake detection.

DeepFake Detection Face Recognition +1

Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?

1 code implementation 12 Sep 2023 Xin Wang, Junichi Yamagishi

While many datasets use spoofed data generated by speech synthesis systems, it was recently found that data vocoded by neural vocoders were also effective as the spoofed training data.

Self-Supervised Learning Speech Synthesis

BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer

no code implementations 7 Sep 2023 Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, Taku Komura

Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of a rich cross-modal dataset that is needed for training.

Towards single integrated spoofing-aware speaker verification embeddings

1 code implementation 30 May 2023 Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung

Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge.

Speaker Verification

Range-Based Equal Error Rate for Spoof Localization

1 code implementation 28 May 2023 Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

To properly measure misclassified ranges and better evaluate spoof localization performance, we upgrade point-based EER to range-based EER.

Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech

no code implementations 17 May 2023 Erica Cooper, Junichi Yamagishi

Mean Opinion Score (MOS) is a popular measure for evaluating synthesized speech.

Hiding speaker's sex in speech using zero-evidence speaker representation in an analysis/synthesis pipeline

1 code implementation 29 Nov 2022 Paul-Gauthier Noé, Xiaoxiao Miao, Xin Wang, Junichi Yamagishi, Jean-François Bonastre, Driss Matrouf

The use of modern vocoders in an analysis/synthesis pipeline allows us to investigate high-quality voice conversion that can be used for privacy purposes.

Voice Conversion

Outlier-Aware Training for Improving Group Accuracy Disparities

no code implementations 27 Oct 2022 Li-Kuang Chen, Canasai Kruengkrai, Junichi Yamagishi

Methods addressing spurious correlations such as Just Train Twice (JTT, arXiv:2107.09044v2) involve reweighting a subset of the training set to maximize the worst-group accuracy.

Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders

1 code implementation 19 Oct 2022 Xin Wang, Junichi Yamagishi

To make better use of pairs of bona fide and spoofed data, this study introduces a contrastive feature loss that can be plugged into the standard training criterion.

Analysis of Master Vein Attacks on Finger Vein Recognition Systems

no code implementations 18 Oct 2022 Huy H. Nguyen, Trung-Nghia Le, Junichi Yamagishi, Isao Echizen

The results raise the alarm about the robustness of such systems and suggest that master vein attacks should be considered an important security measure.

Finger Vein Recognition

Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances

no code implementations 1 Sep 2022 Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi

Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring.

Data Augmentation Speaker Verification

Spoofing-Aware Attention based ASV Back-end with Multiple Enrollment Utterances and a Sampling Strategy for the SASV Challenge 2022

no code implementations 1 Sep 2022 Chang Zeng, Lin Zhang, Meng Liu, Junichi Yamagishi

Current state-of-the-art automatic speaker verification (ASV) systems are vulnerable to presentation attacks, and several countermeasures (CMs), which distinguish bona fide trials from spoofing ones, have been explored to protect ASV.

Speaker Verification

The VoicePrivacy 2020 Challenge Evaluation Plan

1 code implementation 14 May 2022 Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco

The VoicePrivacy Challenge aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges.


The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

no code implementations 11 Apr 2022 Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

Since the short spoofed speech segments to be embedded by attackers are of variable length, six different temporal resolutions are considered, ranging from as short as 20 ms to as large as 640 ms. Third, we propose a new CM that enables the simultaneous use of the segment-level labels at different temporal resolutions as well as utterance-level labels to execute utterance- and segment-level detection at the same time.

Speaker Verification Speech Synthesis +2

The VoicePrivacy 2022 Challenge Evaluation Plan

1 code implementation 23 Mar 2022 Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Hubert Nourtel, Pierre Champion, Massimiliano Todisco, Emmanuel Vincent, Nicholas Evans, Junichi Yamagishi, Jean-François Bonastre

Participants apply their developed anonymization systems, run evaluation scripts and submit objective evaluation results and anonymized speech data to the organizers.

Speaker Verification

Joint Noise Reduction and Listening Enhancement for Full-End Speech Enhancement

no code implementations 22 Mar 2022 Haoyu Li, Yun Liu, Junichi Yamagishi

Speech enhancement (SE) methods mainly focus on recovering clean speech from noisy input.

Speech Enhancement

Robust Deepfake On Unrestricted Media: Generation And Detection

no code implementations 13 Feb 2022 Trung-Nghia Le, Huy H Nguyen, Junichi Yamagishi, Isao Echizen

Recent advances in deep learning have led to substantial improvements in deepfake generation, resulting in fake media with a more realistic appearance.

DeepFake Detection Face Swapping

Optimizing Tandem Speaker Verification and Anti-Spoofing Systems

no code implementations 24 Jan 2022 Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi

As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security.

Speaker Verification

A Practical Guide to Logical Access Voice Presentation Attack Detection

1 code implementation 10 Jan 2022 Xin Wang, Junichi Yamagishi

Presentation attack detection (PAD) for ASV, or speech anti-spoofing, is therefore indispensable.

Artifact Detection Speaker Verification +2

Investigating self-supervised front ends for speech spoofing countermeasures

1 code implementation 15 Nov 2021 Xin Wang, Junichi Yamagishi

Self-supervised speech modeling is a rapidly progressing research topic, and many pre-trained models have been released and used in various downstream tasks.

Face Swapping

LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

1 code implementation 18 Oct 2021 Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda

An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores.

Voice Conversion

Revisiting Speech Content Privacy

no code implementations 13 Oct 2021 Jennifer Williams, Junichi Yamagishi, Paul-Gauthier Noe, Cassia Valentini Botinhao, Jean-Francois Bonastre

In this paper, we discuss an important aspect of speech privacy: protecting spoken content.

Estimating the confidence of speech spoofing countermeasure

1 code implementation 10 Oct 2021 Xin Wang, Junichi Yamagishi

On the ASVspoof2019 logical access database, the results demonstrate that an energy-based estimator and a neural-network-based one achieved acceptable performance in identifying unknown attacks in the test set.

Generalization Ability of MOS Prediction Networks

1 code implementation 6 Oct 2021 Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test.

DDS: A new device-degraded speech dataset for speech enhancement

no code implementations 16 Sep 2021 Haoyu Li, Junichi Yamagishi

A large and growing amount of speech content in real-life scenarios is being recorded on consumer-grade devices in uncontrolled environments, resulting in degraded speech quality.

Speech Enhancement

Master Face Attacks on Face Recognition Systems

no code implementations 8 Sep 2021 Huy H. Nguyen, Sébastien Marcel, Junichi Yamagishi, Isao Echizen

Previous work has proven the existence of master faces, i.e., faces that match multiple enrolled templates in face recognition systems, and their existence extends the ability of presentation attacks.

Face Recognition

ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

no code implementations 1 Sep 2021 Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado

In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task involving deepfake speech detection.

Face Swapping Speaker Verification

ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

1 code implementation 1 Sep 2021 Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi

The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures.

Face Swapping Speaker Verification

OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild

no code implementations ICCV 2021 Trung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

To promote these new tasks, we have created the first large-scale dataset posing a high level of challenges that is designed with face-wise rich annotations explicitly for face forgery detection and segmentation, namely OpenForensics.

Face Detection Face Swapping +1

Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds

1 code implementation 24 Jul 2021 Xuan Shi, Erica Cooper, Junichi Yamagishi

Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer.

Data Augmentation Instrument Recognition +4

SVSNet: An End-to-end Speaker Voice Similarity Assessment Model

no code implementations 20 Jul 2021 Cheng-Hung Hu, Yu-Huai Peng, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang

Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention.

Voice Conversion Voice Similarity

Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

no code implementations 25 Jun 2021 Hieu-Thi Luong, Junichi Yamagishi

Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given to the hidden layers.

Quantization Speech Synthesis +1

Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing

1 code implementation 11 Jun 2021 Tomi Kinnunen, Andreas Nautsch, Md Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee

Whether it be for results summarization, or the analysis of classifier fusion, some means to compare different classifiers can often provide illuminating insight into their behaviour, (dis)similarity or complementarity.

Speaker Verification Voice Anti-spoofing

A Multi-Level Attention Model for Evidence-Based Fact Checking

2 code implementations Findings (ACL) 2021 Canasai Kruengkrai, Junichi Yamagishi, Xin Wang

Evidence-based fact checking aims to verify the truthfulness of a claim against evidence extracted from textual sources.

Fact Checking Sentence

Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

1 code implementation 4 May 2021 Jennifer Williams, Jason Fong, Erica Cooper, Junichi Yamagishi

This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data.


Multi-Metric Optimization using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement

1 code implementation 17 Apr 2021 Haoyu Li, Junichi Yamagishi

The intelligibility of speech severely degrades in the presence of environmental noise and reverberation.

Fashion-Guided Adversarial Attack on Person Segmentation

1 code implementation 17 Apr 2021 Marc Treu, Trung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

It generates adversarial textures learned from fashion style images and then overlays them on the clothing regions in the original image to make all persons in the image invisible to person segmentation networks.

Adversarial Attack Human Instance Segmentation +2

An Initial Investigation for Detecting Partially Spoofed Audio

no code implementations 6 Apr 2021 Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans

By definition, partially-spoofed utterances contain a mix of both spoofed and bona fide segments, which will likely degrade the performance of countermeasures trained with entirely spoofed utterances.

Voice Anti-spoofing

Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

1 code implementation 4 Apr 2021 Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

Probabilistic linear discriminant analysis (PLDA) or cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities.

Speaker Verification

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

no code implementations 10 Nov 2020 Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi

We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis.

Speech Synthesis

An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems

no code implementations 21 Oct 2020 Antoine Perquin, Erica Cooper, Junichi Yamagishi

Thanks to this property, we show that grapheme embeddings learned by Tacotron models can be useful for tasks such as grapheme-to-phoneme conversion and control of the pronunciation in synthetic speech.

Relation Speech Synthesis +1

Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

1 code implementation 21 Oct 2020 Jennifer Williams, Yi Zhao, Erica Cooper, Junichi Yamagishi

Additionally, phones can be recognized from sub-phone VQ codebook indices in our semi-supervised VQ-VAE better than self-supervised with global conditions.

Speaker Diarization +1

End-to-End Text-to-Speech using Latent Duration based on VQ-VAE

no code implementations 19 Oct 2020 Yusuke Yasuda, Xin Wang, Junichi Yamagishi

Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS).

Speech Synthesis Text-To-Speech Synthesis

Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

no code implementations 8 Oct 2020 Hieu-Thi Luong, Junichi Yamagishi

As the recently proposed voice cloning system, NAUTILUS, is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system.

Voice Cloning Voice Conversion

Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals

no code implementations 12 Jul 2020 Tomi Kinnunen, Héctor Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds

Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs.

Speaker Verification

Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems

no code implementations 15 Jun 2020 Huy H. Nguyen, Junichi Yamagishi, Isao Echizen, Sébastien Marcel

In this work, we demonstrated that wolf (generic) faces, which we call "master faces," can also compromise face recognition systems and that the master face concept can be generalized in some cases.

Face Recognition

NAUTILUS: a Versatile Voice Cloning System

no code implementations 22 May 2020 Hieu-Thi Luong, Junichi Yamagishi

By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.

Speech Synthesis Voice Cloning +1

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

no code implementations 20 May 2020 Yusuke Yasuda, Xin Wang, Junichi Yamagishi

Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.

Speech Synthesis Text-To-Speech Synthesis

The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment

2 code implementations 19 May 2020 Andreas Nautsch, Jose Patino, Natalia Tomashenko, Junichi Yamagishi, Paul-Gauthier Noe, Jean-Francois Bonastre, Massimiliano Todisco, Nicholas Evans

Mounting privacy legislation calls for the preservation of privacy in speech technology, though solutions are gravely lacking.

Cryptography and Security Audio and Speech Processing

Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?

1 code implementation 4 May 2020 Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi

This is followed by an analysis on synthesis quality, speaker and dialect similarity, and a remark on the effectiveness of our speaker augmentation approach.

Speech Synthesis

Introducing the VoicePrivacy Initiative

3 code implementations 4 May 2020 Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco

The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges.


Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement

no code implementations 8 Apr 2020 Haoyu Li, Junichi Yamagishi

In recent years, speech enhancement (SE) has achieved impressive progress with the success of deep neural networks (DNNs).

Audio and Speech Processing

iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning

1 code implementation Interspeech 2020 Haoyu Li, Szu-Wei Fu, Yu Tsao, Junichi Yamagishi

The intelligibility of natural speech is seriously degraded when exposed to adverse noisy environments.

Audio and Speech Processing Sound

Detecting and Correcting Adversarial Images Using Image Processing Operations

no code implementations 11 Dec 2019 Huy H. Nguyen, Minoru Kuribayashi, Junichi Yamagishi, Isao Echizen

Deep neural networks (DNNs) have achieved excellent performance on several tasks and have been widely applied in both academia and industry.

BIG-bench Machine Learning Object Recognition

Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model

1 code implementation 10 Nov 2019 Seyyed Saeed Sarfjoo, Xin Wang, Gustav Eje Henter, Jaime Lorenzo-Trueba, Shinji Takaki, Junichi Yamagishi

Nowadays vast amounts of speech data are recorded from low-quality recorder devices such as smartphones, tablets, laptops, and medium-quality microphones.

Sound Audio and Speech Processing

A Method for Identifying Origin of Digital Images Using a Convolution Neural Network

no code implementations 2 Nov 2019 Rong Huang, Fuming Fang, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

The rapid development of deep learning techniques has created new challenges in identifying the origin of digital images because generative adversarial networks and variational autoencoders can create plausible digital images whose contents are not present in natural scenes.

Security of Facial Forensics Models Against Adversarial Attacks

no code implementations 2 Nov 2019 Rong Huang, Fuming Fang, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

We experimentally demonstrated the existence of individual adversarial perturbations (IAPs) and universal adversarial perturbations (UAPs) that can lead a well-performed FFM to misbehave.

Use of a Capsule Network to Detect Fake Images and Videos

2 code implementations 28 Oct 2019 Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

In this paper, we introduce a capsule network that can detect various kinds of attacks, from presentation attacks using printed images and replayed videos to attacks using fake videos created using deep learning.

Image and Video Forgery Detection

Transferring neural speech waveform synthesizers to musical instrument sounds generation

no code implementations 27 Oct 2019 Yi Zhao, Xin Wang, Lauri Juvela, Junichi Yamagishi

Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation.

Audio Generation Audio Synthesis +2

Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings

3 code implementations 23 Oct 2019 Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi

While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers.

Audio and Speech Processing

Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech

no code implementations 14 Sep 2019 Hieu-Thi Luong, Junichi Yamagishi

Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a similar objective, generating speech with a target voice.

Voice Conversion

Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

no code implementations 30 Aug 2019 Yusuke Yasuda, Xin Wang, Junichi Yamagishi

The advantages of our approach are that we can simplify many modules for the soft attention and that we can train the end-to-end TTS model using a single likelihood function.


Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human- and Machine-based Detection

no code implementations 22 Jul 2019 David Ifeoluwa Adelani, Haotian Mai, Fuming Fang, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

Advanced neural language models (NLMs) are widely used in sequence generation tasks because they are able to produce fluent and meaningful sentences.

A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

no code implementations 18 Jun 2019 Hieu-Thi Luong, Junichi Yamagishi

In this study, we propose a novel speech synthesis model, which can be adapted to unseen speakers by fine-tuning part of or all of the network using either transcribed or untranscribed speech.

Decoder Speech Synthesis

Multi-task Learning For Detecting and Segmenting Manipulated Facial Images and Videos

1 code implementation 17 Jun 2019 Huy H. Nguyen, Fuming Fang, Junichi Yamagishi, Isao Echizen

The output of one branch of the decoder is used for segmenting the manipulated regions while that of the other branch is used for reconstructing the input, which helps improve overall performance.

Binary Classification Decoder +3

Neural source-filter waveform models for statistical parametric speech synthesis

no code implementations 27 Apr 2019 Xin Wang, Shinji Takaki, Junichi Yamagishi

Other models such as Parallel WaveNet and ClariNet bring together the benefits of AR and IAF-based models and train an IAF model by transferring the knowledge from a pre-trained AR teacher to an IAF student without any sequential transformation.

Speech Synthesis

MOSNet: Deep Learning based Objective Assessment for Voice Conversion

6 code implementations 17 Apr 2019 Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang

In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.

Voice Conversion

GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram

1 code implementation 8 Apr 2019 Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

Recent advances in neural network-based text-to-speech have reached human-level naturalness in synthetic speech.

Speech Synthesis

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora

no code implementations1 Apr 2019 Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

When the available data of a target speaker is insufficient to train a high quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead.

Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet

no code implementations29 Mar 2019 Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi

We propose using an extended Tacotron architecture, that is, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks.

Decoder Speech Synthesis +1

Training a Neural Speech Waveform Model using Spectral Losses of Short-Time Fourier Transform and Continuous Wavelet Transform

no code implementations29 Mar 2019 Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi

Recently, we proposed short-time Fourier transform (STFT)-based loss functions for training a neural speech waveform model.

Introduction to Voice Presentation Attack Detection and Recent Advances

no code implementations4 Jan 2019 Md Sahidullah, Hector Delgado, Massimiliano Todisco, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Kong-Aik Lee

Over the past few years, significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker verification (ASV).

Benchmarking Speaker Recognition

Attentive Filtering Networks for Audio Replay Attack Detection

1 code implementation31 Oct 2018 Cheng-I Lai, Alberto Abad, Korin Richmond, Junichi Yamagishi, Najim Dehak, Simon King

In this work, we propose a replay attack detection system, the Attentive Filtering Network, which is composed of an attention-based filtering mechanism that enhances feature representations in both the frequency and time domains, and a ResNet-based classifier.

Speaker Verification

Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

no code implementations30 Oct 2018 Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku

The state-of-the-art in text-to-speech synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet.

Image Generation Speech Synthesis +2

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language

1 code implementation29 Oct 2018 Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi

Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents and compare their audio quality with classical pipeline systems under various conditions to show their pros and cons.

Speech Synthesis Text-To-Speech Synthesis

STFT spectral loss for training a neural speech waveform model

1 code implementation29 Oct 2018 Shinji Takaki, Toru Nakashika, Xin Wang, Junichi Yamagishi

This paper proposes a new loss based on short-time Fourier transform (STFT) spectra for training a high-performance neural speech waveform model that predicts raw continuous speech waveform samples directly.

Neural source-filter-based waveform model for statistical parametric speech synthesis

no code implementations29 Oct 2018 Xin Wang, Shinji Takaki, Junichi Yamagishi

Neural waveform models such as the WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure.

Speech Synthesis

Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics

no code implementations29 Oct 2018 Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen

Transforming the facial and acoustic features together makes it possible for the converted voice and facial expressions to be highly correlated and for the generated target speaker to appear and sound natural.

Image Reconstruction

Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos

3 code implementations26 Oct 2018 Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

Recent advances in media generation techniques have made it easier for attackers to create forged images and videos.

Image and Video Forgery Detection

MesoNet: a Compact Facial Video Forgery Detection Network

7 code implementations4 Sep 2018 Darius Afchar, Vincent Nozick, Junichi Yamagishi, Isao Echizen

This paper presents a method to automatically and efficiently detect face tampering in videos, and particularly focuses on two recent techniques used to generate hyper-realistic forged videos: Deepfake and Face2Face.

DeepFake Detection Face Swapping +2

Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

no code implementations2 Aug 2018 Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

We investigated the impact of noisy linguistic features on the performance of a neural-network-based Japanese speech synthesis system that uses a WaveNet vocoder.

Denoising Speech Synthesis

Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

no code implementations31 Jul 2018 Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, Nobuaki Minematsu

In order to reduce the mismatched characteristics between natural and generated acoustic features, we propose frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses the WaveNet vocoder.

Generative Adversarial Network Speech Synthesis +1

Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems

no code implementations31 Jul 2018 Hieu-Thi Luong, Junichi Yamagishi

Most neural-network-based speaker-adaptive acoustic models for speech synthesis can be categorized into either layer-based or input-code approaches.

Speech Synthesis

Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

no code implementations30 Jul 2018 Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Junichi Yamagishi

Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text.

Acoustic Modelling Decoder +2

Speaker-independent raw waveform model for glottal excitation

no code implementations25 Apr 2018 Lauri Juvela, Vassilis Tsiaras, Bajibabu Bollepalli, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features.

Speech Synthesis Text-To-Speech Synthesis +1

t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification

1 code implementation25 Apr 2018 Tomi Kinnunen, Kong Aik Lee, Hector Delgado, Nicholas Evans, Massimiliano Todisco, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds

The two challenge editions in 2015 and 2017 involved the assessment of spoofing countermeasures (CMs) in isolation from ASV using an equal error rate (EER) metric.

Speaker Verification

A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment

no code implementations23 Apr 2018 Tomi Kinnunen, Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Zhen-Hua Ling

As a supplement to subjective results for the 2018 Voice Conversion Challenge (VCC'18) data, we configure a standard constant-Q cepstral coefficient CM to quantify the extent of processing artifacts.

Benchmarking Speaker Verification +1

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

no code implementations12 Apr 2018 Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhen-Hua Ling

We present the Voice Conversion Challenge 2018, designed as a follow-up to the 2016 edition with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems.

Voice Conversion

Transformation on Computer-Generated Facial Image to Avoid Detection by Spoofing Detector

no code implementations12 Apr 2018 Huy H. Nguyen, Ngoc-Dung T. Tieu, Hoang-Quoc Nguyen-Son, Junichi Yamagishi, Isao Echizen

Making computer-generated (CG) images more difficult to detect is an interesting problem in computer graphics and security.

A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis

no code implementations7 Apr 2018 Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi

Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches.

Speech Synthesis

Speech waveform synthesis from MFCC sequences with generative adversarial networks

1 code implementation3 Apr 2018 Lauri Juvela, Bajibabu Bollepalli, Xin Wang, Hirokazu Kameoka, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis.

Generative Adversarial Network Speech Synthesis

High-quality nonparallel voice conversion based on cycle-consistent adversarial network

no code implementations2 Apr 2018 Fuming Fang, Junichi Yamagishi, Isao Echizen, Jaime Lorenzo-Trueba

Although voice conversion (VC) algorithms have achieved remarkable success along with the development of machine learning, superior performance is still difficult to achieve when using nonparallel data.

Generative Adversarial Network Image-to-Image Translation +4

Complex-Valued Restricted Boltzmann Machine for Direct Speech Parameterization from Complex Spectra

no code implementations27 Mar 2018 Toru Nakashika, Shinji Takaki, Junichi Yamagishi

In contrast, the proposed feature extractor using the CRBM directly encodes the complex spectra (or another complex-valued representation of the complex spectra) into binary-valued latent features (hidden units).


Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data

no code implementations2 Mar 2018 Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Junichi Yamagishi, Tomi Kinnunen

Thanks to the growing availability of spoofing databases and rapid advances in using them, systems for detecting voice spoofing attacks are becoming more and more capable, and error rates close to zero are being reached for the ASVspoof2015 database.

Generative Adversarial Network Speech Enhancement +2

Deep Denoising Auto-encoder for Statistical Speech Synthesis

no code implementations17 Jun 2015 Zhenzhou Wu, Shinji Takaki, Junichi Yamagishi

This paper proposes a deep denoising auto-encoder technique to extract better acoustic features for speech synthesis.

Denoising Speech Synthesis
