Search Results for author: Berrak Sisman

Found 32 papers, 8 papers with code

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

2 code implementations · 28 Oct 2020 · Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.

Generative Adversarial Network · Speech Emotion Recognition +2

Emotional Voice Conversion: Theory, Databases and ESD

1 code implementation · 31 May 2021 · Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases.

Voice Conversion

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

1 code implementation · 1 Feb 2020 · Kun Zhou, Berrak Sisman, Haizhou Li

Many studies require parallel speech data between different emotional patterns, which is not practical in real life.

Voice Conversion

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

2 code implementations · 31 Mar 2021 · Kun Zhou, Berrak Sisman, Haizhou Li

In stage 2, we perform emotion training with a limited amount of emotional speech data, to learn how to disentangle emotional style and linguistic information from the speech.

Voice Conversion

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

1 code implementation · 13 May 2020 · Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

We consider that there is a common code between speakers for emotional expression in a spoken language; therefore, a speaker-independent mapping between emotional states is possible.

Voice Conversion

StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

1 code implementation · 7 Oct 2021 · Rui Liu, Berrak Sisman, Haizhou Li

The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function.

Attribute · Data Augmentation +2
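The strength descriptor above is produced by an emotion attribute ranking function. As a rough illustration only, not the paper's implementation, here is a relative-attribute-style scorer; the linear ranking direction `w` and the sigmoid squashing are assumptions:

```python
import torch

def emotion_strength(features: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Hypothetical strength descriptor: project acoustic features onto a learned
    ranking direction w and squash the score to [0, 1]."""
    # Larger projections correspond to stronger perceived emotion; the resulting
    # scalar can then condition an emotional TTS model to control expressiveness.
    return torch.sigmoid(features @ w)
```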

Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning

1 code implementation · 15 Jun 2022 · Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li

In this paper, we propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.

Attribute · Emotion Classification +2

Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

1 code implementation · 7 Nov 2022 · Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

Accent plays a significant role in speech communication, influencing understanding capabilities and also conveying a person's identity.

Speech Synthesis · Text-To-Speech Synthesis

VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

no code implementations · 27 May 2019 · Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, Satoshi Nakamura

Our proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.

Clustering

Teacher-Student Training for Robust Tacotron-based TTS

no code implementations · 7 Nov 2019 · Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, Haizhou Li

We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher.

Knowledge Distillation

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

no code implementations · 2 Feb 2020 · Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the waveform loss, which measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss, which measures the Mel-scale acoustic feature loss between the natural and generated acoustic features.
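A minimal sketch of such a joint time-frequency objective is given below. This is not the authors' code; the L1 distances, loss weights, and the torchaudio Mel analysis settings are assumptions.

```python
import torch
import torch.nn.functional as F
import torchaudio

# Assumed Mel analysis settings; the paper's configuration may differ.
mel_analyzer = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

def joint_time_frequency_loss(generated_wav, natural_wav, alpha=1.0, beta=1.0):
    """Combine a time-domain (waveform) loss with a frequency-domain (Mel) loss."""
    # 1) Time-domain loss: distortion between the natural and generated waveforms.
    waveform_loss = F.l1_loss(generated_wav, natural_wav)
    # 2) Frequency-domain loss: distance between Mel-scale acoustic features.
    mel_loss = F.l1_loss(
        torch.log(mel_analyzer(generated_wav) + 1e-6),
        torch.log(mel_analyzer(natural_wav) + 1e-6),
    )
    return alpha * waveform_loss + beta * mel_loss
```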

VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

no code implementations · 10 Aug 2020 · Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li

We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.

Generative Adversarial Network · Voice Conversion

Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

no code implementations · 11 Aug 2020 · Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li

It relies on non-parallel training data from two different languages; hence, it is more challenging than mono-lingual voice conversion.

Voice Conversion

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

no code implementations · 11 Aug 2020 · Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

We propose a multi-task learning scheme for Tacotron training, that optimizes the system to predict both Mel spectrum and phrase breaks.

Multi-Task Learning · Speech Synthesis
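A minimal sketch of such a multi-task objective is shown below; it is illustrative only, and the phrase-break classification head, loss weight, and tensor shapes are assumptions rather than the paper's code.

```python
import torch.nn.functional as F

def multitask_tacotron_loss(mel_pred, mel_target, break_logits, break_labels, lam=0.1):
    """Jointly optimize Mel-spectrum regression and phrase-break prediction."""
    # Primary TTS objective: reconstruct the Mel spectrum.
    mel_loss = F.l1_loss(mel_pred, mel_target)
    # Auxiliary objective: classify each input token as break / no-break.
    break_loss = F.cross_entropy(
        break_logits.view(-1, break_logits.size(-1)),  # (batch * tokens, n_classes)
        break_labels.view(-1),                          # (batch * tokens,)
    )
    return mel_loss + lam * break_loss
```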

GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

no code implementations · 23 Oct 2020 · Rui Liu, Berrak Sisman, Haizhou Li

Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways.

Graph Attention · Sentence +2

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

no code implementations · 3 Nov 2020 · Kun Zhou, Berrak Sisman, Haizhou Li

Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity.

Disentanglement · Generative Adversarial Network +1

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

no code implementations · 7 Oct 2021 · Junchen Lu, Berrak Sisman, Rui Liu, Mingyang Zhang, Haizhou Li

The proposed VisualTTS adopts two novel mechanisms, 1) textual-visual attention and 2) a visual fusion strategy during acoustic decoding, both of which contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence.

Speech Synthesis
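One way to picture the textual-visual attention above is as cross-attention from text-side decoder states to lip-motion features, followed by a simple fusion step. The sketch below is a generic illustration under those assumptions, not the VisualTTS architecture; the dimensions and the concatenation-based fusion layer are hypothetical.

```python
import torch
import torch.nn as nn

class TextualVisualAttention(nn.Module):
    """Illustrative cross-attention from text-side decoder states to lip features."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)  # hypothetical visual fusion

    def forward(self, text_states, lip_feats):
        # text_states: (batch, text_len, d_model); lip_feats: (batch, video_len, d_model)
        visual_context, _ = self.attn(query=text_states, key=lip_feats, value=lip_feats)
        # Fuse the attended visual context back into the decoder states.
        return self.fuse(torch.cat([text_states, visual_context], dim=-1))
```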

DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding

no code implementations · 13 Oct 2021 · Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li

At the same time, as the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.

Speech Synthesis · Voice Conversion

Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion

no code implementations · 20 Oct 2021 · Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li

Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style.

Disentanglement · Voice Conversion

Speech Synthesis with Mixed Emotions

no code implementations · 11 Aug 2022 · Kun Zhou, Berrak Sisman, Rajib Rana, B. W. Schuller, Haizhou Li

We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.

Attribute · Emotional Speech Synthesis

Controllable Accented Text-to-Speech Synthesis

no code implementations · 22 Sep 2022 · Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

Accented TTS synthesis is challenging because L2 differs from L1 in terms of both phonetic rendering and prosody pattern.

Speech Synthesis · Text-To-Speech Synthesis

EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

no code implementations · 22 Sep 2022 · Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman

In this work, we seek to answer the question: what are the effects of selected sparsity techniques on performance and model complexity?

Speech Synthesis · Text-To-Speech Synthesis

Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

no code implementations · 25 Oct 2022 · Kun Zhou, Berrak Sisman, Carlos Busso, Bin Ma, Haizhou Li

To achieve this, we propose a novel EVC framework, Mixed-EVC, which only leverages discrete emotion training labels.

Attribute · Voice Conversion

SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech

no code implementations · 14 Nov 2022 · Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary.

Versatile Audio-Visual Learning for Handling Single and Multi Modalities in Emotion Regression and Classification Tasks

no code implementations · 12 May 2023 · Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso

This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems for emotion regression and emotion classification tasks.

Arousal Estimation · Attribute +7

Enhancing Speech Emotion Recognition Through Differentiable Architecture Search

no code implementations · 23 May 2023 · Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Björn Schuller

In contrast to previous studies, we refrain from imposing constraints on the order of the layers for the CNN within the DARTS cell; instead, we allow DARTS to determine the optimal layer order autonomously.

Neural Architecture Search · Speech Emotion Recognition

Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

no code implementations · 19 Jan 2024 · Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman

In order to leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.

Contrastive Learning · Speech Emotion Recognition
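As a generic illustration of contrastive pretraining on unlabeled embeddings, the sketch below uses an InfoNCE-style loss; the positive-pair construction and temperature are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: (z1[i], z2[i]) are positive pairs; other rows are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```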
