Search Results for author: Berrak Sisman

Found 32 papers, 8 papers with code

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

2 code implementations · 28 Oct 2020 · Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.

Generative Adversarial Network · Speech Emotion Recognition +2

Emotional Voice Conversion: Theory, Databases and ESD

1 code implementation · 31 May 2021 · Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases.

Voice Conversion

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

1 code implementation · 1 Feb 2020 · Kun Zhou, Berrak Sisman, Haizhou Li

Many studies require parallel speech data between different emotional patterns, which is not practical in real life.

Voice Conversion

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

2 code implementations · 31 Mar 2021 · Kun Zhou, Berrak Sisman, Haizhou Li

In stage 2, we perform emotion training with a limited amount of emotional speech data, to learn how to disentangle emotional style and linguistic information from the speech.

Voice Conversion

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

1 code implementation · 13 May 2020 · Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

We consider that there is a common code between speakers for emotional expression in a spoken language; therefore, a speaker-independent mapping between emotional states is possible.

Voice Conversion

StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis

1 code implementation · 7 Oct 2021 · Rui Liu, Berrak Sisman, Haizhou Li

The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function.

Attribute · Data Augmentation +2
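The strength descriptor above is produced by an emotion attribute ranking function. As a rough illustration only, not the paper's implementation, here is a relative-attribute-style scorer; the linear ranking direction `w` and the sigmoid squashing are assumptions:

```python
import torch

def emotion_strength(features: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Hypothetical strength descriptor: project acoustic features onto a learned
    ranking direction w and squash the score to [0, 1]."""
    # Larger projections correspond to stronger perceived emotion; the resulting
    # scalar can then condition an emotional TTS model to control expressiveness.
    return torch.sigmoid(features @ w)
```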

Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning

1 code implementation · 15 Jun 2022 · Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li

In this paper, we propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.

Attribute · Emotion Classification +2

Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

1 code implementation · 7 Nov 2022 · Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

Accent plays a significant role in speech communication, influencing understanding capabilities and also conveying a person's identity.

Speech Synthesis · Text-To-Speech Synthesis

VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

no code implementations · 27 May 2019 · Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, Satoshi Nakamura

Our proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.

Clustering

Teacher-Student Training for Robust Tacotron-based TTS

no code implementations · 7 Nov 2019 · Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, Haizhou Li

We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher.

Knowledge Distillation

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

no code implementations · 2 Feb 2020 · Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted as the waveform loss, which measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss, which measures the Mel-scale acoustic feature loss between the natural and generated acoustic features.
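A minimal sketch of such a joint time-frequency objective is given below. This is not the authors' code; the L1 distances, loss weights, and the torchaudio Mel analysis settings are assumptions.

```python
import torch
import torch.nn.functional as F
import torchaudio

# Assumed Mel analysis settings; the paper's configuration may differ.
mel_analyzer = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

def joint_time_frequency_loss(generated_wav, natural_wav, alpha=1.0, beta=1.0):
    """Combine a time-domain (waveform) loss with a frequency-domain (Mel) loss."""
    # 1) Time-domain loss: distortion between the natural and generated waveforms.
    waveform_loss = F.l1_loss(generated_wav, natural_wav)
    # 2) Frequency-domain loss: distance between Mel-scale acoustic features.
    mel_loss = F.l1_loss(
        torch.log(mel_analyzer(generated_wav) + 1e-6),
        torch.log(mel_analyzer(natural_wav) + 1e-6),
    )
    return alpha * waveform_loss + beta * mel_loss
```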

VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

no code implementations · 10 Aug 2020 · Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li

We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.

Generative Adversarial Network · Voice Conversion

Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

no code implementations · 11 Aug 2020 · Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li

It relies on non-parallel training data from two different languages; hence, it is more challenging than mono-lingual voice conversion.

Voice Conversion

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

no code implementations · 11 Aug 2020 · Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

We propose a multi-task learning scheme for Tacotron training, that optimizes the system to predict both Mel spectrum and phrase breaks.

Multi-Task Learning · Speech Synthesis
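A minimal sketch of such a multi-task objective is shown below; it is illustrative only, and the phrase-break classification head, loss weight, and tensor shapes are assumptions rather than the paper's code.

```python
import torch.nn.functional as F

def multitask_tacotron_loss(mel_pred, mel_target, break_logits, break_labels, lam=0.1):
    """Jointly optimize Mel-spectrum regression and phrase-break prediction."""
    # Primary TTS objective: reconstruct the Mel spectrum.
    mel_loss = F.l1_loss(mel_pred, mel_target)
    # Auxiliary objective: classify each input token as break / no-break.
    break_loss = F.cross_entropy(
        break_logits.view(-1, break_logits.size(-1)),  # (batch * tokens, n_classes)
        break_labels.view(-1),                          # (batch * tokens,)
    )
    return mel_loss + lam * break_loss
```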

GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis

no code implementations · 23 Oct 2020 · Rui Liu, Berrak Sisman, Haizhou Li

Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways.

Graph Attention · Sentence +2

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

no code implementations · 3 Nov 2020 · Kun Zhou, Berrak Sisman, Haizhou Li

Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity.

Disentanglement · Generative Adversarial Network +1

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

no code implementations · 7 Oct 2021 · Junchen Lu, Berrak Sisman, Rui Liu, Mingyang Zhang, Haizhou Li

The proposed VisualTTS adopts two novel mechanisms, 1) textual-visual attention and 2) a visual fusion strategy during acoustic decoding, both of which contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence.

Speech Synthesis
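One way to picture the textual-visual attention above is as cross-attention from text-side decoder states to lip-motion features, followed by a simple fusion step. The sketch below is a generic illustration under those assumptions, not the VisualTTS architecture; the dimensions and the concatenation-based fusion layer are hypothetical.

```python
import torch
import torch.nn as nn

class TextualVisualAttention(nn.Module):
    """Illustrative cross-attention from text-side decoder states to lip features."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)  # hypothetical visual fusion

    def forward(self, text_states, lip_feats):
        # text_states: (batch, text_len, d_model); lip_feats: (batch, video_len, d_model)
        visual_context, _ = self.attn(query=text_states, key=lip_feats, value=lip_feats)
        # Fuse the attended visual context back into the decoder states.
        return self.fuse(torch.cat([text_states, visual_context], dim=-1))
```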

DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding

no code implementations · 13 Oct 2021 · Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li

At the same time, as the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.

Speech Synthesis · Voice Conversion

Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion

no code implementations · 20 Oct 2021 · Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li

Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style.

Disentanglement · Voice Conversion

Speech Synthesis with Mixed Emotions

no code implementations · 11 Aug 2022 · Kun Zhou, Berrak Sisman, Rajib Rana, B. W. Schuller, Haizhou Li

We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.

Attribute · Emotional Speech Synthesis

Controllable Accented Text-to-Speech Synthesis

no code implementations · 22 Sep 2022 · Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

Accented TTS synthesis is challenging because L2 differs from L1 in terms of both phonetic rendering and prosody pattern.

Speech Synthesis · Text-To-Speech Synthesis

EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

no code implementations · 22 Sep 2022 · Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman

In this work, we seek to answer the question: what are the effects of selected sparsity techniques on performance and model complexity?

Speech Synthesis · Text-To-Speech Synthesis

Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

no code implementations · 25 Oct 2022 · Kun Zhou, Berrak Sisman, Carlos Busso, Bin Ma, Haizhou Li

To achieve this, we propose a novel EVC framework, Mixed-EVC, which only leverages discrete emotion training labels.

Attribute · Voice Conversion

SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech

no code implementations · 14 Nov 2022 · Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary.

Versatile Audio-Visual Learning for Handling Single and Multi Modalities in Emotion Regression and Classification Tasks

no code implementations · 12 May 2023 · Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso

This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems for emotion regression and emotion classification tasks.

Arousal Estimation · Attribute +7

Enhancing Speech Emotion Recognition Through Differentiable Architecture Search

no code implementations · 23 May 2023 · Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Björn Schuller

In contrast to previous studies, we refrain from imposing constraints on the order of the layers for the CNN within the DARTS cell; instead, we allow DARTS to determine the optimal layer order autonomously.

Neural Architecture Search · Speech Emotion Recognition

Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

no code implementations · 19 Jan 2024 · Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman

In order to leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.

Contrastive Learning · Speech Emotion Recognition
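As a generic illustration of contrastive pretraining on unlabeled embeddings, the sketch below uses an InfoNCE-style loss; the positive-pair construction and temperature are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: (z1[i], z2[i]) are positive pairs; other rows are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```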
