no code implementations • 17 Oct 2024 • Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans
Recent advancements in Text-to-Speech (TTS) systems have enabled the generation of natural and expressive speech from textual input.
no code implementations • 17 Sep 2024 • Philip H. Lee, Ismail Rasim Ulgen, Berrak Sisman
Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content.
no code implementations • 30 Aug 2024 • Ismail Rasim Ulgen, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman
With SelectTTS, we show that frame selection from the target speaker's speech is a direct way to achieve generalization in unseen speakers with low model complexity.
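The core operation lends itself to a compact illustration. Below is a minimal sketch of frame selection, assuming frame-level features (e.g. from a self-supervised speech model) have already been extracted; the array shapes and cosine-similarity matching are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_frames(predicted, reference):
    """Replace each predicted frame with its nearest (cosine) reference frame."""
    pred = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sim = pred @ ref.T                    # (T_pred, T_ref) cosine similarities
    return reference[sim.argmax(axis=1)]  # pick the best-matching target frame

# Hypothetical shapes: 200 synthesis frames, 500 target-speaker frames, dim 768.
predicted = np.random.randn(200, 768)
reference = np.random.randn(500, 768)
print(select_frames(predicted, reference).shape)  # (200, 768)
```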
1 code implementation • 13 Aug 2024 • Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans
We attain 25.3% hanzi CER and 13.0% pinyin CER with the JETS model.
no code implementations • 5 Jul 2024 • Ismail Rasim Ulgen, Carlos Busso, John H. L. Hansen, Berrak Sisman
The proposed approach introduces variations in the speaker embedding while retaining speaker recognition performance, since the model does not have to map all of a speaker's utterances to a single class center.
1 code implementation • Odyssey: The Speaker and Language Recognition Workshop 2024 • Lucas Goncalves, Ali N. Salman, Abinay R. Naini, Laureano Moro Velazquez, Thomas Thebaud, Leibny Paola Garcia, Najim Dehak, Berrak Sisman, Carlos Busso
The Odyssey 2024 Speech Emotion Recognition (SER) Challenge aims to enhance innovation in recognizing emotions from spontaneous speech, moving beyond traditional datasets derived from acted scenarios.
no code implementations • 6 Jun 2024 • Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Carlos Busso, Berrak Sisman
Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations.
no code implementations • 5 Jun 2024 • Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman
The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts (MoE) layer.
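To make the idea concrete, here is a minimal sketch of a style encoder realised as a Mixture of Experts layer, assuming soft gating over a handful of experts; the expert count, dimensions, and gating scheme are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class StyleMoE(nn.Module):
    """A style encoder stand-in: several small experts blended by a soft gate."""
    def __init__(self, in_dim=256, style_dim=128, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, style_dim), nn.Tanh())
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(in_dim, num_experts)  # soft routing weights

    def forward(self, x):                                    # x: (batch, in_dim)
        weights = torch.softmax(self.gate(x), dim=-1)        # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], 1)  # (batch, E, style_dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)     # weighted style embedding

print(StyleMoE()(torch.randn(8, 256)).shape)  # torch.Size([8, 128])
```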
no code implementations • 3 Jun 2024 • Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans
With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated.
no code implementations • 18 May 2024 • Shreeram Suresh Chandra, Zongyang Du, Berrak Sisman
Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain.
no code implementations • 2 May 2024 • Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman
A major challenge of expressive VC lies in emotion prosody modeling.
1 code implementation • 21 Mar 2024 • Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Bjorn W. Schuller, Carlos Busso
This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance.
Ranked #1 on Speech Emotion Recognition on MSP-IMPROV
no code implementations • 19 Jan 2024 • Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman
In order to leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
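As a rough illustration of contrastive pretraining on unlabeled speech, the sketch below uses an InfoNCE-style objective over two views of each utterance; the paper's actual construction of positives and negatives for emotion may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two views of the same N utterances."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature    # (N, N) pairwise similarities
    targets = torch.arange(z1.size(0))  # the matching view is the positive
    return F.cross_entropy(logits, targets)

print(info_nce(torch.randn(32, 128), torch.randn(32, 128)).item())
```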
no code implementations • 29 Jun 2023 • Junchen Lu, Berrak Sisman, Mingyang Zhang, Haizhou Li
The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video given its text script.
1 code implementation • 23 May 2023 • Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Björn Schuller
In contrast to previous studies, we refrain from imposing constraints on the layer order of the CNN within the DARTS cell; instead, we allow DARTS to determine the optimal layer order autonomously (a minimal sketch of the underlying mixed operation follows below).
Ranked #2 on Speech Emotion Recognition on IEMOCAP (UA metric)
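For readers unfamiliar with DARTS, the sketch below shows the mixed operation at its core: all candidate layers run in parallel and are blended by learnable architecture weights, so the search, not the designer, determines which layer (and effectively which ordering) wins. The candidate set here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Candidate layers run in parallel, blended by architecture weights."""
    def __init__(self, channels=64):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.Conv1d(channels, channels, 5, padding=2),
            nn.MaxPool1d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture params

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

print(MixedOp()(torch.randn(2, 64, 100)).shape)  # torch.Size([2, 64, 100])
```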
no code implementations • 12 May 2023 • Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso
This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems for emotion regression or emotion classification tasks.
Ranked #1 on Arousal Estimation on MSP-IMPROV
no code implementations • 14 Nov 2022 • Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans
Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary.
1 code implementation • 7 Nov 2022 • Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans
Accent plays a significant role in speech communication, influencing one's capability to understand as well as conveying a person's identity.
no code implementations • 25 Oct 2022 • Kun Zhou, Berrak Sisman, Carlos Busso, Bin Ma, Haizhou Li
To achieve this, we propose a novel EVC framework, Mixed-EVC, which only leverages discrete emotion training labels.
no code implementations • 22 Sep 2022 • Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman
In this work, we seek to answer the question: what are the effects of selected sparsity techniques on performance and model complexity?
no code implementations • 22 Sep 2022 • Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li
Accented TTS synthesis is challenging, as L2 differs from L1 in terms of both phonetic rendering and prosody pattern.
no code implementations • 11 Aug 2022 • Kun Zhou, Berrak Sisman, Rajib Rana, B. W. Schuller, Haizhou Li
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
1 code implementation • 15 Jun 2022 • Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li
In this paper, we propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
no code implementations • 10 Jan 2022 • Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, Haizhou Li
As desired, the proposed network controls the fine-grained emotion intensity in the output speech.
no code implementations • 20 Oct 2021 • Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li
Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style.
no code implementations • 13 Oct 2021 • Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li
At the same time, as the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
no code implementations • 7 Oct 2021 • Junchen Lu, Berrak Sisman, Rui Liu, Mingyang Zhang, Haizhou Li
The proposed VisualTTS adopts two novel mechanisms: 1) textual-visual attention, and 2) a visual fusion strategy during acoustic decoding, both of which contribute to accurate alignment between the input text content and the lip motion in the input lip sequence.
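The textual-visual attention mechanism can be approximated with a standard cross-attention layer, as in the minimal sketch below; the module and dimensions are stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

text_feats = torch.randn(2, 50, 256)   # hypothetical phoneme encodings
lip_feats = torch.randn(2, 120, 256)   # hypothetical per-frame lip features

# Text queries attend over visual keys/values, yielding lip-aligned context
# for the acoustic decoder.
fused, weights = attn(query=text_feats, key=lip_feats, value=lip_feats)
print(fused.shape, weights.shape)  # (2, 50, 256) (2, 50, 120)
```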
1 code implementation • 7 Oct 2021 • Rui Liu, Berrak Sisman, Haizhou Li
The emotion strength of synthesized speech can be controlled flexibly using a strength descriptor, which is obtained by an emotion attribute ranking function.
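A strength descriptor of this kind can be learned with a pairwise ranking objective, as in the minimal sketch below; the network, features, and margin are illustrative assumptions, not the paper's exact attribute ranking function.

```python
import torch
import torch.nn as nn

ranker = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 1))
margin_loss = nn.MarginRankingLoss(margin=0.5)

stronger = torch.randn(16, 80)  # hypothetical features of stronger utterances
weaker = torch.randn(16, 80)    # their paired weaker counterparts
target = torch.ones(16, 1)      # +1: the first input should score higher

# After training, ranker(x) yields a scalar usable as a strength descriptor.
loss = margin_loss(ranker(stronger), ranker(weaker), target)
print(loss.item())
```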
no code implementations • 8 Jul 2021 • Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li
Traditional voice conversion (VC) has been focused on speaker identity conversion for speech with a neutral expression.
1 code implementation • 31 May 2021 • Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li
In this paper, we first provide a review of the state-of-the-art emotional voice conversion research, and the existing emotional speech databases.
no code implementations • 3 Apr 2021 • Rui Liu, Berrak Sisman, Haizhou Li
To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
2 code implementations • 31 Mar 2021 • Kun Zhou, Berrak Sisman, Haizhou Li
In stage 2, we perform emotion training with a limited amount of emotional speech data, to learn how to disentangle emotional style and linguistic information from the speech.
no code implementations • 3 Nov 2020 • Kun Zhou, Berrak Sisman, Haizhou Li
Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity.
2 code implementations • 28 Oct 2020 • Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
no code implementations • 23 Oct 2020 • Rui Liu, Berrak Sisman, Haizhou Li
Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways.
no code implementations • 11 Aug 2020 • Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
We propose a multi-task learning scheme for Tacotron training, that optimizes the system to predict both Mel spectrum and phrase breaks.
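A minimal sketch of such a multi-task objective, assuming an L1 Mel loss plus a per-token phrase-break classifier with an assumed task weighting:

```python
import torch
import torch.nn.functional as F

mel_pred, mel_true = torch.randn(4, 200, 80), torch.randn(4, 200, 80)
break_logits = torch.randn(4, 50, 2)           # per-token break / no-break scores
break_labels = torch.randint(0, 2, (4, 50))

mel_loss = F.l1_loss(mel_pred, mel_true)
break_loss = F.cross_entropy(break_logits.transpose(1, 2), break_labels)
total = mel_loss + 0.1 * break_loss            # assumed task weighting
print(total.item())
```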
no code implementations • 11 Aug 2020 • Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li
It relies on non-parallel training data from two different languages; hence, it is more challenging than mono-lingual voice conversion.
no code implementations • 10 Aug 2020 • Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li
We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.
1 code implementation • 13 May 2020 • Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li
We consider that there is a common code among speakers for emotional expression in a spoken language; therefore, a speaker-independent mapping between emotional states is possible.
no code implementations • 2 Feb 2020 • Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, with two loss functions: 1) a time-domain loss, denoted as the waveform loss, that measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss that measures the Mel-scale acoustic feature loss between the natural and generated acoustic features.
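The two objectives can be sketched as follows, assuming torchaudio for the Mel-scale transform; the transform settings and equal loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

natural = torch.randn(1, 22050)    # hypothetical 1-second waveforms
generated = torch.randn(1, 22050)

waveform_loss = F.l1_loss(generated, natural)             # time-domain loss
frequency_loss = F.l1_loss(mel(generated), mel(natural))  # Mel-scale loss
print((waveform_loss + frequency_loss).item())
```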
1 code implementation • 1 Feb 2020 • Kun Zhou, Berrak Sisman, Haizhou Li
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
no code implementations • 7 Nov 2019 • Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, Haizhou Li
We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher.
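A toy sketch of the teacher-forcing distinction at the heart of this setup, using a GRU cell as a hypothetical stand-in for the Tacotron2 decoder:

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(80, 256)     # toy stand-in for the Tacotron2 decoder
proj = nn.Linear(256, 80)

frames = torch.randn(200, 80)  # hypothetical natural Mel frames
h = torch.zeros(1, 256)
prev = torch.zeros(1, 80)

teacher_forcing = True  # True: teacher model; False: free-running inference
for t in range(frames.size(0)):
    h = cell(prev, h)
    pred = proj(h)
    # The teacher always conditions the next step on the natural frame.
    prev = frames[t : t + 1] if teacher_forcing else pred
print(pred.shape)  # torch.Size([1, 80])
```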
no code implementations • 27 May 2019 • Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, Satoshi Nakamura
Our proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.