no code implementations • 12 Feb 2024 • Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.
no code implementations • 4 Sep 2023 • Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman
In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction.
no code implementations • 13 Jul 2023 • Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova
We show that this technique significantly closes the gap to methods that require explicit recordings.
no code implementations • 20 Jun 2023 • Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman
We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT.
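The 46.7% figure follows the usual gap-reduction convention: the fraction of the baseline-to-human gap that the proposed system closes. A minimal sketch of that arithmetic, using hypothetical MUSHRA means rather than values from the paper:

```python
# Minimal sketch of the naturalness gap-reduction arithmetic.
# The MUSHRA means below are hypothetical placeholders, not the paper's data.
def gap_reduction(baseline: float, proposed: float, human: float) -> float:
    """Percentage of the baseline-to-human gap closed by the proposed system."""
    return 100.0 * (proposed - baseline) / (human - baseline)

print(gap_reduction(baseline=70.0, proposed=74.2, human=79.0))  # -> ~46.7
```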
no code implementations • 2 Jul 2022 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek
We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.
no code implementations • 29 Jun 2022 • Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou
In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS.
no code implementations • 28 Jun 2022 • Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman
First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses.
no code implementations • 27 Jun 2022 • Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers.
no code implementations • 13 Feb 2022 • Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova
This paper presents a novel data augmentation technique for text-to-speech (TTS) that generates new (text, audio) training examples without requiring any additional data.
no code implementations • 29 Jun 2021 • Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman
We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody.
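A minimal sketch of extracting spectrograms at several time-frequency resolutions, assuming librosa; the FFT/hop scales and mel settings below are illustrative, not the paper's configuration:

```python
# Multi-resolution mel-spectrogram extraction at three illustrative scales.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), sr=24000)

scales = [(2048, 512), (1024, 256), (512, 128)]  # (n_fft, hop_length) pairs
mel_specs = [
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=80)
    for n_fft, hop in scales
]
for (n_fft, hop), m in zip(scales, mel_specs):
    print(f"n_fft={n_fft}, hop={hop} -> mel shape {m.shape}")
```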
1 code implementation • 16 Jun 2021 • Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
no code implementations • 14 Jun 2021 • Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo Trueba, Thomas Drugman
Many factors influence speech yielding different renditions of a given sentence.
no code implementations • 7 Jun 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek
Training this model does not require phonetically transcribed L2 speech; we only need to mark mispronounced words.
no code implementations • 16 Jan 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.
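A minimal sketch of that recognize-and-compare idea: align the recognized phoneme sequence against the expected one and flag the differences. The sequences are toy placeholders; a real system would take them from an ASR front end and a pronunciation lexicon:

```python
# Align recognized phonemes against the expected pronunciation and report errors.
from difflib import SequenceMatcher

expected   = ["DH", "IH", "S", "IH", "Z", "AH", "T", "EH", "S", "T"]
recognized = ["DH", "IH", "S", "IH", "S", "AH", "T", "EH", "S", "T"]

for tag, i1, i2, j1, j2 in SequenceMatcher(a=expected, b=recognized).get_opcodes():
    if tag != "equal":
        print(f"{tag}: expected {expected[i1:i2]} -> produced {recognized[j1:j2]}")
# -> replace: expected ['Z'] -> produced ['S']
```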
no code implementations • 14 Jan 2021 • Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba
In this work we propose EmoCat, a language-agnostic emotional voice conversion model.
no code implementations • 29 Dec 2020 • Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS).
no code implementations • 4 Nov 2020 • Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman
In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text.
no code implementations • 7 Jun 2020 • Benjamin Picart, Thomas Drugman, Thierry Dutoit
This paper focuses on the analysis and synthesis of hypo- and hyperarticulated speech in the framework of HMM-based speech synthesis.
no code implementations • 7 Jun 2020 • Thomas Drugman
The proposed method is shown to significantly increase the sparsity of the LP residual signal and to be effective in two illustrative applications: speech polarity detection and excitation modeling.
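A minimal sketch of measuring residual sparsity, assuming librosa for standard LPC; it illustrates the measurement only, not the paper's sparse linear prediction method itself:

```python
# Compute an LP residual by inverse filtering and use kurtosis as a sparsity proxy.
import librosa
from scipy.signal import lfilter
from scipy.stats import kurtosis

y, sr = librosa.load(librosa.ex("libri1"), sr=16000, duration=2.0)

a = librosa.lpc(y, order=18)      # LP coefficients [1, a1, ..., ap]
residual = lfilter(a, [1.0], y)   # inverse filtering: e[n] = A(z) * s[n]

# Higher kurtosis => more peaked, sparser residual.
print("residual kurtosis:", kurtosis(residual))
```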
no code implementations • 7 Jun 2020 • Onur Babacan, Thomas Drugman, Tuomo Raitio, Daniel Erro, Thierry Dutoit
Various parametric representations have been proposed to model the speech signal.
no code implementations • 31 May 2020 • Thomas Drugman
Detecting the correct speech polarity is a necessary step prior to several speech processing techniques.
no code implementations • 31 May 2020 • Thomas Drugman, John Kane, Christer Gobl
This paper investigates the temporal excitation patterns of creaky voice.
no code implementations • 31 May 2020 • Thomas Drugman, Yannis Stylianou
Recent studies have shown that proper estimation and modeling of the maximum voiced frequency (MVF) enhance the quality of statistical parametric speech synthesizers.
no code implementations • 24 May 2020 • Thomas Drugman, Thomas Dubuisson, Alexis Moinet, Nicolas D'Alessandro, Thierry Dutoit
This paper addresses the problem of estimating the voice source directly from speech waveforms.
no code implementations • 16 May 2020 • Thomas Drugman, Thierry Dutoit
An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing.
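A minimal sketch of one simple polarity heuristic based on waveform asymmetry (skewness); this is in the spirit of skewness-based detectors but is not a reimplementation of the paper's method, and the sign convention is an assumption:

```python
# Heuristic polarity check: the sign of the waveform skewness.
import librosa
from scipy.stats import skew

def detect_polarity(y) -> int:
    """Return +1 if waveform skewness suggests positive polarity, else -1 (assumed convention)."""
    return 1 if skew(y) >= 0 else -1

y, sr = librosa.load(librosa.ex("libri2"), sr=16000)
print("estimated polarity:", detect_polarity(y))
```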
no code implementations • 16 May 2020 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit
In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT).
no code implementations • 10 May 2020 • Thomas Drugman, Thierry Dutoit
It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech.
no code implementations • 2 Jan 2020 • Thomas Drugman, Thomas Dubuisson, Thierry Dutoit
This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal.
no code implementations • 2 Jan 2020 • Thomas Drugman, Thomas Dubuisson, Thierry Dutoit
In most current approaches of speech processing, information is extracted from the magnitude spectrum.
no code implementations • 2 Jan 2020 • Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit
Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices.
no code implementations • 2 Jan 2020 • Thomas Drugman, Thierry Dutoit
This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system.
no code implementations • 2 Jan 2020 • Thomas Drugman, Thierry Dutoit, Baris Bozkurt
This paper investigates the differences occurring in the excitation for different voice qualities.
no code implementations • 30 Dec 2019 • Thomas Drugman, Alexis Moinet, Thierry Dutoit, Geoffrey Wilfart
The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input.
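A minimal sketch of the frame-selection step: for each target residual coefficient vector, pick the closest codebook excitation frame under a Euclidean criterion. The codebook and targets are random placeholders, and the L2 criterion is an illustrative choice:

```python
# Nearest-neighbour selection of excitation frames from a codebook.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 40))  # 256 stored excitation frames
targets = rng.standard_normal((10, 40))    # target residual coefficients

# Pairwise Euclidean distances, then the closest codebook entry per target.
dists = np.linalg.norm(targets[:, None, :] - codebook[None, :, :], axis=-1)
selected = dists.argmin(axis=1)

excitation = np.concatenate([codebook[i] for i in selected])
print("selected frame indices:", selected)
```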
no code implementations • 30 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit
Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met.
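A minimal sketch of the causal/anticausal split in the complex cepstrum domain; phase unwrapping and linear-phase handling are simplified here, and the paper's specific windowing criteria (e.g., GCI-synchronous windows) are not applied:

```python
# Simplified complex cepstrum and its causal/anticausal partition.
import numpy as np

def complex_cepstrum(frame: np.ndarray) -> np.ndarray:
    spec = np.fft.fft(frame)
    log_spec = np.log(np.abs(spec) + 1e-12) + 1j * np.unwrap(np.angle(spec))
    return np.fft.ifft(log_spec).real

rng = np.random.default_rng(1)
frame = np.hanning(512) * rng.standard_normal(512)  # placeholder frame, not real speech
cc = complex_cepstrum(frame)

n = len(cc)
causal = np.zeros(n); causal[: n // 2] = cc[: n // 2]          # minimum-phase part
anticausal = np.zeros(n); anticausal[n // 2 :] = cc[n // 2 :]  # maximum-phase part
# In mixed-phase analysis, the anticausal part is associated with the glottal open phase.
```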
no code implementations • 29 Dec 2019 • Thomas Drugman, Thierry Dutoit
The applicability of the Deterministic plus Stochastic Model (DSM) in two fields of speech processing is then studied.
no code implementations • 29 Dec 2019 • Thomas Drugman, Paavo Alku, Abeer Alwan, Bayya Yegnanarayana
The great majority of current voice technology applications relies on acoustic features characterizing the vocal tract response, such as the widely used MFCC or LPC parameters.
no code implementations • 29 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit
Homomorphic analysis is a well-known method for the separation of non-linearly combined signals.
no code implementations • 29 Dec 2019 • Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit
For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual.
no code implementations • 28 Dec 2019 • Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, Thierry Dutoit
The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA).
no code implementations • 28 Dec 2019 • Thomas Drugman, Abeer Alwan
This paper focuses on the problem of pitch tracking in noisy conditions.
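A minimal sketch of a baseline frame-level autocorrelation pitch estimator, included only to illustrate the task; it is not the noise-robust tracker the paper proposes:

```python
# Estimate F0 for one frame from the first autocorrelation peak in a plausible lag range.
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(int(0.04 * sr)) / sr
frame = np.sin(2 * np.pi * 120.0 * t)  # synthetic 120 Hz frame
print(f"estimated F0: {estimate_f0(frame, sr):.1f} Hz")  # ~120 Hz
```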
no code implementations • 28 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit
Techniques based on the mixed-phase decomposition and on a closed-phase inverse filtering process turn out to give the best results on both clean synthetic and real speech signals.
no code implementations • 28 Dec 2019 • Thomas Drugman, Thierry Dutoit
This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms.
no code implementations • 12 Dec 2019 • Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman
We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions.
no code implementations • 11 Dec 2019 • Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet
We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech.
no code implementations • 2 Dec 2019 • Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba
Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human capabilities when considering isolated sentences.
no code implementations • 10 Jul 2019 • Daniel Korzekwa, Roberto Barra-Chicote, Bozena Kostek, Thomas Drugman, Mateusz Lajszczak
This paper proposes a novel approach for the detection and reconstruction of dysarthric speech.
no code implementations • 4 Jul 2019 • Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, Thomas Drugman
However, when trained on a single-speaker dataset, the conventional prosody transfer systems are not robust enough to speaker variability, especially in the case of a reference signal coming from an unseen speaker.
1 code implementation • 4 Jul 2019 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal
This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality.
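A minimal sketch of the relative-MUSHRA arithmetic quoted above (the system mean expressed as a percentage of the natural-speech mean); the score arrays are hypothetical placeholders, not the paper's data:

```python
# Relative MUSHRA: system mean score as a percentage of the natural-speech mean.
import numpy as np

vocoder_scores = np.array([76.0, 81.0, 78.5])  # hypothetical per-condition means
natural_scores = np.array([79.0, 82.0, 80.0])

relative_mushra = 100.0 * vocoder_scores.mean() / natural_scores.mean()
print(f"relative MUSHRA: {relative_mushra:.1f}%")  # ~97.7%
```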
1 code implementation • NAACL 2019 • Nishant Prateek, Mateusz Łajszczak, Roberto Barra-Chicote, Thomas Drugman, Jaime Lorenzo-Trueba, Thomas Merritt, Srikanth Ronanki, Trevor Wood
Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require a large quantity of training data.
no code implementations • 7 Mar 2019 • Thomas Drugman, Janne Pylkkonen, Reinhard Kneser
The goal of this paper is to simulate the benefits of jointly applying active learning (AL) and semi-supervised training (SST) in a new speech recognition application.
no code implementations • 4 Mar 2019 • Thomas Drugman, Goeric Huybrechts, Viacheslav Klimkov, Alexis Moinet
In this paper, we consider voicing detection as a classification problem and F0 contour estimation as a regression problem.
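A minimal sketch of this two-task framing using scikit-learn: a classifier for the voiced/unvoiced decision and a regressor for F0 on voiced frames. Features, labels, and the random-forest models are placeholders and assumptions, not the paper's architecture:

```python
# Voicing as classification, F0 as regression, trained on placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))      # per-frame acoustic features (placeholder)
voiced = rng.integers(0, 2, size=1000)   # voiced/unvoiced labels (placeholder)
f0 = np.where(voiced == 1, rng.uniform(60, 300, 1000), 0.0)

clf = RandomForestClassifier(n_estimators=50).fit(X, voiced)
reg = RandomForestRegressor(n_estimators=50).fit(X[voiced == 1], f0[voiced == 1])

# Inference: predict voicing first, regress F0 only where frames are voiced.
pred_voiced = clf.predict(X[:5])
pred_f0 = np.where(pred_voiced == 1, reg.predict(X[:5]), 0.0)
print(pred_f0)
```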
8 code implementations • 15 Nov 2018 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote
This paper introduces a robust universal neural vocoder trained with 74 speakers (of both genders) coming from 17 languages.
no code implementations • 15 Nov 2018 • Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Klimkov Viacheslav
Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings.
no code implementations • 20 Sep 2018 • Zeynab Raeesy, Kellen Gillespie, Zhenpei Yang, Chengyuan Ma, Thomas Drugman, Jiacheng Gu, Roland Maas, Ariya Rastrow, Björn Hoffmeister
We show that, with enough data, the LSTM model is indeed as capable of learning whisper characteristics from LFBE features alone as a simpler MLP model that uses both LFBE and features engineered for separating whisper and normal speech.