Search Results for author: Thomas Drugman

Found 54 papers, 4 papers with code

Robust universal neural vocoding

8 code implementations • 15 Nov 2018 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote

This paper introduces a robust universal neural vocoder trained with 74 speakers (comprised of both genders) coming from 17 languages.

Towards achieving robust universal neural vocoding

1 code implementation • 4 Jul 2019 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal

This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality.
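The "98% relative mean MUSHRA" figure can be made concrete with a small sketch. This assumes the common definition of a relative score (the system's mean listener rating divided by the natural reference's mean rating, as a percentage); the function name and ratings below are illustrative, not from the paper.

```python
# Sketch: relative mean MUSHRA, assuming it is defined as the system's
# mean rating over the natural reference's mean rating, in percent.
def relative_mushra(system_scores, natural_scores):
    system_mean = sum(system_scores) / len(system_scores)
    natural_mean = sum(natural_scores) / len(natural_scores)
    return 100.0 * system_mean / natural_mean

# Hypothetical listener ratings on the 0-100 MUSHRA scale.
print(relative_mushra([72, 80, 76], [78, 82, 80]))  # → 95.0
```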

Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

1 code implementation • 16 Jun 2021 • Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman

Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.

Voice Conversion

LSTM-based Whisper Detection

no code implementations • 20 Sep 2018 • Zeynab Raeesy, Kellen Gillespie, Zhenpei Yang, Chengyuan Ma, Thomas Drugman, Jiacheng Gu, Roland Maas, Ariya Rastrow, Björn Hoffmeister

We prove that, with enough data, the LSTM model is indeed as capable of learning whisper characteristics from LFBE features alone as a simpler MLP model that uses both LFBE and features engineered for separating whisper and normal speech.

Benchmarking

Effect of data reduction on sequence-to-sequence neural TTS

no code implementations • 15 Nov 2018 • Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Viacheslav Klimkov

Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings.

Speech Synthesis

Traditional Machine Learning for Pitch Detection

no code implementations • 4 Mar 2019 • Thomas Drugman, Goeric Huybrechts, Viacheslav Klimkov, Alexis Moinet

In this paper, we consider voicing detection as a classification problem and F0 contour estimation as a regression problem.

BIG-bench Machine Learning • Clustering +1
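The two-problem formulation in this entry (voicing as classification, F0 contour as regression) can be sketched in a few lines. The energy threshold and the autocorrelation-peak "regressor" below are illustrative stand-ins for the paper's trained models, not its actual features or learners.

```python
# Sketch of the two-stage formulation: a voicing classifier decides
# voiced/unvoiced per frame, then an F0 estimator runs on voiced frames.
def detect_pitch(frames, sample_rate=16000):
    contour = []
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        if energy <= 0.01:              # stand-in voicing classifier
            contour.append(0.0)         # unvoiced frames get F0 = 0
            continue
        # Stand-in F0 "regressor": autocorrelation peak in 60-400 Hz.
        best_lag, best_r = 0, 0.0
        for lag in range(sample_rate // 400, sample_rate // 60):
            r = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
            if r > best_r:
                best_lag, best_r = lag, r
        contour.append(sample_rate / best_lag if best_lag else 0.0)
    return contour
```

A real system would replace both stand-ins with the trained classifier and regressor operating on richer frame features.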

Active and Semi-Supervised Learning in ASR: Benefits on the Acoustic and Language Models

no code implementations • 7 Mar 2019 • Thomas Drugman, Janne Pylkkonen, Reinhard Kneser

The goal of this paper is to simulate the benefits of jointly applying active learning (AL) and semi-supervised training (SST) in a new speech recognition application.

Active Learning • speech-recognition +1

Fine-grained robust prosody transfer for single-speaker neural text-to-speech

no code implementations • 4 Jul 2019 • Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, Thomas Drugman

However, when trained on a single-speaker dataset, the conventional prosody transfer systems are not robust enough to speaker variability, especially in the case of a reference signal coming from an unseen speaker.

Dynamic Prosody Generation for Speech Synthesis using Linguistics-Driven Acoustic Embedding Selection

no code implementations • 2 Dec 2019 • Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba

Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human capabilities when considering isolated sentences.

Speech Synthesis

Voice Conversion for Whispered Speech Synthesis

no code implementations • 11 Dec 2019 • Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech.

Speech Synthesis • Voice Conversion

Singing Synthesis: with a little help from my attention

no code implementations • 12 Dec 2019 • Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions.

Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/Frame Selection Speech Synthesis

no code implementations • 30 Dec 2019 • Thomas Drugman, Alexis Moinet, Thierry Dutoit, Geoffrey Wilfart

The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input.

Speech Synthesis

Causal-Anticausal Decomposition of Speech using Complex Cepstrum for Glottal Source Estimation

no code implementations • 30 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met.

Glottal Source Processing: from Analysis to Applications

no code implementations • 29 Dec 2019 • Thomas Drugman, Paavo Alku, Abeer Alwan, Bayya Yegnanarayana

The great majority of current voice technology applications relies on acoustic features characterizing the vocal tract response, such as the widely used MFCC or LPC parameters.

Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation

no code implementations • 29 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Homomorphic analysis is a well-known method for the separation of non-linearly combined signals.

Phase-based Information for Voice Pathology Detection

no code implementations • 2 Jan 2020 • Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

In most current approaches of speech processing, information is extracted from the magnitude spectrum.

Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics

no code implementations • 28 Dec 2019 • Thomas Drugman, Abeer Alwan

This paper focuses on the problem of pitch tracking in noisy conditions.
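This entry's method scores F0 candidates by summing the residual's harmonic amplitudes while penalizing inter-harmonic amplitudes. The sketch below implements that scoring criterion over a precomputed amplitude spectrum; treating the input as the residual spectrum, and the exact bin lookup, are simplifying assumptions.

```python
# Sketch of a summation-of-residual-harmonics style criterion: each F0
# candidate is scored by its harmonic amplitudes minus the amplitudes
# halfway between harmonics, and the best-scoring candidate wins.
def srh_pitch(amplitude_spectrum, bin_hz, f0_min=60, f0_max=400, n_harm=5):
    def amp(freq_hz):
        idx = int(round(freq_hz / bin_hz))
        return amplitude_spectrum[idx] if idx < len(amplitude_spectrum) else 0.0

    best_f0, best_score = 0.0, float("-inf")
    for f0 in range(f0_min, f0_max + 1):
        score = amp(f0)
        for k in range(2, n_harm + 1):
            # Reward energy at harmonics, penalize inter-harmonic energy
            # (which suppresses octave errors at f0/2).
            score += amp(k * f0) - amp((k - 0.5) * f0)
        if score > best_score:
            best_f0, best_score = float(f0), score
    return best_f0
```

The inter-harmonic penalty is what makes the criterion robust: a subharmonic candidate places "harmonics" between the true peaks and gets penalized for it.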

Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review

no code implementations • 28 Dec 2019 • Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, Thierry Dutoit

The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA).

Event Detection

On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

no code implementations • 2 Jan 2020 • Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal.

Excitation-based Voice Quality Analysis and Modification

no code implementations • 2 Jan 2020 • Thomas Drugman, Thierry Dutoit, Baris Bozkurt

This paper investigates the differences occurring in the excitation for different voice qualities.

Speech Synthesis

Eigenresiduals for improved Parametric Speech Synthesis

no code implementations • 2 Jan 2020 • Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices.

Speech Synthesis

A Comparative Evaluation of Pitch Modification Techniques

no code implementations • 2 Jan 2020 • Thomas Drugman, Thierry Dutoit

This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system.

A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis

no code implementations • 29 Dec 2019 • Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual.

Speech Synthesis

A Comparative Study of Glottal Source Estimation Techniques

no code implementations • 28 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Techniques based on the mixed-phase decomposition and on a closed-phase inverse filtering process turn out to give the best results on both clean synthetic and real speech signals.

Glottal Closure and Opening Instant Detection from Speech Signals

no code implementations • 28 Dec 2019 • Thomas Drugman, Thierry Dutoit

This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms.

Position

Chirp Complex Cepstrum-based Decomposition for Asynchronous Glottal Analysis

no code implementations • 10 May 2020 • Thomas Drugman, Thierry Dutoit

It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech.

Oscillating Statistical Moments for Speech Polarity Detection

no code implementations • 16 May 2020 • Thomas Drugman, Thierry Dutoit

An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing.

Glottal Source Estimation using an Automatic Chirp Decomposition

no code implementations • 16 May 2020 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT).

Data-driven Detection and Analysis of the Patterns of Creaky Voice

no code implementations • 31 May 2020 • Thomas Drugman, John Kane, Christer Gobl

This paper investigates the temporal excitation patterns of creaky voice.

Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra

no code implementations • 31 May 2020 • Thomas Drugman, Yannis Stylianou

Recent studies have shown that its proper estimation and modeling enhance the quality of statistical parametric speech synthesizers.

Residual Excitation Skewness for Automatic Speech Polarity Detection

no code implementations • 31 May 2020 • Thomas Drugman

Detecting the correct speech polarity is a necessary step prior to several speech processing techniques.
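The idea in this entry (inferring polarity from the asymmetry of the residual excitation) reduces to a sign check on a skewness statistic. The sketch below computes plain sample skewness on any 1-D signal; using it directly, rather than on a linear-prediction residual as the paper does, is a simplification.

```python
# Sketch: speech polarity from the skewness of an excitation-like
# signal. The paper uses the LP residual; here plain sample skewness
# on the raw signal stands in for that step.
def polarity(signal):
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n   # variance
    m3 = sum((x - mean) ** 3 for x in signal) / n   # third central moment
    skewness = m3 / (m2 ** 1.5)
    return 1 if skewness >= 0 else -1   # +1: positive polarity
```

Glottal excitation pulses point predominantly in one direction, so an inverted recording simply flips the sign of the skewness.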

Maximum Phase Modeling for Sparse Linear Prediction of Speech

no code implementations • 7 Jun 2020 • Thomas Drugman

The proposed method is shown to significantly increase the sparsity of the LP residual signal and to be effective in two illustrative applications: speech polarity detection and excitation modeling.

Analysis and Synthesis of Hypo and Hyperarticulated Speech

no code implementations • 7 Jun 2020 • Benjamin Picart, Thomas Drugman, Thierry Dutoit

This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis.

Speech Synthesis

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

no code implementations • 4 Nov 2020 • Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text.

Graph Attention • Representation Learning +2

Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling

no code implementations • 16 Jan 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.

Automatic Phoneme Recognition • Sentence +1

Weakly-supervised word-level pronunciation error detection in non-native English speech

no code implementations • 7 Jun 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words.

Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

no code implementations • 29 Dec 2020 • Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS).

Data Augmentation

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

no code implementations • 29 Jun 2021 • Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangen, Sri Karlapati, Thomas Drugman

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody.

Sentence

Distribution augmentation for low-resource expressive text-to-speech

no code implementations • 13 Feb 2022 • Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows new (text, audio) training examples to be generated without requiring any additional data.

Data Augmentation

CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

no code implementations • 27 Jun 2022 • Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers.

Expressive, Variable, and Controllable Duration Modelling in TTS

no code implementations • 28 Jun 2022 • Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses.

Normalising Flows • Speech Synthesis

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

no code implementations • 29 Jun 2022 • Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS.

Language Modelling • Sentence

Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

no code implementations • 2 Jul 2022 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek

We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.

Speech Synthesis

eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

no code implementations • 20 Jun 2023 • Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT.

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

no code implementations • 4 Sep 2023 • Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman

In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction.

Natural Language Understanding • Prosody Prediction

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

no code implementations • 12 Feb 2024 • Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.

Disentanglement
