Search Results for author: Thomas Drugman

Found 54 papers, 4 papers with code

Robust universal neural vocoding

8 code implementations • 15 Nov 2018 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote

This paper introduces a robust universal neural vocoder trained with 74 speakers (comprised of both genders) coming from 17 languages.

Towards achieving robust universal neural vocoding

1 code implementation • 4 Jul 2019 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal

This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality.
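The "98% relative mean MUSHRA" figure can be made concrete with a small sketch. This assumes the common definition of a relative score (the system's mean listener rating divided by the natural reference's mean rating, as a percentage); the function name and ratings below are illustrative, not from the paper.

```python
# Sketch: relative mean MUSHRA, assuming it is defined as the system's
# mean rating over the natural reference's mean rating, in percent.
def relative_mushra(system_scores, natural_scores):
    system_mean = sum(system_scores) / len(system_scores)
    natural_mean = sum(natural_scores) / len(natural_scores)
    return 100.0 * system_mean / natural_mean

# Hypothetical listener ratings on the 0-100 MUSHRA scale.
print(relative_mushra([72, 80, 76], [78, 82, 80]))  # → 95.0
```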

Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

1 code implementation • 16 Jun 2021 • Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman

Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.

Voice Conversion

LSTM-based Whisper Detection

no code implementations • 20 Sep 2018 • Zeynab Raeesy, Kellen Gillespie, Zhenpei Yang, Chengyuan Ma, Thomas Drugman, Jiacheng Gu, Roland Maas, Ariya Rastrow, Björn Hoffmeister

We prove that, with enough data, the LSTM model is indeed as capable of learning whisper characteristics from LFBE features alone as a simpler MLP model that uses both LFBE and features engineered for separating whisper and normal speech.

Benchmarking

Effect of data reduction on sequence-to-sequence neural TTS

no code implementations • 15 Nov 2018 • Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Viacheslav Klimkov

Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings.

Speech Synthesis

Traditional Machine Learning for Pitch Detection

no code implementations • 4 Mar 2019 • Thomas Drugman, Goeric Huybrechts, Viacheslav Klimkov, Alexis Moinet

In this paper, we consider voicing detection as a classification problem and F0 contour estimation as a regression problem.

BIG-bench Machine Learning • Clustering +1
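The two-problem formulation in this entry (voicing as classification, F0 contour as regression) can be sketched in a few lines. The energy threshold and the autocorrelation-peak "regressor" below are illustrative stand-ins for the paper's trained models, not its actual features or learners.

```python
# Sketch of the two-stage formulation: a voicing classifier decides
# voiced/unvoiced per frame, then an F0 estimator runs on voiced frames.
def detect_pitch(frames, sample_rate=16000):
    contour = []
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        if energy <= 0.01:              # stand-in voicing classifier
            contour.append(0.0)         # unvoiced frames get F0 = 0
            continue
        # Stand-in F0 "regressor": autocorrelation peak in 60-400 Hz.
        best_lag, best_r = 0, 0.0
        for lag in range(sample_rate // 400, sample_rate // 60):
            r = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
            if r > best_r:
                best_lag, best_r = lag, r
        contour.append(sample_rate / best_lag if best_lag else 0.0)
    return contour
```

A real system would replace both stand-ins with the trained classifier and regressor operating on richer frame features.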

Active and Semi-Supervised Learning in ASR: Benefits on the Acoustic and Language Models

no code implementations • 7 Mar 2019 • Thomas Drugman, Janne Pylkkonen, Reinhard Kneser

The goal of this paper is to simulate the benefits of jointly applying active learning (AL) and semi-supervised training (SST) in a new speech recognition application.

Active Learning • speech-recognition +1

Fine-grained robust prosody transfer for single-speaker neural text-to-speech

no code implementations • 4 Jul 2019 • Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, Thomas Drugman

However, when trained on a single-speaker dataset, the conventional prosody transfer systems are not robust enough to speaker variability, especially in the case of a reference signal coming from an unseen speaker.

Dynamic Prosody Generation for Speech Synthesis using Linguistics-Driven Acoustic Embedding Selection

no code implementations • 2 Dec 2019 • Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba

Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human capabilities when considering isolated sentences.

Speech Synthesis

Voice Conversion for Whispered Speech Synthesis

no code implementations • 11 Dec 2019 • Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech.

Speech Synthesis • Voice Conversion

Singing Synthesis: with a little help from my attention

no code implementations • 12 Dec 2019 • Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions.

Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/Frame Selection Speech Synthesis

no code implementations • 30 Dec 2019 • Thomas Drugman, Alexis Moinet, Thierry Dutoit, Geoffrey Wilfart

The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input.

Speech Synthesis

Causal-Anticausal Decomposition of Speech using Complex Cepstrum for Glottal Source Estimation

no code implementations • 30 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met.

Glottal Source Processing: from Analysis to Applications

no code implementations • 29 Dec 2019 • Thomas Drugman, Paavo Alku, Abeer Alwan, Bayya Yegnanarayana

The great majority of current voice technology applications relies on acoustic features characterizing the vocal tract response, such as the widely used MFCC or LPC parameters.

Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation

no code implementations • 29 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Homomorphic analysis is a well-known method for the separation of non-linearly combined signals.

Phase-based Information for Voice Pathology Detection

no code implementations • 2 Jan 2020 • Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

In most current approaches of speech processing, information is extracted from the magnitude spectrum.

Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics

no code implementations • 28 Dec 2019 • Thomas Drugman, Abeer Alwan

This paper focuses on the problem of pitch tracking in noisy conditions.
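This entry's method scores F0 candidates by summing the residual's harmonic amplitudes while penalizing inter-harmonic amplitudes. The sketch below implements that scoring criterion over a precomputed amplitude spectrum; treating the input as the residual spectrum, and the exact bin lookup, are simplifying assumptions.

```python
# Sketch of a summation-of-residual-harmonics style criterion: each F0
# candidate is scored by its harmonic amplitudes minus the amplitudes
# halfway between harmonics, and the best-scoring candidate wins.
def srh_pitch(amplitude_spectrum, bin_hz, f0_min=60, f0_max=400, n_harm=5):
    def amp(freq_hz):
        idx = int(round(freq_hz / bin_hz))
        return amplitude_spectrum[idx] if idx < len(amplitude_spectrum) else 0.0

    best_f0, best_score = 0.0, float("-inf")
    for f0 in range(f0_min, f0_max + 1):
        score = amp(f0)
        for k in range(2, n_harm + 1):
            # Reward energy at harmonics, penalize inter-harmonic energy
            # (which suppresses octave errors at f0/2).
            score += amp(k * f0) - amp((k - 0.5) * f0)
        if score > best_score:
            best_f0, best_score = float(f0), score
    return best_f0
```

The inter-harmonic penalty is what makes the criterion robust: a subharmonic candidate places "harmonics" between the true peaks and gets penalized for it.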

Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review

no code implementations • 28 Dec 2019 • Thomas Drugman, Mark Thomas, Jon Gudnason, Patrick Naylor, Thierry Dutoit

The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA).

Event Detection

On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

no code implementations • 2 Jan 2020 • Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal.

Excitation-based Voice Quality Analysis and Modification

no code implementations • 2 Jan 2020 • Thomas Drugman, Thierry Dutoit, Baris Bozkurt

This paper investigates the differences occurring in the excitation for different voice qualities.

Speech Synthesis

Eigenresiduals for improved Parametric Speech Synthesis

no code implementations • 2 Jan 2020 • Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices.

Speech Synthesis

A Comparative Evaluation of Pitch Modification Techniques

no code implementations • 2 Jan 2020 • Thomas Drugman, Thierry Dutoit

This paper addresses the problem of pitch modification, as an important module for an efficient voice transformation system.

A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis

no code implementations • 29 Dec 2019 • Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual.

Speech Synthesis

A Comparative Study of Glottal Source Estimation Techniques

no code implementations • 28 Dec 2019 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

Techniques based on the mixed-phase decomposition and on a closed-phase inverse filtering process turn out to give the best results on both clean synthetic and real speech signals.

Glottal Closure and Opening Instant Detection from Speech Signals

no code implementations • 28 Dec 2019 • Thomas Drugman, Thierry Dutoit

This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms.

Position

Chirp Complex Cepstrum-based Decomposition for Asynchronous Glottal Analysis

no code implementations • 10 May 2020 • Thomas Drugman, Thierry Dutoit

It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech.

Oscillating Statistical Moments for Speech Polarity Detection

no code implementations • 16 May 2020 • Thomas Drugman, Thierry Dutoit

An inversion of the speech polarity may have a dramatic detrimental effect on the performance of various techniques of speech processing.

Glottal Source Estimation using an Automatic Chirp Decomposition

no code implementations • 16 May 2020 • Thomas Drugman, Baris Bozkurt, Thierry Dutoit

In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT).

Data-driven Detection and Analysis of the Patterns of Creaky Voice

no code implementations • 31 May 2020 • Thomas Drugman, John Kane, Christer Gobl

This paper investigates the temporal excitation patterns of creaky voice.

Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra

no code implementations • 31 May 2020 • Thomas Drugman, Yannis Stylianou

Recent studies have shown that its proper estimation and modeling enhance the quality of statistical parametric speech synthesizers.

Residual Excitation Skewness for Automatic Speech Polarity Detection

no code implementations • 31 May 2020 • Thomas Drugman

Detecting the correct speech polarity is a necessary step prior to several speech processing techniques.
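The idea in this entry (inferring polarity from the asymmetry of the residual excitation) reduces to a sign check on a skewness statistic. The sketch below computes plain sample skewness on any 1-D signal; using it directly, rather than on a linear-prediction residual as the paper does, is a simplification.

```python
# Sketch: speech polarity from the skewness of an excitation-like
# signal. The paper uses the LP residual; here plain sample skewness
# on the raw signal stands in for that step.
def polarity(signal):
    n = len(signal)
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n   # variance
    m3 = sum((x - mean) ** 3 for x in signal) / n   # third central moment
    skewness = m3 / (m2 ** 1.5)
    return 1 if skewness >= 0 else -1   # +1: positive polarity
```

Glottal excitation pulses point predominantly in one direction, so an inverted recording simply flips the sign of the skewness.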

Maximum Phase Modeling for Sparse Linear Prediction of Speech

no code implementations • 7 Jun 2020 • Thomas Drugman

The proposed method is shown to significantly increase the sparsity of the LP residual signal and to be effective in two illustrative applications: speech polarity detection and excitation modeling.

Analysis and Synthesis of Hypo and Hyperarticulated Speech

no code implementations • 7 Jun 2020 • Benjamin Picart, Thomas Drugman, Thierry Dutoit

This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis.

Speech Synthesis

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

no code implementations • 4 Nov 2020 • Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text.

Graph Attention • Representation Learning +2

Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling

no code implementations • 16 Jan 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.

Automatic Phoneme Recognition • Sentence +1

Weakly-supervised word-level pronunciation error detection in non-native English speech

no code implementations • 7 Jun 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words.

Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

no code implementations • 29 Dec 2020 • Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS).

Data Augmentation

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

no code implementations • 29 Jun 2021 • Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangen, Sri Karlapati, Thomas Drugman

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody.

Sentence

Distribution augmentation for low-resource expressive text-to-speech

no code implementations • 13 Feb 2022 • Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows new (text, audio) training examples to be generated without requiring any additional data.

Data Augmentation

CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

no code implementations • 27 Jun 2022 • Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers.

Expressive, Variable, and Controllable Duration Modelling in TTS

no code implementations • 28 Jun 2022 • Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses.

Normalising Flows • Speech Synthesis

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

no code implementations • 29 Jun 2022 • Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS.

Language Modelling • Sentence

Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

no code implementations • 2 Jul 2022 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek

We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.

Speech Synthesis

eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

no code implementations • 20 Jun 2023 • Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT.

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

no code implementations • 4 Sep 2023 • Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman

In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction.

Natural Language Understanding • Prosody Prediction

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

no code implementations • 12 Feb 2024 • Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.

Disentanglement
