no code implementations • 5 Feb 2024 • Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Iván Vallés-Pérez, Biel Tura-Vecino, Piotr Biliński, Mateusz Lajszczak, Grzegorz Beringer, Roberto Barra-Chicote, Jaime Lorenzo-Trueba
Using speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the LLM to generate the content and the style of the speech from the text alone, much as humans do, while the speaker identity is provided by the decoder of the VC model.
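A minimal sketch of the pipeline this describes; all names (lm, speaker_encoder, vc_decoder) are hypothetical placeholders, not the paper's actual API:

```python
def synthesize(text, reference_audio, lm, speaker_encoder, vc_decoder):
    # The LLM predicts speaker-agnostic acoustic codes from text alone,
    # so it controls content and style but carries no speaker identity.
    codes = lm.generate(text)

    # Speaker identity enters only at decoding time, via an embedding
    # extracted from a short reference recording of the target voice.
    spk = speaker_encoder(reference_audio)

    # The decoder of the VC model renders the codes in that voice.
    return vc_decoder(codes, speaker_embedding=spk)
```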
no code implementations • 31 Jul 2023 • Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space.
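To see why L1/L2 losses encode distributional assumptions: minimizing L2 is maximum likelihood under a fixed-variance Gaussian, and minimizing L1 is maximum likelihood under a Laplace distribution. A small numerical illustration:

```python
import numpy as np

target = np.random.randn(1000)
pred = target + 0.1 * np.random.randn(1000)
residual = pred - target

l2 = np.mean(residual ** 2)
gaussian_nll = 0.5 * np.mean(residual ** 2) + 0.5 * np.log(2 * np.pi)  # sigma = 1
l1 = np.mean(np.abs(residual))
laplace_nll = np.mean(np.abs(residual)) + np.log(2.0)                  # scale b = 1

# Up to additive/multiplicative constants, gaussian_nll tracks l2 and
# laplace_nll tracks l1, so both losses assume unimodal, fixed-scale
# residuals -- a poor fit for multimodal speech targets.
```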
no code implementations • 31 Jul 2023 • Manuel Sam Ribeiro, Giulia Comini, Jaime Lorenzo-Trueba
The G2P model is used to train a multilingual phone recognition system, which then decodes speech recordings into a phonetic representation.
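A hedged sketch of the bootstrapping loop this suggests; every function name here is illustrative, not the authors' code:

```python
def build_pronunciations(words, recordings, g2p, train_phone_recognizer):
    # 1. A G2P model converts word spellings into phone sequences.
    pseudo_lexicon = {w: g2p(w) for w in words}

    # 2. Those phone sequences supervise a multilingual phone recognizer.
    recognizer = train_phone_recognizer(pseudo_lexicon, recordings)

    # 3. The recognizer decodes the speech recordings into phone strings,
    #    yielding data-driven pronunciations for the TTS front-end.
    return {utt: recognizer.decode(audio) for utt, audio in recordings.items()}
```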
no code implementations • 31 Jul 2023 • Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba
Phonetic information and linguistic knowledge are essential components of a Text-to-Speech (TTS) front-end.
no code implementations • 29 Jul 2022 • Giulia Comini, Goeric Huybrechts, Manuel Sam Ribeiro, Adam Gabrys, Jaime Lorenzo-Trueba
The availability of data in expressive styles across languages is limited, and recording sessions are costly and time-consuming.
no code implementations • 2 Jul 2022 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek
We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.
no code implementations • 16 Feb 2022 • Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba
It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual shift in the existing TTS paradigm, framing the few-shot TTS problem as a VC task.
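The described post-processing arrangement can be sketched as follows; tts and train_vc are hypothetical interfaces, not the paper's code:

```python
def few_shot_tts(text, target_speaker_samples, tts, train_vc):
    # A pre-existing, high-quality TTS system synthesizes in a source
    # voice for which abundant training data exists.
    source_audio = tts.synthesize(text)

    # A VC model trained on only a few samples of the target speaker
    # then converts the identity of the synthesized speech.
    vc = train_vc(target_speaker_samples)
    return vc.convert(source_audio)
```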
no code implementations • 10 Feb 2022 • Manuel Sam Ribeiro, Julian Roth, Giulia Comini, Goeric Huybrechts, Adam Gabrys, Jaime Lorenzo-Trueba
The proposed approach relies on voice conversion to first generate high-quality data from the set of supporting expressive speakers.
no code implementations • 13 Aug 2021 • Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov
Artificial speech synthesis has made a great leap in naturalness: recent Text-to-Speech (TTS) systems can produce speech of quality comparable to human recordings.
1 code implementation • 16 Jun 2021 • Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
no code implementations • 7 Jun 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek
Training this model does not require phonetically transcribed L2 speech; we only need to mark mispronounced words.
1 code implementation • NAACL 2021 • Shubhi Tyagi, Antonio Bonafonte, Jaime Lorenzo-Trueba, Javier Latorre
Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard.
no code implementations • 16 Jan 2021 • Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.
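The comparison step can be illustrated with a simple alignment between recognized and canonical phone sequences; this is a generic sketch of the baseline approach, not the paper's implementation:

```python
from difflib import SequenceMatcher

def flag_mispronunciations(recognized: list[str], canonical: list[str]):
    """Align recognized phonemes against the native (canonical) pronunciation
    and report substitutions, deletions, and insertions."""
    errors = []
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            errors.append((op, canonical[i1:i2], recognized[j1:j2]))
    return errors

# e.g. "think" pronounced with /s/ instead of /th/:
print(flag_mispronunciations(["s", "ih", "ng", "k"], ["th", "ih", "ng", "k"]))
# -> [('replace', ['th'], ['s'])]
```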
no code implementations • 14 Jan 2021 • Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba
In this work we propose EmoCat, a language-agnostic emotional voice conversion model.
no code implementations • 29 Dec 2020 • Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS).
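A toy version of attention-based feature pooling over the frames of a word, illustrating the idea rather than the paper's architecture:

```python
import numpy as np

def attention_pool(frames: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weight each frame of a word by a learned scoring vector w, so frames
    relevant to stress (e.g. vowel nuclei) dominate the word-level feature.
    frames: (T, D) acoustic features; w: (D,) scoring vector."""
    scores = frames @ w                       # (T,) relevance per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over time
    return weights @ frames                   # (D,) pooled word embedding
```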
no code implementations • 17 Dec 2020 • Jonas Rohnke, Tom Merritt, Jaime Lorenzo-Trueba, Adam Gabrys, Vatsal Aggarwal, Alexis Moinet, Roberto Barra-Chicote
In this paper we investigate the use of a sentence-level conditioning vector to improve the signal quality of a Parallel WaveNet neural vocoder.
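Conditioning on a sentence-level vector typically amounts to broadcasting it across time and concatenating it with the frame-level features; a minimal sketch under assumed shapes, not the paper's exact scheme:

```python
import numpy as np

def build_conditioning(mel: np.ndarray, sentence_vec: np.ndarray) -> np.ndarray:
    """Concatenate a single sentence-level vector onto every frame of the
    usual mel-spectrogram conditioning, so the vocoder sees utterance-wide
    context at each step. mel: (T, d_mel); sentence_vec: (d_sent,)."""
    T = mel.shape[0]
    tiled = np.tile(sentence_vec, (T, 1))        # (T, d_sent)
    return np.concatenate([mel, tiled], axis=1)  # (T, d_mel + d_sent)
```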
no code implementations • 11 Nov 2020 • Goeric Huybrechts, Thomas Merritt, Giulia Comini, Bartek Perz, Raahil Shah, Jaime Lorenzo-Trueba
While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style.
no code implementations • 11 Dec 2019 • Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet
We present an approach to whisper synthesis that applies a handcrafted signal-processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech into whispered speech.
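One classic signal-processing recipe for whisperization, not necessarily the paper's exact one, keeps the LPC vocal-tract envelope of each frame but replaces the glottal excitation with white noise:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def whisperize(y: np.ndarray, sr: int, frame_len: int = 1024) -> np.ndarray:
    """Crude whisper effect: per frame, estimate the vocal-tract filter by
    LPC and resynthesize by driving it with noise, removing voicing."""
    out = np.zeros_like(y)
    win = np.hanning(frame_len)
    for start in range(0, len(y) - frame_len, frame_len // 2):
        frame = y[start:start + frame_len] * win
        if np.mean(frame ** 2) < 1e-8:
            continue  # skip near-silent frames (LPC is ill-conditioned there)
        a = librosa.lpc(frame, order=16)       # vocal-tract filter A(z)
        noise = np.random.randn(frame_len)
        synth = lfilter([1.0], a, noise)       # white noise through 1/A(z)
        gain = np.sqrt(np.mean(frame ** 2) / (np.mean(synth ** 2) + 1e-9))
        out[start:start + frame_len] += synth * gain * win  # overlap-add
    return out
```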
no code implementations • 2 Dec 2019 • Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba
Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human capabilities when considering isolated sentences.
no code implementations • 28 Nov 2019 • Vatsal Aggarwal, Marius Cotescu, Nishant Prateek, Jaime Lorenzo-Trueba, Roberto Barra-Chicote
We propose a Text-to-Speech method to create an unseen expressive style using a single utterance of expressive speech around one second long.
1 code implementation • 10 Nov 2019 • Seyyed Saeed Sarfjoo, Xin Wang, Gustav Eje Henter, Jaime Lorenzo-Trueba, Shinji Takaki, Junichi Yamagishi
Nowadays vast amounts of speech data are recorded from low-quality recording devices such as smartphones, tablets, laptops, and medium-quality microphones.
1 code implementation • 4 Jul 2019 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal
This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality.
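For reference, relative mean MUSHRA is simply the system's mean listener score expressed as a percentage of the mean score given to the hidden natural reference:

```python
import numpy as np

def relative_mushra(system_scores, natural_scores) -> float:
    """Relative mean MUSHRA, in percent of the natural reference's score."""
    return 100.0 * np.mean(system_scores) / np.mean(natural_scores)

# e.g. listener scores on the 0-100 MUSHRA scale:
print(relative_mushra([75, 80, 78], [78, 82, 80]))  # ~97.1
```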
1 code implementation • NAACL 2019 • Nishant Prateek, Mateusz Łajszczak, Roberto Barra-Chicote, Thomas Drugman, Jaime Lorenzo-Trueba, Thomas Merritt, Srikanth Ronanki, Trevor Wood
Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require a large quantity of training data.
8 code implementations • 15 Nov 2018 • Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote
This paper introduces a robust universal neural vocoder trained with 74 speakers of both genders coming from 17 languages.
no code implementations • 15 Nov 2018 • Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Viacheslav Klimkov
Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings.
no code implementations • 30 Jul 2018 • Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Junichi Yamagishi
Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text.
no code implementations • 23 Apr 2018 • Tomi Kinnunen, Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Zhen-Hua Ling
As a supplement to subjective results for the 2018 Voice Conversion Challenge (VCC'18) data, we configure a standard constant-Q cepstral coefficient countermeasure (CM) to quantify the extent of processing artifacts.
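A simplified sketch of constant-Q cepstral feature extraction (full CQCC additionally resamples the geometric frequency axis to a uniform one before the DCT; that step is omitted here):

```python
import numpy as np
import librosa
from scipy.fft import dct

def cqcc_like(y: np.ndarray, sr: int, n_coeffs: int = 20) -> np.ndarray:
    """Simplified constant-Q cepstral features: log power of a constant-Q
    transform followed by a DCT along the frequency axis."""
    C = np.abs(librosa.cqt(y, sr=sr)) ** 2              # constant-Q power spectrum
    log_c = np.log(C + 1e-10)
    return dct(log_c, axis=0, norm="ortho")[:n_coeffs]  # (n_coeffs, frames)
```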
no code implementations • 12 Apr 2018 • Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhen-Hua Ling
We present the Voice Conversion Challenge 2018, designed as a follow-up to the 2016 edition, with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems.
no code implementations • 7 Apr 2018 • Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi
Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches.
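The minimum-phase approximation mentioned here can be made concrete: given only an amplitude spectrum, the classic homomorphic method reconstructs the unique minimum-phase signal consistent with it, discarding any true phase detail. A sketch:

```python
import numpy as np

def min_phase_from_amplitude(amplitude: np.ndarray) -> np.ndarray:
    """Homomorphic (cepstral) reconstruction of the minimum-phase signal
    consistent with a full-length, symmetric amplitude spectrum (even
    length assumed). The true phase is lost -- the lossy approximation
    referred to above."""
    n = len(amplitude)
    cep = np.fft.ifft(np.log(amplitude + 1e-10)).real  # real cepstrum
    w = np.zeros(n)
    w[0], w[n // 2] = 1.0, 1.0
    w[1:n // 2] = 2.0          # fold anti-causal quefrencies onto causal ones
    return np.fft.ifft(np.exp(np.fft.fft(cep * w))).real
```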
no code implementations • 2 Apr 2018 • Fuming Fang, Junichi Yamagishi, Isao Echizen, Jaime Lorenzo-Trueba
Although voice conversion (VC) algorithms have achieved remarkable success along with the development of machine learning, superior performance is still difficult to achieve when using nonparallel data.
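The usual way to obtain a training signal from nonparallel data is a cycle-consistency constraint (CycleGAN-style); a minimal sketch with placeholder converter functions:

```python
import numpy as np

def cycle_consistency_loss(x_a, g_ab, g_ba):
    """Without parallel utterances, converting speaker A -> B -> A should
    reconstruct the input, so the reconstruction error supervises the
    converters from unpaired data. g_ab and g_ba are placeholder models."""
    x_ab = g_ab(x_a)                      # convert A's features toward B
    x_aba = g_ba(x_ab)                    # convert back toward A
    return np.mean(np.abs(x_aba - x_a))   # L1 cycle loss
```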
no code implementations • 2 Mar 2018 • Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Junichi Yamagishi, Tomi Kinnunen
Thanks to the growing availability of spoofing databases and rapid advances in using them, systems for detecting voice spoofing attacks are becoming increasingly capable, and error rates close to zero are being reached on the ASVspoof2015 database.
no code implementations • COLING 2016 • Jaime Lorenzo-Trueba, Roberto Barra-Chicote, Ascension Gallardo-Antolin, Junichi Yamagishi, Juan M. Montero
This paper introduces a continuous system capable of automatically selecting the most appropriate speaking style with which to synthesize a desired target text.