no code implementations • LREC 2022 • Siyang Wang, Joakim Gustafson, Éva Székely
Perceptual results show little difference between the compared filler insertion models, including ground-truth insertion. This may be due to ambiguity about what constitutes good filler insertion, and to a strong neural spontaneous-TTS model that produces natural speech irrespective of its input.
no code implementations • 8 Oct 2023 • Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter
As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures.
Ranked #1 on Motion Synthesis on Trinity Speech-Gesture Dataset
1 code implementation • 6 Sep 2023 • Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, Gustav Eje Henter
We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM).
Ranked #1 on Text-To-Speech Synthesis on LJSpeech (MOS metric)
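The OT-CFM objective mentioned above can be illustrated with a minimal numpy sketch. This is not the Matcha-TTS implementation; it only shows the standard conditional flow-matching recipe: interpolate on a straight line between a noise sample and a data sample, and regress a (here hypothetical) vector-field predictor onto the constant target field.

```python
import numpy as np

rng = np.random.default_rng(0)

def ot_cfm_training_pair(x1, sigma_min=1e-4):
    """One OT-CFM training example: sample noise x0 and a time t,
    form the straight-line interpolant x_t, and return the constant
    target vector field u_t = x1 - (1 - sigma_min) * x0."""
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    t = rng.uniform()                   # time in [0, 1]
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

def cfm_loss(predict_vector_field, x1_batch):
    """Mean squared error between predicted and target vector fields.
    `predict_vector_field` is a stand-in for the neural network."""
    losses = []
    for x1 in x1_batch:
        t, x_t, u_t = ot_cfm_training_pair(x1)
        v = predict_vector_field(x_t, t)
        losses.append(np.mean((v - u_t) ** 2))
    return float(np.mean(losses))
```

At synthesis time, the learned field is integrated with an ODE solver; because the OT paths are straight, few solver steps suffice, which is where the speed of this family of models comes from.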
no code implementations • 11 Jul 2023 • Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech.
no code implementations • 15 Jun 2023 • Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter
With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech.
no code implementations • 29 May 2023 • Erik Ekstedt, Siyang Wang, Éva Székely, Joakim Gustafson, Gabriel Skantze
Turn-taking is a fundamental aspect of human communication where speakers convey their intention to either hold, or yield, their turn through prosodic cues.
no code implementations • 5 Mar 2023 • Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec 2.0 as the representation medium in standard two-stage TTS, in place of the conventionally used mel-spectrograms.
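The two-stage structure this line refers to can be sketched abstractly: the intermediate representation (mel-spectrogram or SSL features) is simply the contract between an acoustic model and a vocoder. The stand-in functions and dimensions below are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class TwoStageTTS:
    """Generic two-stage TTS: text -> intermediate frames -> waveform.
    Swapping mel-spectrograms for SSL features only changes the
    frame dimension and the models trained on either side of it."""
    acoustic_model: Callable[[str], np.ndarray]  # text -> (frames, dim)
    vocoder: Callable[[np.ndarray], np.ndarray]  # frames -> samples

    def synthesise(self, text: str) -> np.ndarray:
        frames = self.acoustic_model(text)
        return self.vocoder(frames)

# Toy stand-ins: 768-dim "SSL" frames, 256 samples per frame (illustrative).
def toy_acoustic_model(text: str) -> np.ndarray:
    return np.zeros((len(text), 768))

def toy_vocoder(frames: np.ndarray) -> np.ndarray:
    return np.zeros(frames.shape[0] * 256)
```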
no code implementations • 24 Nov 2022 • Harm Lameris, Shivam Mehta, Gustav Eje Henter, Joakim Gustafson, Éva Székely
Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS.
2 code implementations • 13 Nov 2022 • Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter
Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech.
Ranked #11 on Text-To-Speech Synthesis on LJSpeech (using extra training data)
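A defining property of neural HMM TTS is that the model admits an exact sequence likelihood via the classic forward algorithm over a left-to-right HMM. The sketch below shows only that forward recursion; in the actual models the per-state emission and transition terms come from a neural network, which is abstracted away here.

```python
import numpy as np

def forward_log_likelihood(log_emit, log_stay, log_move):
    """Exact log-likelihood of a frame sequence under a left-to-right,
    no-skip HMM (start in the first state, end in the last).
      log_emit: (T, S) log p(frame_t | state_s)
      log_stay: (S,)   log prob of self-transition per state
      log_move: (S,)   log prob of advancing to the next state
    """
    T, S = log_emit.shape
    alpha = np.full(S, -np.inf)
    alpha[0] = log_emit[0, 0]  # forced start in state 0
    for t in range(1, T):
        stay = alpha + log_stay                    # remain in same state
        move = np.full(S, -np.inf)
        move[1:] = alpha[:-1] + log_move[:-1]      # advance by one state
        alpha = np.logaddexp(stay, move) + log_emit[t]
    return alpha[-1]  # probability mass that ends in the final state
```

Training maximises this quantity directly, which is what distinguishes these transducers from attention-based sequence-to-sequence TTS.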
2 code implementations • 30 Aug 2021 • Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter
Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs.
Ranked #3 on Speech Synthesis on LJSpeech
1 code implementation • 25 Aug 2021 • Siyang Wang, Simon Alexanderson, Joakim Gustafson, Jonas Beskow, Gustav Eje Henter, Éva Székely
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline.
no code implementations • 14 Jan 2021 • Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow
In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data.