Search Results for author: Gustav Eje Henter

Found 40 papers, 20 papers with code

Exploring Internal Numeracy in Language Models: A Case Study on ALBERT

no code implementations • 25 Apr 2024 • Ulme Wennberg, Gustav Eje Henter

It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning.

Paper
Add Code

Unified speech and gesture synthesis using flow matching

no code implementations • 8 Oct 2023 • Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures.

Ranked #1 on Motion Synthesis on Trinity Speech-Gesture Dataset

Audio Synthesis Motion Synthesis +1

Paper
Add Code

Matcha-TTS: A fast TTS architecture with conditional flow matching

1 code implementation • 6 Sep 2023 • Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, Gustav Eje Henter

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM).

Ranked #1 on Text-To-Speech Synthesis on LJSpeech (MOS metric)

Acoustic Modelling Speech Synthesis +1

386

Paper
Code

The GENEA Challenge 2023: A large scale evaluation of gesture generation models in monadic and dyadic settings

2 code implementations • 24 Aug 2023 • Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter

The effect of the interlocutor is even more subtle, with submitted systems at best performing barely above chance.

Gesture Generation

121

Paper
Code

On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

no code implementations • 11 Jul 2023 • Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely

Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech.

Self-Supervised Learning Speech Synthesis

Paper
Add Code

Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

no code implementations • 15 Jun 2023 • Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech.

Denoising Speech Synthesis

Paper
Add Code

Speaker-independent neural formant synthesis

no code implementations • 2 Jun 2023 • Pablo Pérez Zarazaga, Zofia Malisz, Gustav Eje Henter, Lauri Juvela

We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a. k. a.

Speech Synthesis

Paper
Add Code

Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022

no code implementations • 15 Mar 2023 • Taras Kucherenko, Pieter Wolfert, Youngwoo Yoon, Carla Viegas, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter

For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal.

Gesture Generation

Paper
Add Code

A processing framework to access large quantities of whispered speech found in ASMR

no code implementations • 13 Mar 2023 • Pablo Perez Zarazaga, Gustav Eje Henter, Zofia Malisz

Whispering is a ubiquitous mode of communication that humans use daily.

Action Detection Activity Detection

Paper
Add Code

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

no code implementations • 5 Mar 2023 • Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2. 0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms.

Self-Supervised Learning

Paper
Add Code

Context-specific kernel-based hidden Markov model for time series analysis

1 code implementation • 24 Jan 2023 • Carlos Puerto-Santana, Concha Bielza, Pedro Larrañaga, Gustav Eje Henter

Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic data; in the case of non-Gaussian data, models such as mixture of Gaussian hidden Markov models can be used.

Density Estimation Time Series +1

Paper
Code

A Comprehensive Review of Data-Driven Co-Speech Gesture Generation

no code implementations • 13 Jan 2023 • Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, Michael Neff

Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications.

Gesture Generation

Paper
Add Code

Prosody-controllable spontaneous TTS with neural HMMs

no code implementations • 24 Nov 2022 • Harm Lameris, Shivam Mehta, Gustav Eje Henter, Joakim Gustafson, Éva Székely

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS.

valid

Paper
Add Code

Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

1 code implementation • 17 Nov 2022 • Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models.

Gesture Generation Motion Synthesis

121

Paper
Code

OverFlow: Putting flows on top of neural transducers for better TTS

2 code implementations • 13 Nov 2022 • Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech.

Ranked #11 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Normalising Flows Speech Synthesis +1

29,277

Paper
Code

Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks

1 code implementation • 22 Sep 2022 • Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter

While previous work has focused on predicting listeners' ratings (mean opinion scores) of individual stimuli, we focus on the simpler task of predicting subjective preference given two speech stimuli for the same text.

Paper
Code

The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation

3 code implementations • 22 Aug 2022 • Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Carla Viegas, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter

On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings.

Gesture Generation

Paper
Code

Wavebender GAN: An architecture for phonetically meaningful speech manipulation

no code implementations • 22 Feb 2022 • Gustavo Teodoro Döhler Beck, Ulme Wennberg, Zofia Malisz, Gustav Eje Henter

Deep learning has revolutionised synthetic speech quality.

Paper
Add Code

Neural HMMs are all you need (for high-quality attention-free TTS)

2 code implementations • 30 Aug 2021 • Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs.

Ranked #3 on Speech Synthesis on LJSpeech

Speech Synthesis

29,277

Paper
Code

Integrated Speech and Gesture Synthesis

1 code implementation • 25 Aug 2021 • Siyang Wang, Simon Alexanderson, Joakim Gustafson, Jonas Beskow, Gustav Eje Henter, Éva Székely

Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline.

Speech Synthesis

Paper
Code

Multimodal analysis of the predictability of hand-gesture properties

no code implementations • 12 Aug 2021 • Taras Kucherenko, Rajmund Nagy, Michael Neff, Hedvig Kjellström, Gustav Eje Henter

Embodied conversational agents benefit from being able to accompany their speech with gestures.

Gesture Generation

Paper
Add Code

Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability

1 code implementation • 1 Jul 2021 • Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, Saikat Chatterjee

For a standard speech phone classification setup involving 39 phones (classes) and the TIMIT dataset, we show that the use of standard features called mel-frequency-cepstral-coeffcients (MFCCs), the proposed generative models, and the decision fusion together can achieve $86. 6\%$ accuracy by generative training only.

Classification

Paper
Code

Speech2Properties2Gestures: Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech

no code implementations • 28 Jun 2021 • Taras Kucherenko, Rajmund Nagy, Patrik Jonell, Michael Neff, Hedvig Kjellström, Gustav Eje Henter

We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures.

Gesture Generation Property Prediction

Paper
Add Code

Transflower: probabilistic autoregressive dance generation with multimodal attention

no code implementations • 25 Jun 2021 • Guillermo Valle-Pérez, Gustav Eje Henter, Jonas Beskow, André Holzapfel, Pierre-Yves Oudeyer, Simon Alexanderson

First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder.

Paper
Add Code

The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models

1 code implementation • ACL 2021 • Ulme Wennberg, Gustav Eje Henter

In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention.

Position Translation

Paper
Code

Robust Classification using Hidden Markov Models and Mixtures of Normalizing Flows

no code implementations • 15 Feb 2021 • Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, Saikat Chatterjee

We test the robustness of a maximum-likelihood (ML) based classifier where sequential data as observation is corrupted by noise.

General Classification Robust classification +2

Paper
Add Code

Generating coherent spontaneous speech and gesture from text

no code implementations • 14 Jan 2021 • Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data.

Gesture Generation Speech Synthesis

Paper
Add Code

Full-Glow: Fully conditional Glow for more realistic image generation

1 code implementation • 10 Dec 2020 • Moein Sorkhei, Gustav Eje Henter, Hedvig Kjellström

Autonomous agents, such as driverless cars, require large amounts of labeled visual data for their training.

Image Generation Object Recognition +2

Paper
Code

Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation

1 code implementation • 16 Jul 2020 • Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gustav Eje Henter, Hedvig Kjellström

We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations.

Gesture Generation Representation Learning

103

Paper
Code

Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings

1 code implementation • 11 Jun 2020 • Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, Jonas Beskow

Our contributions are: a) a method for feature extraction from multi-party video and speech recordings, resulting in a representation that allows for independent control and manipulation of expression and speech articulation in a 3D avatar; b) an extension to MoGlow, a recent motion-synthesis method based on normalizing flows, to also take multi-modal signals from the interlocutor as input and subsequently output interlocutor-aware facial gestures; and c) a subjective evaluation assessing the use and relative importance of the input modalities.

Motion Synthesis

Paper
Code

Robust model training and generalisation with Studentising flows

1 code implementation • 11 Jun 2020 • Simon Alexanderson, Gustav Eje Henter

Normalising flows are tractable probabilistic models that leverage the power of deep learning to describe a wide parametric family of distributions, all while remaining trainable using maximum likelihood.

Normalising Flows

466

Paper
Code

Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows

1 code implementation • Computer Graphics Forum 2020 • Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters.

Gesture Generation Motion Synthesis +1

466

Paper
Code

Gesticulator: A framework for semantically-aware speech-driven gesture generation

1 code implementation • 25 Jan 2020 • Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström

During speech, people spontaneously gesticulate, which plays a key role in conveying information.

Gesture Generation

119

Paper
Code

Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model

1 code implementation • 10 Nov 2019 • Seyyed Saeed Sarfjoo, Xin Wang, Gustav Eje Henter, Jaime Lorenzo-Trueba, Shinji Takaki, Junichi Yamagishi

Nowadays vast amounts of speech data are recorded from low-quality recorder devices such as smartphones, tablets, laptops, and medium-quality microphones.

Sound Audio and Speech Processing

Paper
Code

MoGlow: Probabilistic and controllable motion synthesis using normalising flows

3 code implementations • 16 May 2019 • Gustav Eje Henter, Simon Alexanderson, Jonas Beskow

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics.

Motion Synthesis Normalising Flows

504

Paper
Code

Analyzing Input and Output Representations for Speech-Driven Gesture Generation

1 code implementation • arXiv 2019 • Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström

We evaluate different representation sizes in order to find the most effective dimensionality for the representation.

Gesture Generation Human-Computer Interaction I.2.6; I.5.1; J.4

103

Paper
Code

Kernel Density Estimation-Based Markov Models with Hidden State

no code implementations • 30 Jul 2018 • Gustav Eje Henter, Arne Leijon, W. Bastiaan Kleijn

We consider Markov models of stochastic processes where the next-step conditional distribution is defined by a kernel density estimator (KDE), similar to Markov forecast densities and certain time-series bootstrap schemes.

Density Estimation Time Series Analysis

Paper
Add Code

Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

no code implementations • 30 Jul 2018 • Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Junichi Yamagishi

Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text.

Acoustic Modelling Emotional Speech Synthesis +1

Paper
Add Code

Consensus-based Sequence Training for Video Captioning

no code implementations • 27 Dec 2017 • Sang Phan, Gustav Eje Henter, Yusuke Miyao, Shin'ichi Satoh

First we show that, by replacing model samples with ground-truth sentences, RL training can be seen as a form of weighted cross-entropy loss, giving a fast, RL-based pre-training algorithm.

Reinforcement Learning (RL) Video Captioning

Paper
Add Code

Median-Based Generation of Synthetic Speech Durations using a Non-Parametric Approach

no code implementations • 22 Aug 2016 • Srikanth Ronanki, Oliver Watts, Simon King, Gustav Eje Henter

This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame).

Speech Synthesis

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.