Search Results for author: Timo Gerkmann

Found 56 papers, 21 papers with code

Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models

no code implementations • 14 Aug 2024 • Jean-Marie Lemercier, Eloi Moliner, Simon Welker, Vesa Välimäki, Timo Gerkmann

This paper presents an unsupervised method for single-channel blind dereverberation and room impulse response (RIR) estimation, called BUDDy.

Room Impulse Response (RIR) Speech Dereverberation

Robustness of Speech Separation Models for Similar-pitch Speakers

no code implementations • 22 Jul 2024 • Bunlong Lay, Sebastian Zaczek, Kristina Tesch, Timo Gerkmann

Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments.

Speech Recognition +1

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

2 code implementations • 10 Jun 2024 • Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.

Speech Enhancement

The PESQetarian: On the Relevance of Goodhart's Law for Speech Enhancement

no code implementations • 5 Jun 2024 • Danilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann

The goal of this paper is to illustrate the risk of overfitting a speech enhancement model to the metric used for evaluation.

Speech Enhancement

BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models

no code implementations • 7 May 2024 • Eloi Moliner, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann, Vesa Välimäki

In this paper, we present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation, based on posterior sampling with diffusion models.

Diffusion Models for Audio Restoration

no code implementations • 15 Feb 2024 • Jean-Marie Lemercier, Julius Richter, Simon Welker, Eloi Moliner, Vesa Välimäki, Timo Gerkmann

Here, we aim to show that diffusion models can combine the best of both worlds and offer the opportunity to design audio restoration algorithms with a good degree of interpretability and a remarkable performance in terms of sound quality.

Speech Enhancement

An Analysis of the Variance of Diffusion-based Speech Enhancement

no code implementations • 1 Feb 2024 • Bunlong Lay, Timo Gerkmann

The speech enhancement performance varies depending on the choice of the stochastic differential equation that controls the evolution of the mean and the variance along the diffusion processes when adding environmental and Gaussian noise.

Speech Enhancement

Single and Few-step Diffusion for Generative Speech Enhancement

1 code implementation • 18 Sep 2023 • Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann

While the performance of usual generative diffusion algorithms drops dramatically when lowering the number of function evaluations (NFEs) to obtain single-step diffusion, we show that our proposed method keeps a steady performance and therefore largely outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart.

Denoising Speech Enhancement

Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation

no code implementations • 18 Sep 2023 • Danilo de Oliveira, Timo Gerkmann

Much research effort is being applied to the task of compressing the knowledge of self-supervised models, which are powerful, yet large and memory consuming.

Automatic Speech Recognition Knowledge Distillation +2

Live Iterative Ptychography with projection-based algorithms

no code implementations • 14 Sep 2023 • Simon Welker, Tal Peer, Henry N. Chapman, Timo Gerkmann

In this work, we demonstrate that the ptychographic phase problem can be solved in a live fashion during scanning, while data is still being collected.

Retrieval

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

no code implementations • 14 Sep 2023 • Navin Raj Prabhu, Bunlong Lay, Simon Welker, Nale Lehmann-Willenbrock, Timo Gerkmann

Subsequently, at inference, a target emotion embedding is employed to convert the emotion of the input utterance to the given target emotion.

A Flexible Online Framework for Projection-Based STFT Phase Retrieval

no code implementations • 13 Sep 2023 • Tal Peer, Simon Welker, Johannes Kolhoff, Timo Gerkmann

Several recent contributions in the field of iterative STFT phase retrieval have demonstrated that the performance of the classical Griffin-Lim method can be considerably improved upon.

Retrieval

Wind Noise Reduction with a Diffusion-based Stochastic Regeneration Model

1 code implementation • 22 Jun 2023 • Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

We show that our stochastic regeneration model outperforms other neural-network-based wind noise reduction methods as well as purely predictive and generative models, on a dataset using simulated and real-recorded wind noise.

Diffusion Posterior Sampling for Informed Single-Channel Dereverberation

1 code implementation • 21 Jun 2023 • Jean-Marie Lemercier, Simon Welker, Timo Gerkmann

We present in this paper an informed single-channel dereverberation method based on conditional generation with diffusion models.

On the Behavior of Intrusive and Non-intrusive Speech Enhancement Metrics in Predictive and Generative Settings

no code implementations • 5 Jun 2023 • Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Tal Peer, Timo Gerkmann

Since its inception, the field of deep speech enhancement has been dominated by predictive (discriminative) approaches, such as spectral mapping or masking.

Denoising Speech Enhancement

In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

no code implementations • 2 Jun 2023 • Navin Raj Prabhu, Nale Lehmann-Willenbrock, Timo Gerkmann

In this work, we specifically focus on in-the-wild emotion conversion where parallel data does not exist, and the problem of disentangling lexical, speaker, and emotion information arises.

Resynthesis

Audio-Visual Speech Enhancement with Score-Based Generative Models

no code implementations • 2 Jun 2023 • Julius Richter, Simone Frintrop, Timo Gerkmann

This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information.

Automatic Speech Recognition Lipreading +3

Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model

1 code implementation • 31 May 2023 • Héctor Martel, Julius Richter, Kai Li, Xiaolin Hu, Timo Gerkmann

We propose Audio-Visual Lightweight ITerative model (AVLIT), an effective and lightweight neural network that uses Progressive Learning (PL) to perform audio-visual speech separation in noisy environments.

Speech Separation

Leveraging Semantic Information for Efficient Self-Supervised Emotion Recognition with Audio-Textual Distilled Models

no code implementations • 30 May 2023 • Danilo de Oliveira, Navin Raj Prabhu, Timo Gerkmann

In large part due to their implicit semantic modeling, self-supervised learning (SSL) methods have significantly increased the performance of valence recognition in speech emotion recognition (SER) systems.

Self-Supervised Learning Speech Emotion Recognition

Integrating Uncertainty into Neural Network-based Speech Enhancement

1 code implementation • 15 May 2023 • Huajian Fang, Dennis Becker, Stefan Wermter, Timo Gerkmann

In this paper, we study the benefits of modeling uncertainty in clean speech estimation.

Speech Enhancement

Multi-channel Speech Separation Using Spatially Selective Deep Non-linear Filters

no code implementations • 24 Apr 2023 • Kristina Tesch, Timo Gerkmann

In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture.

Speech Separation

Partially Adaptive Multichannel Joint Reduction of Ego-noise and Environmental Noise

no code implementations • 27 Mar 2023 • Huajian Fang, Niklas Wittmer, Johannes Twiefel, Stefan Wermter, Timo Gerkmann

In this paper, we propose a multichannel partially adaptive scheme to jointly model ego-noise and environmental noise utilizing the VAE-NMF framework, where we take advantage of spatially and spectrally structured characteristics of ego-noise by pre-training the ego-noise model, while retaining the ability to adapt to unknown environmental noise.

Speech Signal Improvement Using Causal Generative Diffusion Models

no code implementations • 15 Mar 2023 • Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Tal Peer, Timo Gerkmann

In this paper, we present a causal speech signal improvement system that is designed to handle different types of distortions.

Extending DNN-based Multiplicative Masking to Deep Subband Filtering for Improved Dereverberation

no code implementations • 1 Mar 2023 • Jean-Marie Lemercier, Julian Tobergte, Timo Gerkmann

We demonstrate that the resulting deep subband filtering scheme outperforms multiplicative masking for dereverberation, while leaving the denoising performance virtually the same.

Denoising

StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation

2 code implementations • 22 Dec 2022 • Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann

As diffusion models are generative approaches, they may also produce vocalizing and breathing artifacts in adverse conditions.

Speech Dereverberation

Uncertainty Estimation in Deep Speech Enhancement Using Complex Gaussian Mixture Models

no code implementations • 9 Dec 2022 • Huajian Fang, Timo Gerkmann

Single-channel deep speech enhancement approaches often estimate a single multiplicative mask to extract clean speech without a measure of its accuracy.

Speech Enhancement Uncertainty Quantification

DriftRec: Adapting diffusion models to blind JPEG restoration

1 code implementation • 12 Nov 2022 • Simon Welker, Henry N. Chapman, Timo Gerkmann

In this work, we utilize the high-fidelity generation abilities of diffusion models to solve blind JPEG restoration at high compression levels.

JPEG Artifact Removal

DiffPhase: Generative Diffusion-based STFT Phase Retrieval

no code implementations • 8 Nov 2022 • Tal Peer, Simon Welker, Timo Gerkmann

Diffusion probabilistic models have been recently used in a variety of tasks, including speech enhancement and synthesis.

Imputation Retrieval +1

Spatially Selective Deep Non-linear Filters for Speaker Extraction

no code implementations • 4 Nov 2022 • Kristina Tesch, Timo Gerkmann

In a scenario with multiple persons talking simultaneously, the spatial characteristics of the signals are the most distinct feature for extracting the target signal.

Speech Separation

Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration

1 code implementation • 4 Nov 2022 • Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann

In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks.

Bandwidth Extension Speech Denoising +1

Label Uncertainty Modeling and Prediction for Speech Emotion Recognition using t-Distributions

1 code implementation • 25 Jul 2022 • Navin Raj Prabhu, Nale Lehmann-Willenbrock, Timo Gerkmann

To address this, these emotion annotations are typically collected by multiple annotators and averaged across annotators in order to obtain labels for arousal and valence.

Speech Emotion Recognition

Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement

1 code implementation • 27 Jun 2022 • Kristina Tesch, Timo Gerkmann

The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing.

Speech Enhancement

On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement

1 code implementation • 22 Jun 2022 • Kristina Tesch, Nils-Hendrik Mohrmann, Timo Gerkmann

Employing deep neural networks (DNNs) to directly learn filters for multi-channel speech enhancement has two potential key advantages over a traditional approach combining a linear spatial filter with an independent tempo-spectral post-filter: 1) non-linear spatial filtering makes it possible to overcome potential restrictions originating from a linear processing model, and 2) joint processing of spatial and tempo-spectral information makes it possible to exploit interdependencies between different sources of information.

Speech Enhancement Speech Extraction

Beyond Griffin-Lim: Improved Iterative Phase Retrieval for Speech

no code implementations • 11 May 2022 • Tal Peer, Simon Welker, Timo Gerkmann

Phase retrieval is a problem encountered not only in speech and audio processing, but in many other fields such as optics.

Retrieval

A neural network-supported two-stage algorithm for lightweight dereverberation on hearing devices

no code implementations • 6 Apr 2022 • Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

By deriving new metrics analyzing the dereverberation performance in various time ranges, we confirm that directly optimizing for a criterion at the output of the multi-channel linear filtering stage results in a more efficient dereverberation as compared to placing the criterion at the output of the DNN to optimize the PSD estimation.

Customizable End-to-end Optimization of Online Neural Network-supported Dereverberation for Hearing Devices

no code implementations • 6 Apr 2022 • Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

This work focuses on online dereverberation for hearing devices using the weighted prediction error (WPE) algorithm.

Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments

no code implementations • 6 Apr 2022 • Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

In this paper, a neural network-augmented algorithm for noise-robust online dereverberation with a Kalman filtering variant of the weighted prediction error (WPE) method is proposed.

Denoising Speech Dereverberation

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

1 code implementation • 31 Mar 2022 • Simon Welker, Julius Richter, Timo Gerkmann

Score-based generative models (SGMs) have recently shown impressive results for difficult generative tasks such as the unconditional and conditional generation of natural images and audio signals.

Speech Enhancement

Phase-Aware Deep Speech Enhancement: It's All About The Frame Length

no code implementations • 30 Mar 2022 • Tal Peer, Timo Gerkmann

Algorithmic latency in speech processing is dominated by the frame length used for Fourier analysis, which in turn limits the achievable performance of magnitude-centric approaches.

Speech Enhancement

Integrating Statistical Uncertainty into Neural Network-Based Speech Enhancement

no code implementations • 4 Mar 2022 • Huajian Fang, Tal Peer, Stefan Wermter, Timo Gerkmann

Speech enhancement in the time-frequency domain is often performed by estimating a multiplicative mask to extract clean speech.

Speech Enhancement

Deep Iterative Phase Retrieval for Ptychography

no code implementations • 17 Feb 2022 • Simon Welker, Tal Peer, Henry N. Chapman, Timo Gerkmann

One of the most prominent challenges in the field of diffractive imaging is the phase retrieval (PR) problem: In order to reconstruct an object from its diffraction pattern, the inverse Fourier transform must be computed.

Retrieval

End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks

1 code implementation • 7 Oct 2021 • Navin Raj Prabhu, Guillaume Carbajal, Nale Lehmann-Willenbrock, Timo Gerkmann

At training, the network learns a distribution of weights to capture the inherent uncertainty related to subjective arousal annotations.

Speech Emotion Recognition

Disentanglement Learning for Variational Autoencoders Applied to Audio-Visual Speech Enhancement

1 code implementation • 19 May 2021 • Guillaume Carbajal, Julius Richter, Timo Gerkmann

In this work, we propose to use an adversarial training scheme for variational autoencoders to disentangle the label from the other latent variables.

Attribute Decoder +2

Nonlinear Spatial Filtering in Multichannel Speech Enhancement

no code implementations • 22 Apr 2021 • Kristina Tesch, Timo Gerkmann

Rather, the MMSE optimal filter is a joint spatial and spectral nonlinear function.

Speech Enhancement

Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder

no code implementations • 17 Feb 2021 • Huajian Fang, Guillaume Carbajal, Stefan Wermter, Timo Gerkmann

Recently, a generative variational autoencoder (VAE) has been proposed for speech enhancement to model speech statistics.

Speech Enhancement

Guided Variational Autoencoder for Speech Enhancement With a Supervised Classifier

no code implementations • 12 Feb 2021 • Guillaume Carbajal, Julius Richter, Timo Gerkmann

In this paper, we propose to guide the variational autoencoder with a supervised classifier separately trained on noisy speech.

Speech Enhancement

Efficient Joint Estimation of Tracer Distribution and Background Signals in Magnetic Particle Imaging using a Dictionary Approach

1 code implementation • 10 Jun 2020 • Tobias Knopp, Mirco Grosser, Matthias Graeser, Timo Gerkmann, Martin Möddel

Background signals are a primary source of artifacts in magnetic particle imaging and limit the sensitivity of the method since background signals are often not precisely known and vary over time.

SNR-Based Features and Diverse Training Data for Robust DNN-Based Speech Enhancement

no code implementations • 7 Apr 2020 • Robert Rehr, Timo Gerkmann

In this paper, we address the generalization of deep neural network (DNN) based speech enhancement to unseen noise conditions for the case that training data is limited in size and diversity.

Diversity Speech Enhancement

Robust Robotic Pouring using Audition and Haptics

1 code implementation • 29 Feb 2020 • Hongzhuo Liang, Chuangchuang Zhou, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Marcus Stoffel, Jianwei Zhang

Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment.

A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet

1 code implementation • 25 Oct 2019 • David Ditter, Timo Gerkmann

In this work, we investigate if the learned encoder of the end-to-end convolutional time domain audio separation network (Conv-TasNet) is the key to its recent success, or if the encoder can just as well be replaced by a deterministic hand-crafted filterbank.

Low-latency processing Speech Separation

Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring

1 code implementation • 2 Mar 2019 • Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Jianwei Zhang

PouringNet is trained on our collected real-world pouring dataset with multimodal sensing data, which contains more than 3000 recordings of audio, force feedback, video and trajectory data of the human hand that performs the pouring task.

Robotics Sound Audio and Speech Processing

DNN and CNN with Weighted and Multi-task Loss Functions for Audio Event Detection

no code implementations • 10 Aug 2017 • Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, Alfred Mertins

Our proposed systems significantly outperform the challenge baseline, improving F-score from 72.7% to 90.0% and reducing detection error rate from 0.53 to 0.18 on average on the development data.

Event Detection Task 2
