Search Results for author: Zheng-Hua Tan

Found 60 papers, 11 papers with code

Investigating the Design Space of Diffusion Models for Speech Enhancement

no code implementations • 7 Dec 2023 • Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals.

Image Generation Speech Enhancement

Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler

no code implementations • 5 Dec 2023 • Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions.

Image Generation Speech Enhancement

Joint Minimum Processing Beamforming and Near-end Listening Enhancement

no code implementations • 20 Sep 2023 • Andreas J. Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars S. Bertelsen, Jens Christian Lindof, Jan Østergaard

In this paper, we formulate a joint far- and near-end minimum processing framework that improves intelligibility while limiting speech distortions in favorable noise conditions.

Speech Enhancement

Speech inpainting: Context-based speech synthesis guided by video

no code implementations • 1 Jun 2023 • Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen

Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds.

Speech Recognition +1

Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners

no code implementations • 1 Jun 2023 • Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan

In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows.

PAC-Bayesian bounds for learning LTI-ss systems with input from empirical loss

no code implementations • 29 Mar 2023 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafael Wisniewski, Mihaly Petreczky

In this paper we derive a Probably Approximately Correct (PAC)-Bayesian error bound for linear time-invariant (LTI) stochastic dynamical systems with inputs.

PAC-Bayesian-Like Error Bound for a Class of Linear Time-Invariant Stochastic State-Space Models

no code implementations • 30 Dec 2022 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafal Wisniewski, Mihaly Petreczky

In this paper we derive a PAC-Bayesian-Like error bound for a class of stochastic dynamical systems with inputs, namely, for linear time-invariant stochastic state-space models (stochastic LTI systems for short).


Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

no code implementations • 19 Nov 2022 • Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen, John H. L. Hansen

In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance.

Small-Footprint Keyword Spotting

Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

no code implementations • 15 Nov 2022 • Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

For the prior of the speaker identity variable, FHVAE assumes a Gaussian distribution with an utterance-scale varying mean and a fixed variance.

Contrastive Learning Disentanglement +4

Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise

no code implementations • 3 Nov 2022 • Christian Heider Nielsen, Zheng-Hua Tan

Extensive experiments show that the inverse filter bank features generally perform better in both clean and noisy environments, that detection is effective using either the speech or the non-speech part, and that acoustic noise can largely degrade detection performance.

Automatic Speech Recognition (ASR) +1

Minimum Processing Near-end Listening Enhancement

no code implementations • 31 Oct 2022 • Andreas Jonas Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars Søndergaard Bertelsen, Jens Christian Lindof, Jan Østergaard

The intelligibility and quality of speech from a mobile phone or public announcement system are often affected by background noise in the listening environment.

Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining

1 code implementation • 4 Oct 2022 • Holger Severin Bovbjerg, Zheng-Hua Tan

This paper explores the effectiveness of SSL on small models for KWS and establishes that SSL can enhance the performance of small KWS models when labelled data is scarce.

Keyword Spotting Self-Supervised Learning

Adversarial Multi-Task Deep Learning for Noise-Robust Voice Activity Detection with Low Algorithmic Delay

1 code implementation • 4 Jul 2022 • Claus Meyer Larsen, Peter Koch, Zheng-Hua Tan

Introducing adversarial multi-task learning to the model is observed to increase performance in terms of Area Under Curve (AUC), particularly in noisy environments, while the performance is not degraded at higher SNR levels.

Action Detection Activity Detection +1

Floor Map Reconstruction Through Radio Sensing and Learning By a Large Intelligent Surface

no code implementations • 21 Jun 2022 • Cristian J. Vaca-Rubio, Roberto Pereira, Xavier Mestre, David Gregoratti, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski

Environmental scene reconstruction is of great interest for autonomous robotic applications, since an accurate representation of the environment is necessary to ensure safe interaction with robots.

User Localization using RF Sensing: A Performance comparison between LIS and mmWave Radars

no code implementations • 17 May 2022 • Cristian J. Vaca-Rubio, Dariush Salami, Petar Popovski, Elisabeth de Carvalho, Zheng-Hua Tan, Stephan Sigg

Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart-home, retail, gesture recognition, intrusion detection, etc.

Gesture Recognition Intrusion Detection

Disentangled Speech Representation Learning Based on Factorized Hierarchical Variational Autoencoder with Self-Supervised Objective

no code implementations • 5 Apr 2022 • Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

As a self-supervised objective, autoregressive predictive coding (APC), on the other hand, has been used in extracting meaningful and transferable speech features for multiple downstream tasks.

Disentanglement Speaker Recognition +3

Complex Recurrent Variational Autoencoder with Application to Speech Enhancement

1 code implementation • 5 Apr 2022 • Yuying Xie, Thomas Arildsen, Zheng-Hua Tan

This work proposes a complex recurrent VAE framework in which a complex-valued recurrent neural network and an L1 reconstruction loss are used.

Speech Enhancement

Deep Spoken Keyword Spotting: An Overview

no code implementations • 20 Nov 2021 • Iván López-Espejo, Zheng-Hua Tan, John Hansen, Jesper Jensen

Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago.

Automatic Speech Recognition (ASR) +2

Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

no code implementations • 15 Nov 2021 • Andreas Jonas Fuglsig, Jan Østergaard, Jesper Jensen, Lars Søndergaard Bertelsen, Peter Mariager, Zheng-Hua Tan

However, the existing optimal mutual information based method requires a complicated system model that includes natural speech variations, and relies on approximations and assumptions of the underlying signal distributions.

Speech Enhancement

Radio Sensing with Large Intelligent Surface for 6G

no code implementations • 4 Nov 2021 • Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho

This paper leverages the potential of Large Intelligent Surface (LIS) for radio sensing in 6G wireless networks.

Template Matching

Explicit construction of the minimum error variance estimator for stochastic LTI state-space systems

no code implementations • 6 Sep 2021 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafal Wisniewski, Mihaly Petreczky

In this short article, we showcase the derivation of the optimal (minimum error variance) estimator when one part of the stochastic LTI system output is not measured but can be predicted from the measured system outputs.

Data Generation Using Pass-phrase-dependent Deep Auto-encoders for Text-Dependent Speaker Verification

no code implementations • 3 Feb 2021 • Achintya Kumar Sarkar, Md Sahidullah, Zheng-Hua Tan

In this paper, we propose a novel method that trains pass-phrase specific deep neural network (PP-DNN) based auto-encoders for creating augmented data for text-dependent speaker verification (TD-SV).

Decision Making Text-Dependent Speaker Verification +1

Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding

no code implementations • 25 Nov 2020 • Achintya kr. Sarkar, Zheng-Hua Tan

In this letter, we propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV), in which a set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision.

Text-Dependent Speaker Verification

Assessing Wireless Sensing Potential with Large Intelligent Surfaces

no code implementations • 16 Nov 2020 • Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski

By treating an LIS as a radio image of the environment relying on the received signal power, we develop techniques to sense the environment, by leveraging the tools of image processing and machine learning.

BIG-bench Machine Learning Denoising +1

Audio-Visual Speech Inpainting with Deep Learning

no code implementations • 9 Oct 2020 • Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information.

Multi-Task Learning

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

1 code implementation • 21 Aug 2020 • Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources.

Speech Enhancement Speech Separation

UIAI System for Short-Duration Speaker Verification Challenge 2020

no code implementations • 26 Jul 2020 • Md Sahidullah, Achintya Kumar Sarkar, Ville Vestman, Xuechen Liu, Romain Serizel, Tomi Kinnunen, Zheng-Hua Tan, Emmanuel Vincent

Our primary submission to the challenge is the fusion of seven subsystems, which yields a normalized minimum detection cost function (minDCF) of 0.072 and an equal error rate (EER) of 2.14% on the evaluation set.

Text-Dependent Speaker Verification

Exploring Filterbank Learning for Keyword Spotting

no code implementations • 30 May 2020 • Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen

Despite their great performance over the years, handcrafted speech features are not necessarily optimal for any particular speech application.

Keyword Spotting

On Bottleneck Features for Text-Dependent Speaker Verification Using X-vectors

no code implementations • 15 May 2020 • Achintya Kumar Sarkar, Zheng-Hua Tan

We further investigate the impact of the different bottleneck (BN) features on the performance of x-vectors, including the recently-introduced time-contrastive-learning (TCL) BN features and phone-discriminant BN features.

Contrastive Learning Text-Dependent Speaker Verification +2

OSLNet: Deep Small-Sample Classification with an Orthogonal Softmax Layer

1 code implementation • 20 Apr 2020 • Xiaoxu Li, Dongliang Chang, Zhanyu Ma, Zheng-Hua Tan, Jing-Hao Xue, Jie Cao, Jingyi Yu, Jun Guo

A deep neural network of multiple nonlinear layers forms a large function space, which can easily lead to overfitting when it encounters small-sample data.

Classification General Classification

Relaxed N-Pairs Loss for Context-Aware Recommendations of Television Content

no code implementations • 4 Feb 2020 • Miklas S. Kristoffersen, Sven E. Shepstone, Zheng-Hua Tan

This embedding space is used for exploring relevant content in various viewing settings by applying an N-pairs loss objective as well as a relaxed variant proposed in this paper.

Metric Learning

Adversarial Example Detection by Classification for Deep Speech Recognition

1 code implementation • 22 Oct 2019 • Saeid Samizade, Zheng-Hua Tan, Chao Shen, Xiaohong Guan

Machine learning systems are vulnerable to adversarial attacks and are highly likely to produce incorrect outputs under these attacks.

Classification General Classification +3

Deep Joint Embeddings of Context and Content for Recommendation

no code implementations • 13 Sep 2019 • Miklas S. Kristoffersen, Jacob L. Wieland, Sven E. Shepstone, Zheng-Hua Tan, Vinoba Vinayagamoorthy

This paper proposes a deep learning-based method for learning joint context-content embeddings (JCCE) with a view to context-aware recommendations, and demonstrates its application in the television domain.

Metric Learning

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

no code implementations • 3 Sep 2019 • Morten Kolbæk, Zheng-Hua Tan, Søren Holdt Jensen, Jesper Jensen

Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.

Speech Enhancement

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

no code implementations • 22 Jun 2019 • Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen

Our results show that this multi-task deep residual network achieves a relative improvement in KWS accuracy of around 32% with respect to a system that does not deal with external speakers.

Keyword Spotting Multi-Task Learning

rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method

3 code implementations • 9 Jun 2019 • Zheng-Hua Tan, Achintya kr. Sarkar, Najim Dehak

In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity.

Action Detection Activity Detection +3

Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

no code implementations • 29 May 2019 • Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen

Regarding speech intelligibility, we find a general tendency toward a benefit from training the systems with Lombard speech.

Speech Enhancement

Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

no code implementations • 11 May 2019 • Achintya kr. Sarkar, Zheng-Hua Tan, Hao Tang, Suwon Shon, James Glass

There are a number of studies on the extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV).

Automatic Speech Recognition (ASR) +4

Subjective Annotations for Vision-Based Attention Level Estimation

no code implementations • 12 Dec 2018 • Andrea Coifman, Péter Rohoska, Miklas S. Kristoffersen, Sven E. Shepstone, Zheng-Hua Tan

Attention level estimation systems have high potential in many use cases, such as human-robot interaction, driver modeling and smart home systems, since being able to measure a person's attention level opens up the possibility of natural interaction between humans and computers.

On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

no code implementations • 15 Nov 2018 • Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen

Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker.

Speech Enhancement

Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems

no code implementations • 15 Nov 2018 • Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen

Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as the Lombard effect.

Speech Enhancement

The Importance of Context When Recommending TV Content: Dataset and Algorithms

no code implementations • 30 Jul 2018 • Miklas S. Kristoffersen, Sven E. Shepstone, Zheng-Hua Tan

Home entertainment systems feature in a variety of usage scenarios with one or more simultaneous users, for whom the complexity of choosing media to consume has increased rapidly over the last decade.

Recommendation Systems

A Parallel/Distributed Algorithmic Framework for Mining All Quantitative Association Rules

no code implementations • 18 Apr 2018 • Ioannis T. Christou, Emmanouil Amolochitis, Zheng-Hua Tan

We present QARMA, an efficient novel parallel algorithm for mining all Quantitative Association Rules in large multidimensional datasets where items are required to have at least a single common attribute to be specified in the rule's single consequent item.


Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure

no code implementations • 2 Feb 2018 • Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen

Finally, we show that the proposed SE system performs on par with a traditional DNN-based Short-Time Spectral Amplitude (STSA) SE system in terms of estimated speech intelligibility.

Sound Audio and Speech Processing

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

no code implementations • 6 Sep 2017 • Daniel Michelsanti, Zheng-Hua Tan

Improving speech system performance in noisy environments remains a challenging task, and speech enhancement (SE) is one of the effective techniques to solve the problem.

Speaker Verification Speech Enhancement

Joint Separation and Denoising of Noisy Multi-talker Speech using Recurrent Neural Networks and Permutation Invariant Training

no code implementations • 31 Aug 2017 • Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen

We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs).


Decorrelation of Neutral Vector Variables: Theory and Applications

no code implementations • 30 May 2017 • Zhanyu Ma, Jing-Hao Xue, Arne Leijon, Zheng-Hua Tan, Zhen Yang, Jun Guo

In this paper, we propose novel strategies for neutral vector variable decorrelation.

Time-Contrastive Learning Based DNN Bottleneck Features for Text-Dependent Speaker Verification

no code implementations • 6 Apr 2017 • Achintya Kr. Sarkar, Zheng-Hua Tan

It is well known that speech signals exhibit quasi-stationary behavior only within a short interval, and the TCL method aims to exploit this temporal structure.

Contrastive Learning Text-Dependent Speaker Verification

Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks

3 code implementations • 18 Mar 2017 • Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen

We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on Non-negative Matrix Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network (DANet).

Clustering Deep Clustering +1

DNN Filter Bank Cepstral Coefficients for Spoofing Detection

no code implementations • 13 Feb 2017 • Hong Yu, Zheng-Hua Tan, Zhanyu Ma, Jun Guo

To improve the reliability of speaker verification systems, we develop a new filter-bank-based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech.

Speaker Verification Speech Synthesis

Incorporating Pass-Phrase Dependent Background Models for Text-Dependent Speaker Verification

no code implementations • 19 Nov 2016 • A. K. Sarkar, Zheng-Hua Tan

In this paper, we propose pass-phrase dependent background models (PBMs) for text-dependent (TD) speaker verification (SV) to integrate the pass-phrase identification process into the conventional TD-SV system, where a PBM is derived from a text-independent background model through adaptation using the utterances of a particular pass-phrase.

Test Text-Dependent Speaker Verification

Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation

1 code implementation • 1 Jul 2016 • Dong Yu, Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen

We propose a novel deep learning model, which supports permutation invariant training (PIT), for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem.

Clustering Deep Clustering +2
