1 code implementation • 10 Jan 2025 • Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan
While attention-based architectures, such as Conformers, excel in speech enhancement, they face challenges such as scalability with respect to input sequence length.
Ranked #7 on Speech Enhancement on VoiceBank + DEMAND
no code implementations • 7 Jan 2025 • Achintya kr. Sarkar, Priyanka Dwivedi, Zheng-Hua Tan
During testing, the VTL features with different warping factors of a test utterance are scored against the DNN and combined with equal weight.
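For illustration, equal-weight score fusion over warping factors reduces to a simple average of per-warp scores; a minimal sketch (the warping-factor values and the `scores` dict are hypothetical):

```python
import numpy as np

def fuse_vtl_scores(scores_per_warp):
    """Equal-weight fusion of per-warp-factor scores.

    `scores_per_warp` is a hypothetical dict mapping a VTL warping
    factor (e.g. 0.88 ... 1.12) to the DNN score of one test utterance.
    """
    return float(np.mean(list(scores_per_warp.values())))

# Example: scores of one utterance under five hypothetical warping factors.
scores = {0.88: 1.3, 0.94: 1.7, 1.00: 2.1, 1.06: 1.9, 1.12: 1.2}
print(f"fused score: {fuse_vtl_scores(scores):.3f}")
```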
no code implementations • 6 Jan 2025 • Holger Severin Bovbjerg, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan
This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.
1 code implementation • 3 Oct 2024 • Gustav Wagner Zakarias, Lars Kai Hansen, Zheng-Hua Tan
In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning.
1 code implementation • 12 Sep 2024 • Nikolai L. Kühne, Astrid H. F. Kitchen, Marie S. Jensen, Mikkel S. L. Brøndt, Martin Gonzalez, Christophe Biscio, Zheng-Hua Tan
In this paper, we systematically investigate the use of DMs for defending against adversarial attacks on sentences and examine the effect of varying forward diffusion steps.
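For orientation, diffusion-based purification typically noises the (possibly adversarial) input for a chosen number of forward steps in closed form and then denoises it; a minimal sketch assuming a hypothetical trained `denoiser(x, t)` and a standard linear beta schedule, not the paper's exact pipeline:

```python
import torch

def purify(embeddings, denoiser, alphas_cumprod, k):
    """Diffusion purification sketch: run k forward steps in closed
    form, then ask a (hypothetical) trained `denoiser` to reconstruct
    the input, washing out adversarial perturbations in the process."""
    a_bar = alphas_cumprod[k]                       # cumulative noise schedule
    noise = torch.randn_like(embeddings)
    noised = a_bar.sqrt() * embeddings + (1 - a_bar).sqrt() * noise
    return denoiser(noised, k)                      # reverse process (assumed API)

# Toy usage with an identity "denoiser" purely to exercise the shapes.
T = 50
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x = torch.randn(4, 768)                             # batch of sentence embeddings
print(purify(x, lambda z, t: z, alphas_cumprod, k=10).shape)
```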
no code implementations • 5 Sep 2024 • Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach
Speech signals encompass various information across multiple levels including content, speaker, and style.
no code implementations • 10 Jun 2024 • Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May
Diffusion models are typically trained with massive datasets for image generation tasks, but whether this is also required for speech enhancement is unknown.
1 code implementation • 4 Jun 2024 • Sarthak Yadav, Zheng-Hua Tan
Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations.
1 code implementation • 27 Mar 2024 • Jacob Mørk, Holger Severin Bovbjerg, Gergely Kiss, Zheng-Hua Tan
Modern KWS systems are mainly trained using supervised learning methods and require a large amount of labelled data to achieve good performance.
no code implementations • 15 Mar 2024 • Peter Leer, Jesper Jensen, Laurel H. Carney, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw
First, we introduce a framework for emulating auditory models using DNNs, focusing on an auditory-nerve model in the auditory pathway.
1 code implementation • 15 Mar 2024 • Peter Leer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard, Lars Bramsløw
Our results show that this new optimization objective significantly improves the emulation performance of deep neural networks across relevant input sound levels and auditory-model frequency channels, without increasing the computational load during inference.
no code implementations • 27 Dec 2023 • Holger Severin Bovbjerg, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models which are more robust to adverse conditions compared to purely supervised learning.
1 code implementation • 15 Dec 2023 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafal Wisniewski, Mihaly Petreczky
In this paper, we derive a PAC-Bayes bound on the generalisation gap, in a supervised time-series setting for a special class of discrete-time non-linear dynamical systems.
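For orientation, the classical i.i.d. PAC-Bayes bound that such results generalize to dynamical systems has the following form (background only; this is not the bound derived in the paper):

```latex
% Classical PAC-Bayes bound (Maurer's form) for i.i.d. data and a
% [0,1]-valued loss; the papers above extend this style of guarantee
% to non-i.i.d. time series generated by dynamical systems.
\[
\mathbb{E}_{h \sim \rho}\!\left[L(h)\right]
\;\le\;
\mathbb{E}_{h \sim \rho}\!\left[\widehat{L}_n(h)\right]
+ \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\!\frac{2\sqrt{n}}{\delta}}{2n}},
\]
% with probability at least $1-\delta$, simultaneously for all posteriors
% $\rho$, where $\pi$ is a prior fixed before seeing the $n$ samples.
```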
1 code implementation • 7 Dec 2023 • Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May
To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals.
no code implementations • 5 Dec 2023 • Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May
We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions.
no code implementations • 20 Sep 2023 • Andreas J. Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars S. Bertelsen, Jens Christian Lindof, Jan Østergaard
Results show that the joint optimization can further improve performance compared to the concatenated approach.
no code implementations • 1 Jun 2023 • Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen
Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds.
no code implementations • 1 Jun 2023 • Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan
In this work, we propose a Multi-Window Masked Autoencoder (MW-MAE) fitted with a novel Multi-Window Multi-Head Attention (MW-MHA) module that facilitates the modelling of local-global interactions in every decoder transformer block through attention heads of several distinct local and global windows.
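A minimal sketch of the multi-window idea, assuming untrained shared projections and illustrative window sizes (not the paper's exact MW-MHA module):

```python
import torch
import torch.nn.functional as F

def multi_window_attention(x, num_heads=4, windows=(4, 16, 64, None)):
    """Each head attends within a different local window (None = global),
    so a single block mixes local and global context. Head count and
    window sizes here are illustrative, not the paper's configuration."""
    B, T, D = x.shape
    d = D // num_heads
    q = x.view(B, T, num_heads, d).transpose(1, 2)   # (B, H, T, d)
    k, v = q, q                                      # untrained sketch: shared projections
    idx = torch.arange(T)
    out = []
    for h, w in enumerate(windows):
        att = (q[:, h] @ k[:, h].transpose(-1, -2)) / d ** 0.5
        if w is not None:                            # banded mask of width w
            mask = (idx[None, :] - idx[:, None]).abs() > w // 2
            att = att.masked_fill(mask, float("-inf"))
        out.append(F.softmax(att, dim=-1) @ v[:, h])
    return torch.cat(out, dim=-1)                    # (B, T, D)

print(multi_window_attention(torch.randn(2, 100, 256)).shape)
```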
no code implementations • 29 Mar 2023 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafael Wisniewski, Mihaly Petreczky
In this paper we derive a Probably Approximately Correct (PAC)-Bayesian error bound for linear time-invariant (LTI) stochastic dynamical systems with inputs.
no code implementations • 30 Dec 2022 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafal Wisniewski, Mihaly Petreczky
In this paper we derive a PAC-Bayesian-Like error bound for a class of stochastic dynamical systems with inputs, namely, for linear time-invariant stochastic state-space models (stochastic LTI systems for short).
no code implementations • 19 Nov 2022 • Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen, John H. L. Hansen
In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance.
no code implementations • 15 Nov 2022 • Yuying Xie, Thomas Arildsen, Zheng-Hua Tan
For the prior of the speaker identity variable, FHVAE assumes a Gaussian distribution with an utterance-scale varying mean and a fixed variance.
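A minimal sketch of the corresponding KL term, pulling the sequence-level latent toward an utterance-specific mean under a fixed prior variance (the value 0.25 is illustrative, not the paper's parameterization):

```python
import math
import torch

def kl_to_utterance_prior(mu_q, logvar_q, mu_utt, var_prior=0.25):
    """KL( N(mu_q, var_q) || N(mu_utt, var_prior) ), summed over dims.

    Mirrors the FHVAE-style prior described above: an utterance-scale
    varying mean `mu_utt` and a fixed prior variance `var_prior`."""
    var_q = logvar_q.exp()
    kl = 0.5 * (var_q / var_prior
                + (mu_q - mu_utt) ** 2 / var_prior
                - 1.0
                + math.log(var_prior) - logvar_q)
    return kl.sum(dim=-1)

print(kl_to_utterance_prior(torch.zeros(8, 16), torch.zeros(8, 16),
                            torch.ones(8, 16)).shape)
```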
no code implementations • 3 Nov 2022 • Christian Heider Nielsen, Zheng-Hua Tan
Extensive experiments show that the inverse filter bank features generally perform better in both clean and noisy environments, that the detection is effective using either the speech or the non-speech part, and that acoustic noise can largely degrade the detection performance.
Automatic Speech Recognition (ASR)
no code implementations • 31 Oct 2022 • Andreas Jonas Fuglsig, Jesper Jensen, Zheng-Hua Tan, Lars Søndergaard Bertelsen, Jens Christian Lindof, Jan Østergaard
The intelligibility and quality of speech from a mobile phone or public announcement system are often affected by background noise in the listening environment.
2 code implementations • 4 Oct 2022 • Holger Severin Bovbjerg, Zheng-Hua Tan
This paper explores the effectiveness of SSL on small models for KWS and establishes that SSL can enhance the performance of small KWS models when labelled data is scarce.
1 code implementation • 4 Jul 2022 • Claus Meyer Larsen, Peter Koch, Zheng-Hua Tan
Introducing adversarial multi-task learning to the model is observed to increase performance in terms of Area Under Curve (AUC), particularly in noisy environments, while the performance is not degraded at higher SNR levels.
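Adversarial multi-task learning of this kind is commonly implemented with a gradient reversal layer; a minimal sketch of that mechanism (the paper's exact setup may differ):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; gradient scaled by -lambda on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# An adversarial noise-condition classifier fed through this layer pushes
# the shared encoder toward features that are useful yet noise-invariant.
feats = torch.randn(4, 32, requires_grad=True)
grad_reverse(feats, lam=0.5).sum().backward()
print(feats.grad[0, 0])   # -0.5: sign flipped and scaled
```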
no code implementations • 21 Jun 2022 • Cristian J. Vaca-Rubio, Roberto Pereira, Xavier Mestre, David Gregoratti, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski
Environmental scene reconstruction is of great interest for autonomous robotic applications, since an accurate representation of the environment is necessary to ensure safe interaction with robots.
no code implementations • 17 May 2022 • Cristian J. Vaca-Rubio, Dariush Salami, Petar Popovski, Elisabeth de Carvalho, Zheng-Hua Tan, Stephan Sigg
Since electromagnetic signals are omnipresent, Radio Frequency (RF)-sensing has the potential to become a universal sensing mechanism with applications in localization, smart-home, retail, gesture recognition, intrusion detection, etc.
1 code implementation • 5 Apr 2022 • Yuying Xie, Thomas Arildsen, Zheng-Hua Tan
This work proposes a complex recurrent VAE framework, specifically in which complex-valued recurrent neural network and L1 reconstruction loss are used.
no code implementations • 5 Apr 2022 • Yuying Xie, Thomas Arildsen, Zheng-Hua Tan
As a self-supervised objective, autoregressive predictive coding (APC), on the other hand, has been used in extracting meaningful and transferable speech features for multiple downstream tasks.
no code implementations • 17 Jan 2022 • Achintya kr. Sarkar, Zheng-Hua Tan
Furthermore, we study a range of loss functions when speaker identity is used as the training target.
no code implementations • 20 Nov 2021 • Iván López-Espejo, Zheng-Hua Tan, John Hansen, Jesper Jensen
Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago.
Automatic Speech Recognition (ASR)
no code implementations • 15 Nov 2021 • Andreas Jonas Fuglsig, Jan Østergaard, Jesper Jensen, Lars Søndergaard Bertelsen, Peter Mariager, Zheng-Hua Tan
However, the existing optimal mutual information based method requires a complicated system model that includes natural speech variations, and relies on approximations and assumptions of the underlying signal distributions.
no code implementations • 4 Nov 2021 • Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho
This paper leverages the potential of Large Intelligent Surface (LIS) for radio sensing in 6G wireless networks.
no code implementations • 6 Sep 2021 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafal Wisniewski, Mihaly Petreczky
In this short article, we showcase the derivation of the optimal (minimum error variance) estimator, when one part of the stochastic LTI system output is not measured but can be predicted from the measured system outputs.
no code implementations • 12 Apr 2021 • Max Væhrens, Andreas Jonas Fuglsig, Anders Post Jacobsen, Nicolai Almskou Rasmussen, Victor Mølbach Nissen, Joachim Roland Hejslet, Zheng-Hua Tan
It might therefore be advantageous to improve SVAD methods with pre-processing to obtain superior VAD, an approach that remains under-explored.
1 code implementation • 2 Apr 2021 • Wei Rao, Yihui Fu, Yanxin Hu, Xin Xu, Yvkai Jv, Jiangyu Han, Zhongjie Jiang, Lei Xie, Yannan Wang, Shinji Watanabe, Zheng-Hua Tan, Hui Bu, Tao Yu, Shidong Shang
The ConferencingSpeech 2021 challenge is proposed to stimulate research on far-field multi-channel speech enhancement for video conferencing.
no code implementations • 23 Mar 2021 • Deividas Eringis, John Leth, Zheng-Hua Tan, Rafal Wisniewski, Alireza Fakhrizadeh Esfahani, Mihaly Petreczky
In this paper we derive a PAC-Bayesian error bound for autonomous stochastic LTI state-space models.
no code implementations • 3 Feb 2021 • Achintya Kumar Sarkar, Md Sahidullah, Zheng-Hua Tan
In this paper, we propose a novel method that trains pass-phrase specific deep neural network (PP-DNN) based auto-encoders for creating augmented data for text-dependent speaker verification (TD-SV).
no code implementations • 25 Nov 2020 • Achintya kr. Sarkar, Zheng-Hua Tan
In this letter, we propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV), in which a set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision.
no code implementations • 16 Nov 2020 • Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski
By treating an LIS as a radio image of the environment relying on the received signal power, we develop techniques to sense the environment, by leveraging the tools of image processing and machine learning.
no code implementations • 12 Oct 2020 • Zeyu Song, Dongliang Chang, Zhanyu Ma, Xiaoxu Li, Zheng-Hua Tan
The loss function is a key component in deep learning models.
1 code implementation • 11 Oct 2020 • Jiyang Xie, Zhanyu Ma, Jianjun Lei, Guoqiang Zhang, Jing-Hao Xue, Zheng-Hua Tan, Jun Guo
Due to lack of data, overfitting ubiquitously exists in real-world applications of deep neural networks (DNNs).
no code implementations • 9 Oct 2020 • Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen
In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information.
1 code implementation • 21 Aug 2020 • Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen
Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources.
no code implementations • 26 Jul 2020 • Md Sahidullah, Achintya Kumar Sarkar, Ville Vestman, Xuechen Liu, Romain Serizel, Tomi Kinnunen, Zheng-Hua Tan, Emmanuel Vincent
Our primary submission to the challenge is the fusion of seven subsystems, which yields a normalized minimum detection cost function (minDCF) of 0.072 and an equal error rate (EER) of 2.14% on the evaluation set.
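The EER reported above is a standard verification metric; a minimal sketch of how it can be computed from raw trial scores (not the challenge's official scoring tool):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Equal error rate: the operating point where the false-accept rate
    equals the false-reject rate. `labels` is 1 for target trials and 0
    for non-target trials. A simple sweep, fine for small trial lists."""
    order = np.argsort(scores)[::-1]                 # descending score
    labels = np.asarray(labels, float)[order]
    fa = np.cumsum(1 - labels) / (1 - labels).sum()  # accept top-k trials
    fr = 1 - np.cumsum(labels) / labels.sum()        # reject the remainder
    i = np.argmin(np.abs(fa - fr))
    return (fa[i] + fr[i]) / 2

# Toy usage on synthetic target / non-target score distributions.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER ~= {100 * equal_error_rate(scores, labels):.2f}%")
```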
no code implementations • 11 Jun 2020 • Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Robin Jess Williams, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski
One of the beyond-5G developments that is often highlighted is the integration of wireless communication and radio sensing.
no code implementations • 30 May 2020 • Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen
Despite their great performance over the years, handcrafted speech features are not necessarily optimal for any particular speech application.
no code implementations • 15 May 2020 • Achintya Kumar Sarkar, Zheng-Hua Tan
We further investigate the impact of the different bottleneck (BN) features on the performance of x-vectors, including the recently-introduced time-contrastive-learning (TCL) BN features and phone-discriminant BN features.
1 code implementation • 20 Apr 2020 • Xiaoxu Li, Dongliang Chang, Zhanyu Ma, Zheng-Hua Tan, Jing-Hao Xue, Jie Cao, Jingyi Yu, Jun Guo
A deep neural network of multiple nonlinear layers forms a large function space, which can easily lead to overfitting when it encounters small-sample data.
no code implementations • 6 Apr 2020 • Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen
Both acoustic and visual information influence human perception of speech.
no code implementations • 4 Feb 2020 • Miklas S. Kristoffersen, Sven E. Shepstone, Zheng-Hua Tan
This embedding space is used for exploring relevant content in various viewing settings by applying an N-pairs loss objective as well as a relaxed variant proposed in this paper.
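For reference, the N-pairs objective treats each anchor's own positive as the correct class among all positives in the batch; a minimal sketch with hypothetical context/content embeddings:

```python
import torch
import torch.nn.functional as F

def n_pairs_loss(anchors, positives):
    """N-pairs loss: each anchor should score its own positive higher
    than the other N-1 positives in the batch, implemented as
    cross-entropy over the anchor-positive similarity matrix."""
    logits = anchors @ positives.t()                 # (N, N) similarities
    targets = torch.arange(anchors.size(0))          # diagonal is the match
    return F.cross_entropy(logits, targets)

# Toy usage: 8 context embeddings paired with 8 content embeddings.
a, p = torch.randn(8, 64), torch.randn(8, 64)
print(n_pairs_loss(a, p))
```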
1 code implementation • 22 Oct 2019 • Saeid Samizade, Zheng-Hua Tan, Chao Shen, Xiaohong Guan
Machine Learning systems are vulnerable to adversarial attacks and are highly likely to produce incorrect outputs under these attacks.
no code implementations • 13 Sep 2019 • Miklas S. Kristoffersen, Jacob L. Wieland, Sven E. Shepstone, Zheng-Hua Tan, Vinoba Vinayagamoorthy
This paper proposes a deep learning-based method for learning joint context-content embeddings (JCCE) with a view to context-aware recommendations, and demonstrates its application in the television domain.
no code implementations • 3 Sep 2019 • Morten Kolbæk, Zheng-Hua Tan, Søren Holdt Jensen, Jesper Jensen
Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.
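For reference, SI-SDR projects the estimate onto the target to remove scale before measuring distortion; a minimal sketch usable as a negative training loss:

```python
import torch

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB; negate for a loss."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to factor out scale.
    scale = (estimate * target).sum(-1, keepdim=True) \
            / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1)
                            / (e_noise.pow(2).sum(-1) + eps) + eps)

clean = torch.randn(2, 16000)
enhanced = clean + 0.1 * torch.randn(2, 16000)
print((-si_sdr(enhanced, clean).mean()))   # training loss: maximize SI-SDR
```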
no code implementations • 22 Jun 2019 • Iván López-Espejo, Zheng-Hua Tan, Jesper Jensen
Our results show that this multi-task deep residual network is able to achieve a relative improvement in KWS accuracy of around 32% with respect to a system that does not deal with external speakers.
3 code implementations • 9 Jun 2019 • Zheng-Hua Tan, Achintya kr. Sarkar, Najim Dehak
In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity.
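A heavily simplified sketch of the idea, weighting frame-to-frame energy differences by an a posteriori SNR estimate (not the exact rVAD formulation; the noise-floor estimate and threshold are illustrative):

```python
import numpy as np

def snr_weighted_energy_difference(frames, noise_floor=None):
    """Simplified sketch: frame-to-frame energy changes are weighted by
    each frame's a posteriori SNR, so changes in high-SNR regions
    dominate the voice-activity decision."""
    energy = (frames ** 2).sum(axis=1) + 1e-10
    if noise_floor is None:
        noise_floor = np.percentile(energy, 10)      # crude noise estimate
    snr_post = energy / noise_floor                  # a posteriori SNR
    diff = np.abs(np.diff(energy, prepend=energy[0]))
    return diff * snr_post

# Toy usage on random "frames" (rows = short-time frames of samples).
gain = np.r_[np.ones(80), 5 * np.ones(60), np.ones(60)][:, None]
frames = np.random.randn(200, 400) * gain
d = snr_weighted_energy_difference(frames)
print((d > d.mean()).sum(), "frames flagged active")  # illustrative threshold
```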
no code implementations • 29 May 2019 • Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Regarding speech intelligibility, we find a general tendency toward a benefit from training the systems with Lombard speech.
no code implementations • 11 May 2019 • Achintya kr. Sarkar, Zheng-Hua Tan, Hao Tang, Suwon Shon, James Glass
There are a number of studies on the extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV).
Automatic Speech Recognition (ASR)
no code implementations • 12 Dec 2018 • Andrea Coifman, Péter Rohoska, Miklas S. Kristoffersen, Sven E. Shepstone, Zheng-Hua Tan
Attention level estimation systems have a high potential in many use cases, such as human-robot interaction, driver modeling and smart home systems, since being able to measure a person's attention level opens the possibility of natural interaction between humans and computers.
no code implementations • 15 Nov 2018 • Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker.
no code implementations • 15 Nov 2018 • Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as Lombard effect.
no code implementations • 30 Jul 2018 • Miklas S. Kristoffersen, Sven E. Shepstone, Zheng-Hua Tan
Home entertainment systems feature in a variety of usage scenarios with one or more simultaneous users, for whom the complexity of choosing media to consume has increased rapidly over the last decade.
no code implementations • 18 Apr 2018 • Ioannis T. Christou, Emmanouil Amolochitis, Zheng-Hua Tan
We present QARMA, an efficient novel parallel algorithm for mining all Quantitative Association Rules in large multidimensional datasets, where items are required to have at least one common attribute to be specified in the rule's single consequent item.
no code implementations • 2 Feb 2018 • Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen
Finally, we show that the proposed SE system performs on par with a traditional DNN based Short-Time Spectral Amplitude (STSA) SE system in terms of estimated speech intelligibility.
Sound • Audio and Speech Processing
no code implementations • 6 Sep 2017 • Daniel Michelsanti, Zheng-Hua Tan
Improving speech system performance in noisy environments remains a challenging task, and speech enhancement (SE) is one of the effective techniques to solve the problem.
no code implementations • 31 Aug 2017 • Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen
We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs).
Sound
no code implementations • 30 May 2017 • Zhanyu Ma, Jing-Hao Xue, Arne Leijon, Zheng-Hua Tan, Zhen Yang, Jun Guo
In this paper, we propose novel strategies for neutral vector variable decorrelation.
no code implementations • 6 Apr 2017 • Achintya Kr. Sarkar, Zheng-Hua Tan
It is well-known that speech signals exhibit quasi-stationary behavior in and only in a short interval, and the TCL method aims to exploit this temporal structure.
3 code implementations • 18 Mar 2017 • Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen
We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on Non-negative Matrix Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network (DANet).
no code implementations • 13 Feb 2017 • Hong Yu, Zheng-Hua Tan, Zhanyu Ma, Jun Guo
In order to improve the reliability of speaker verification systems, we develop a new filter bank based cepstral feature, deep neural network filter bank cepstral coefficients (DNN-FBCC), to distinguish between natural and spoofed speech.
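For orientation, a generic filter-bank-to-cepstra pipeline (log compression followed by a DCT); in DNN-FBCC the filter bank itself is produced by a DNN, which this sketch does not implement:

```python
import numpy as np
from scipy.fftpack import dct

def fbank_to_cepstra(fbank_energies, num_ceps=20):
    """Log compression followed by a DCT over the band axis.

    `fbank_energies` stands in for any (frames x bands) filter-bank
    output; in DNN-FBCC those bands come from a learned filter bank."""
    log_e = np.log(fbank_energies + 1e-10)
    return dct(log_e, type=2, axis=1, norm="ortho")[:, :num_ceps]

ceps = fbank_to_cepstra(np.abs(np.random.randn(100, 40)))
print(ceps.shape)  # (100, 20)
```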
no code implementations • 19 Nov 2016 • A. K. Sarkar, Zheng-Hua Tan
In this paper, we propose pass-phrase dependent background models (PBMs) for text-dependent (TD) speaker verification (SV) to integrate the pass-phrase identification process into the conventional TD-SV system, where a PBM is derived from a text-independent background model through adaptation using the utterances of a particular pass-phrase.
1 code implementation • 1 Jul 2016 • Dong Yu, Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen
We propose a novel deep learning model, which supports permutation invariant training (PIT), for speaker independent multi-talker speech separation, commonly known as the cocktail-party problem.
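A minimal sketch of the PIT objective, using MSE as a placeholder loss over all speaker permutations (the paper's actual loss and architecture differ):

```python
import itertools
import torch

def pit_loss(estimates, targets):
    """Permutation invariant training: evaluate the loss under every
    output-to-speaker permutation and train on the best one, so the
    network need not fix an output assignment in advance.
    `estimates`, `targets`: (batch, speakers, samples)."""
    S = estimates.size(1)
    losses = []
    for perm in itertools.permutations(range(S)):
        mse = ((estimates[:, list(perm)] - targets) ** 2).mean(dim=(1, 2))
        losses.append(mse)
    return torch.stack(losses, dim=1).min(dim=1).values.mean()

est, tgt = torch.randn(4, 2, 8000), torch.randn(4, 2, 8000)
print(pit_loss(est, tgt))
```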