In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performances for audio-visual scene classification (AVSC).
Audio-only-based wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission.
We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.
We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions.
We propose a separation guided speaker diarization (SGSD) approach by fully utilizing a complementarity of speech separation and speaker clustering.
We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC).
We propose using an adversarial autoencoder (AAE) to replace generative adversarial network (GAN) in the private aggregation of teacher ensembles (PATE), a solution for ensuring differential privacy in speech applications.
Ranked #3 on Keyword Spotting on Google Speech Commands (10-keyword Speech Commands dataset metric)
This system description describes our submission system to the Third DIHARD Speech Diarization Challenge.
In this paper, we propose a novel deep learning architecture to improving word-level lip-reading.
1 code implementation • 3 Nov 2020 • Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee
To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed.
Ranked #1 on Acoustic Scene Classification on TAU Urban Acoustic Scenes 2019 (using extra training data)
Testing on the Google Speech Commands Dataset, the proposed QCNN encoder attains a competitive accuracy of 95. 12% in a decentralized model, which is better than the previous architectures using centralized RNN models with convolutional features.
Ranked #1 on Keyword Spotting on Google Speech Commands (10-keyword Speech Commands dataset metric)
We propose a novel noise-aware memory-attention network (NAMAN) for regression-based speech enhancement, aiming at improving quality of enhanced speech in unseen noise conditions.
We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE).
In this paper, we exploit the properties of mean absolute error (MAE) as a loss function for the deep neural network (DNN) based vector-to-vector regression.
In this paper, we show that, in vector-to-vector regression utilizing deep neural networks (DNNs), a generalized loss of mean absolute error (MAE) between the predicted and expected feature vectors is upper bounded by the sum of an approximation error, an estimation error, and an optimization error.
In contrast to building scene models with whole utterances, the ASM-removed sub-utterances, i. e., acoustic utterances without stop acoustic segments, are then used as inputs to the AlexNet-L back-end for final classification.
In this paper, we propose a domain adaptation framework to address the device mismatch issue in acoustic scene classification leveraging upon neural label embedding (NLE) and relational teacher student learning (RTSL).
Finally, our experiments of multi-channel speech enhancement on a simulated noisy WSJ0 corpus demonstrate that our proposed hybrid CNN-TT architecture achieves better results than both DNN and CNN models in terms of better-enhanced speech qualities and smaller parameter sizes.
1 code implementation • 16 Jul 2020 • Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu, Feng Bao, Yuanjun Zhao, Sabato Marco Siniscalchi, Yannan Wang, Jun Du, Chin-Hui Lee
On Task 1b development data set, we achieve an accuracy of 96. 7\% with a model size smaller than 500KB.
We propose a novel neural label embedding (NLE) scheme for the domain adaptation of a deep neural network (DNN) acoustic model with unpaired data samples from source and target domains.
Recent studies have highlighted adversarial examples as ubiquitous threats to the deep neural network (DNN) based speech recognition systems.
Recent deep neural networks based techniques, especially those equipped with the ability of self-adaptation in the system level such as deep reinforcement learning (DRL), are shown to possess many advantages of optimizing robot learning systems (e. g., autonomous navigation and continuous robot arm control.)
Finally, in 8-channel conditions, a PESQ of 3. 12 is achieved using 20 million parameters for TTN, whereas a DNN with 68 million parameters can only attain a PESQ of 3. 06.
The Tensor-Train factorization (TTF) is an efficient way to compress large weight matrices of fully-connected layers and recurrent layers in recurrent neural networks (RNNs).
By incorporating prior knowledge of speech signals into the design of model structures, we build a model that is more data-efficient and achieves better generalization on both seen and unseen noise.
We propose a multi-objective framework to learn both secondary targets not directly related to the intended task of speech enhancement (SE) and the primary target of the clean log-power spectra (LPS) features to be used directly for constructing the enhanced speech signals.
A Song-On-Demand task, with a total of 38117 songs and 12 attributes corresponding to each song, is used to test the performance of the proposed approach.
We present a Bayesian approach to adapting parameters of a well-trained context-dependent, deep-neural-network, hidden Markov model (CD-DNN-HMM) to improve automatic speech recognition performance.