Search Results for author: Eng Siong Chng

Found 76 papers, 32 papers with code

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

no code implementations2 Jul 2024 Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang

By leveraging reverse inference as the standard to select exemplars used in RLHF from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization towards a direction of enhancing the TTS robustness.

Inference Optimization Speech Synthesis +1

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

1 code implementation25 Jun 2024 Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts.

Synthetic Speech Detection

Towards Audio Codec-based Speech Separation

1 code implementation18 Jun 2024 Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task.

Edge-computing Speech Separation

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

no code implementations2 Jun 2024 Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang

In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers.

Speech Synthesis Text-To-Speech Synthesis

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

1 code implementation23 May 2024 Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

no code implementations16 May 2024 Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Aligning Speech to Languages to Enhance Code-switching Speech Recognition

no code implementations9 Mar 2024 Hexin Liu, Xiangyu Zhang, Leibny Paola Garcia, Andy W. H. Khong, Eng Siong Chng, Shinji Watanabe

Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14. 1% and 5. 5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

1 code implementation10 Feb 2024 Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result.

Machine Translation Translation

ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

no code implementations7 Jan 2024 He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, BinBin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li

To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Adapting OpenAI's Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets

no code implementations29 Nov 2023 Yuhang Yang, Yizhou Peng, Xionghu Zhong, Hao Huang, Eng Siong Chng

The Mixed Error Rate results show that the amount of adaptation data may be as low as $1\sim10$ hours to achieve saturation in performance gain (SEAME) while the ASRU task continued to show performance with more adaptation data ($>$100 hours).

speech-recognition Speech Recognition

HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

1 code implementation NeurIPS 2023 Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Macro Siniscalchi, Pin-Yu Chen, Eng Siong Chng

We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

SPGM: Prioritizing Local Features for enhanced speech separation performance

1 code implementation22 Sep 2023 Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma

Dual-path is a popular architecture for speech separation models (e. g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships.

Speech Separation

Codec Data Augmentation for Time-domain Heart Sound Classification

no code implementations14 Sep 2023 Ansh Mishra, Jia Qi Yip, Eng Siong Chng

In this work, we propose a simple time domain approach, to the heart sound classification problem with a base classification error rate of 0. 8 and show that augmentation of the data through codec simulation can improve the classification error rate to 0. 2.

Classification Data Augmentation +1

Noise-aware Speech Enhancement using Diffusion Probabilistic Model

1 code implementation16 Jul 2023 Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

In this paper, we propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process in diffusion model.

Denoising Multi-Task Learning +2

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

1 code implementation18 Jun 2023 Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng

In this work, we investigate the noise-invariant visual modality to strengthen robustness of AVSR, which can adapt to any testing noises while without dependence on noisy training data, a. k. a., unsupervised noise adaptation.

Audio-Visual Speech Recognition speech-recognition +1

A Neural State-Space Model Approach to Efficient Speech Separation

1 code implementation26 May 2023 Chen Chen, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng

In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM).

Representation Learning Speech Separation

ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

1 code implementation20 May 2023 Jia Qi Yip, Tuan Truong, Dianwen Ng, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling.

Speaker Verification

Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

1 code implementation16 May 2023 Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng

However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for downstream speech recognition task.

Audio-Visual Speech Recognition Automatic Speech Recognition +3

UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

1 code implementation16 May 2023 Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, Eng Siong Chng

Multimodal learning aims to imitate human beings to acquire complementary information from multiple modalities for various downstream tasks.

Contrastive Learning Image-text Classification +2

Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR

no code implementations11 Apr 2023 Yuchen Hu, Chen Chen, Qiushi Zhu, Eng Siong Chng

Second, during finetuning we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations, which enables discovery and restoration of high-quality clean representations with reduced distortions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Unsupervised Noise adaptation using Data Simulation

no code implementations23 Feb 2023 Chen Chen, Yuchen Hu, Heqing Zou, Linhui Sun, Eng Siong Chng

Deep neural network based speech enhancement approaches aim to learn a noisy-to-clean transformation using a supervised learning paradigm.

Domain Adaptation Generative Adversarial Network +1

Metric-oriented Speech Enhancement using Diffusion Probabilistic Model

no code implementations23 Feb 2023 Chen Chen, Yuchen Hu, Weiwei Weng, Eng Siong Chng

Deep neural network based speech enhancement technique focuses on learning a noisy-to-clean transformation supervised by paired training data.

Speech Enhancement

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

1 code implementation22 Feb 2023 Yuchen Hu, Chen Chen, Heqing Zou, Xionghu Zhong, Eng Siong Chng

To alleviate this problem, we propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.

Multi-Task Learning Speech Enhancement +2

Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition

1 code implementation22 Feb 2023 Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

In this paper, we propose a simple yet effective approach called gradient remedy (GR) to solve interference between task gradients in noise-robust speech recognition, from perspectives of both angle and magnitude.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Probabilistic Back-ends for Online Speaker Recognition and Clustering

1 code implementation19 Feb 2023 Alexey Sholokhov, Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng

This paper focuses on multi-enrollment speaker recognition which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario.

Clustering Online Clustering +1

Improving Spoken Language Identification with Map-Mix

1 code implementation16 Feb 2023 Shangeth Rajaa, Kriti Anandan, Swaraj Dalmia, Tarun Gupta, Eng Siong Chng

The pre-trained multi-lingual XLSR model generalizes well for language identification after fine-tuning on unseen languages.

Data Augmentation Language Identification +1

Speech-text based multi-modal training with bidirectional attention for improved speech recognition

1 code implementation1 Nov 2022 Yuhang Yang, HaiHua Xu, Hao Huang, Eng Siong Chng, Sheng Li

To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and language (aka text data); 2) the homogeneity of the learned representations from two encoders.

speech-recognition Speech Recognition

Amino Acid Classification in 2D NMR Spectra via Acoustic Signal Embeddings

no code implementations1 Aug 2022 Jia Qi Yip, Dianwen Ng, Bin Ma, Konstantin Pervushin, Eng Siong Chng

Nuclear Magnetic Resonance (NMR) is used in structural biology to experimentally determine the structure of proteins, which is used in many areas of biology and is an important part of drug development.

Speaker Verification

Continual Learning For On-Device Environmental Sound Classification

1 code implementation15 Jul 2022 Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang

Experimental results on the DCASE 2019 Task 1 and ESC-50 dataset show that our proposed method outperforms baseline continual learning methods on classification accuracy and computational efficiency, indicating our method can efficiently and incrementally learn new classes without the catastrophic forgetting problem for on-device environmental sound classification.

Classification Computational Efficiency +3

Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

no code implementations9 Jul 2022 Yizhou Peng, Yufei Liu, Jicheng Zhang, HaiHua Xu, Yi He, Hao Huang, Eng Siong Chng

More importantly, we train an end-to-end (E2E) speech recognition model by means of merging two monolingual data sets and observe the efficacy of the proposed ILME-based LM fusion for CSSR.

Language Modelling speech-recognition +1

Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

no code implementations9 Jul 2022 Jicheng Zhang, Yizhou Peng, HaiHua Xu, Yi He, Eng Siong Chng, Hao Huang

Intermediate layer output (ILO) regularization by means of multitask training on encoder side has been shown to be an effective approach to yielding improved results on a wide range of end-to-end ASR frameworks.

Decoder speech-recognition +1

Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss

no code implementations29 Jun 2022 Andrew Koh, Eng Siong Chng

In this paper, we tackle the new Language-Based Audio Retrieval task proposed in DCASE 2022.

Retrieval

Self-critical Sequence Training for Automatic Speech Recognition

no code implementations13 Apr 2022 Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng

Although automatic speech recognition (ASR) task has gained remarkable success by sequence-to-sequence models, there are two main mismatches between its training and testing that might lead to performance degradation: 1) The typically used cross-entropy criterion aims to maximize log-likelihood of the training data, while the performance is evaluated by word error rate (WER), not log-likelihood; 2) The teacher-forcing method leads to the dependence on ground truth during training, which means that model has never been exposed to its own prediction before testing.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Rainbow Keywords: Efficient Incremental Learning for Online Spoken Keyword Spotting

1 code implementation30 Mar 2022 Yang Xiao, Nana Hou, Eng Siong Chng

Catastrophic forgetting is a thorny challenge when updating keyword spotting (KWS) models after deployment.

Data Augmentation Diversity +4

Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning

no code implementations29 Mar 2022 Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng

Automated Audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio.

Audio captioning Contrastive Learning

Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data

no code implementations29 Mar 2022 Chen Chen, Nana Hou, Yuchen Hu, Shashank Shirol, Eng Siong Chng

Noise-robust speech recognition systems require large amounts of training data including noisy speech data and corresponding transcripts to achieve state-of-the-art performances in face of various practical environments.

Generative Adversarial Network Robust Speech Recognition +1

Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information

1 code implementation29 Mar 2022 Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, Eng Siong Chng

In this paper, we propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.

Speech Emotion Recognition

Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition

1 code implementation28 Mar 2022 Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng

Then, we propose style learning to map the fused feature close to clean feature, in order to learn latent speech information from the latter, i. e., clean "speech style".

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

L-SpEx: Localized Target Speaker Extraction

1 code implementation21 Feb 2022 Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance.

Target Speaker Extraction

A Unified Speaker Adaptation Approach for ASR

1 code implementation EMNLP 2021 Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture, which to the best of our knowledge, has never been explored in ASR.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition

2 code implementations11 Oct 2021 Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng

Speech enhancement (SE) aims to suppress the additive noise from a noisy speech signal to improve the speech's perceptual quality and intelligibility.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Minimum word error training for non-autoregressive Transformer-based code-switching ASR

no code implementations7 Oct 2021 Yizhou Peng, Jicheng Zhang, HaiHua Xu, Hao Huang, Eng Siong Chng

Non-autoregressive end-to-end ASR framework might be potentially appropriate for code-switching recognition task thanks to its inherent property that present output token being independent of historical ones.

Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization

no code implementations10 Aug 2021 Andrew Koh, Fuzhao Xue, Eng Siong Chng

In this paper, we examine the use of Transfer Learning using Pretrained Audio Neural Networks (PANNs), and propose an architecture that is able to better leverage the acoustic features provided by PANNs for the Automated Audio Captioning Task.

Audio captioning Decoder +1

E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

no code implementations15 Jun 2021 Jicheng Zhang, Yizhou Peng, Pham Van Tung, HaiHua Xu, Hao Huang, Eng Siong Chng

In this paper, we propose a single multi-task learning framework to perform End-to-End (E2E) speech recognition (ASR) and accent recognition (AR) simultaneously.

Multi-Task Learning speech-recognition +1

End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN

no code implementations13 Jan 2021 Manav Kaushik, Van Tung Pham, Eng Siong Chng

In this work, we propose a novel approach of using attention mechanism to build an end-to-end architecture for height and age estimation.

Age Estimation Multi-Task Learning

GDPNet: Refining Latent Multi-View Graph for Relation Extraction

1 code implementation12 Dec 2020 Fuzhao Xue, Aixin Sun, Hao Zhang, Eng Siong Chng

Recent advances on RE task are from BERT-based sequence modeling and graph-based modeling of relationships among the tokens in the sequence.

Ranked #4 on Dialog Relation Extraction on DialogRE (F1c (v1) metric)

Dialog Relation Extraction Dynamic Time Warping +2

Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

no code implementations22 Oct 2020 Yizhou Peng, Jicheng Zhang, Haobo Zhang, HaiHua Xu, Hao Huang, Eng Siong Chng

Experimental results on an 8-accent English speech recognition show both methods can yield WERs close to the conventional ASR systems that completely ignore the accent, as well as desired AR accuracy.

speech-recognition Speech Recognition +1

Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

no code implementations18 May 2020 Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Hai-Hua Xu, Hao Huang, Eng Siong Chng

In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NE) in hybrid ASR systems without compromising overall word error rate performance.

Language Modelling

SpEx+: A Complete Time Domain Speaker Extraction Network

no code implementations10 May 2020 Meng Ge, Cheng-Lin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

To eliminate such mismatch, we propose a complete time-domain speaker extraction solution, that is called SpEx+.

Speech Extraction Audio and Speech Processing Sound

Time-domain speaker extraction network

no code implementations29 Apr 2020 Cheng-Lin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

The inaccuracy of phase estimation is inherent to the frequency domain processing, that affects the quality of signal reconstruction.

Audio and Speech Processing Sound

SpEx: Multi-Scale Time Domain Speaker Extraction Network

1 code implementation17 Apr 2020 Cheng-Lin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.

Decoder Multi-Task Learning

Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

1 code implementation8 Apr 2019 Yerbolat Khassanov, Zhiping Zeng, Van Tung Pham, Hai-Hua Xu, Eng Siong Chng

However, learning the representation of rare words is a challenging problem causing the NLM to produce unreliable probability estimates.

speech-recognition Speech Recognition

On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

1 code implementation1 Nov 2018 Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Hai-Hua Xu, Eng Siong Chng, Haizhou Li

Code-switching (CS) refers to a linguistic phenomenon where a speaker uses different languages in an utterance or between alternating utterances.

Data Augmentation Language Identification +3

Unsupervised and Efficient Vocabulary Expansion for Recurrent Neural Network Language Models in ASR

no code implementations27 Jun 2018 Yerbolat Khassanov, Eng Siong Chng

Additionally, we propose to generate the list of OOS words to expand vocabulary in unsupervised manner by automatically extracting them from ASR output.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Study of Semi-supervised Approaches to Improving English-Mandarin Code-Switching Speech Recognition

no code implementations16 Jun 2018 Pengcheng Guo, Hai-Hua Xu, Lei Xie, Eng Siong Chng

In this paper, we present our overall efforts to improve the performance of a code-switching speech recognition system using semi-supervised training methods from lexicon learning to acoustic modeling, on the South East Asian Mandarin-English (SEAME) data.

speech-recognition Speech Recognition

Spoofing detection under noisy conditions: a preliminary investigation and an initial database

no code implementations9 Feb 2016 Xiaohai Tian, Zhizheng Wu, Xiong Xiao, Eng Siong Chng, Haizhou Li

To simulate the real-life scenarios, we perform a preliminary investigation of spoofing detection under additive noisy conditions, and also describe an initial database for this task.

Speaker Verification

Cannot find the paper you are looking for? You can Submit a new open access paper.