no code implementations • 5 Feb 2025 • Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, Eng Siong Chng, Lei Xie
To improve the stability of language model predictions, we propose a hierarchical modeling method that decouples the generation of clean semantic tokens and clean acoustic tokens into two distinct stages.
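For intuition, a minimal sketch of such two-stage decoupled generation, with toy untrained models and arbitrary vocabulary sizes (the module names and the conditioning scheme below are illustrative assumptions, not the paper's architecture):

```python
# Hedged sketch: decoupled two-stage token generation. Stage 1 predicts
# semantic tokens; stage 2 predicts acoustic tokens conditioned on them.
import torch
import torch.nn as nn

SEM_VOCAB, ACU_VOCAB, DIM = 256, 1024, 64

class TinyLM(nn.Module):
    """A one-layer Transformer LM over a discrete token vocabulary."""
    def __init__(self, vocab, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.head(self.body(self.emb(tokens)))

@torch.no_grad()
def generate(model, prefix, steps):
    """Greedy autoregressive decoding from a prefix of token ids."""
    seq = prefix
    for _ in range(steps):
        logits = model(seq)[:, -1]           # next-token logits
        seq = torch.cat([seq, logits.argmax(-1, keepdim=True)], dim=1)
    return seq

semantic_lm = TinyLM(SEM_VOCAB, DIM)   # stage 1: clean semantic tokens
acoustic_lm = TinyLM(ACU_VOCAB, DIM)   # stage 2: clean acoustic tokens

noisy_prompt = torch.randint(0, SEM_VOCAB, (1, 8))   # stand-in for noisy input
semantic = generate(semantic_lm, noisy_prompt, steps=16)
# Stage 2 conditions on stage-1 output (here: reuse ids as the prefix).
acoustic = generate(acoustic_lm, semantic % ACU_VOCAB, steps=16)
print(semantic.shape, acoustic.shape)
```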
no code implementations • 31 Jan 2025 • Yuchen Hu, Xi Chen, Weidong Liu, Xiaojun Mao
Distributed stochastic optimization algorithms can simultaneously process large-scale datasets, significantly accelerating model training.
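As background, a toy simulation of the synchronous data-parallel pattern such algorithms build on; the least-squares objective and the four simulated workers are assumptions for illustration:

```python
# Toy synchronous data-parallel SGD: each "worker" computes a gradient
# on its own shard, and the shards' gradients are averaged each step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

shards = np.array_split(np.arange(1000), 4)   # 4 simulated workers
w, lr = np.zeros(5), 0.1
for step in range(200):
    grads = []
    for idx in shards:                        # in practice: in parallel
        Xi, yi = X[idx], y[idx]
        grads.append(2 * Xi.T @ (Xi @ w - yi) / len(idx))
    w -= lr * np.mean(grads, axis=0)          # aggregate, then update
print("error:", np.linalg.norm(w - w_true))
```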
no code implementations • 27 Jan 2025 • Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, Eng Siong Chng
Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks.
no code implementations • 23 Dec 2024 • Haoyang Li, Yuchen Hu, Chen Chen, Eng Siong Chng
High-fidelity speech enhancement often requires sophisticated modeling to capture intricate, multiscale patterns.
no code implementations • 23 Sep 2024 • Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu
With recent advances in AIGC, video generation has gained a surge of research interest in both academia and industry (e.g., Sora).
no code implementations • 15 Sep 2024 • Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model.
Automatic Speech Recognition (ASR) • +5
1 code implementation • 11 Sep 2024 • Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis.
no code implementations • 7 Aug 2024 • Yuchen Dong, XiaoXiang Fang, Yuchen Hu, Renshuang Jiang, Zhe Jiang
Comparative experiments with SheetCopilot have demonstrated that the accumulation and recycling of task memories lead to a steady enhancement in task success rate, with an improvement rate of approximately 3%-6% per round in this implementation example.
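A minimal sketch of the accumulate-and-recycle idea (lexical-similarity retrieval over stored instruction/solution pairs; the storage format and scoring below are assumptions, not the paper's design):

```python
# Toy task memory: store solved (instruction, solution) pairs and
# recycle the most similar past experiences when a new task arrives.
from difflib import SequenceMatcher

memory = []  # list of (instruction, solution) tuples

def remember(instruction, solution):
    memory.append((instruction, solution))

def recall(instruction, k=2):
    """Return the k stored experiences most similar to the new task."""
    scored = sorted(
        memory,
        key=lambda m: SequenceMatcher(None, instruction, m[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

remember("sum column B of sheet1", "=SUM(Sheet1!B:B)")
remember("average column C of sheet2", "=AVERAGE(Sheet2!C:C)")
print(recall("sum column D of sheet3"))
```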
no code implementations • 2 Jul 2024 • Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang
By leveraging reverse inference as the criterion for selecting exemplars used in RLHF from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization toward enhancing TTS robustness.
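A hedged sketch of that selection step, approximating reverse inference by round-trip word error rate through a stubbed ASR; `synthesize`, `transcribe`, and the WER criterion are all illustrative placeholders:

```python
# Sketch: pick chosen/rejected exemplars for preference optimization by
# "reverse inference": transcribe each synthesized sample and score how
# well the transcript recovers the input text (lower WER = better).

def wer(ref, hyp):
    """Word error rate via edit distance over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (r[i-1] != h[j-1]))
    return d[-1][-1] / max(len(r), 1)

def synthesize(text, n):        # placeholder for the TTS system
    return [f"audio_{i}" for i in range(n)]

def transcribe(audio):          # placeholder for the ASR model
    return {"audio_0": "the cat sat", "audio_1": "the cat hat",
            "audio_2": "a cat sat on it"}[audio]

text = "the cat sat"
samples = synthesize(text, 3)
scored = sorted(samples, key=lambda a: wer(text, transcribe(a)))
chosen, rejected = scored[0], scored[-1]   # preference pair for RLHF/DPO
print(chosen, rejected)
```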
no code implementations • 20 Jun 2024 • Ruohan Zhan, Shichao Han, Yuchen Hu, Zhenling Jiang
We show that the proposed estimator yields results comparable to the benchmark, whereas the standard difference-in-means estimator can exhibit significant bias and even produce reversed signs.
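For intuition, a tiny simulation of how interference biases difference-in-means: treated units draw from a shared demand pool, so their gain comes partly at control units' expense (this toy market is only for illustration, not the paper's model):

```python
# Toy marketplace: a fixed pool of demand is shared by all sellers.
# Treatment makes a seller more attractive, which *takes* demand from
# control sellers, so difference-in-means overstates the global effect.
import numpy as np

rng = np.random.default_rng(1)
n, demand = 1000, 1000.0

def outcomes(treated_mask, lift=0.5):
    weight = 1.0 + lift * treated_mask       # attractiveness per seller
    return demand * weight / weight.sum()    # shared-pool allocation

treated = rng.random(n) < 0.5
y = outcomes(treated.astype(float))
dim_est = y[treated].mean() - y[~treated].mean()

# Global treatment effect: everyone treated vs. no one treated.
gte = outcomes(np.ones(n)).mean() - outcomes(np.zeros(n)).mean()
print(f"difference-in-means: {dim_est:.4f}, true global effect: {gte:.4f}")
```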
no code implementations • 2 Jun 2024 • Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers.
1 code implementation • 23 May 2024 • Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang
We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents.
Automatic Speech Recognition (ASR) • +1
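A generic self-training loop in the spirit of such unsupervised adaptation: pseudo-label unlabeled target-domain audio, keep only confident transcripts, and fine-tune. The stubbed model, confidence scores, and threshold are assumptions, not STAR's actual quality indicator:

```python
# Sketch: adapt on unlabeled target-domain audio by keeping only
# confident pseudo-labels, then fine-tuning on the kept pairs.

def asr_model(audio):
    """Stub: returns (hypothesis, confidence in [0, 1])."""
    return {"clip_a": ("turn left here", 0.92),
            "clip_b": ("mumble mumble", 0.31),
            "clip_c": ("stop the car", 0.88)}[audio]

def finetune(pairs):            # placeholder for an actual update step
    print(f"fine-tuning on {len(pairs)} pseudo-labeled pairs")

unlabeled = ["clip_a", "clip_b", "clip_c"]
THRESHOLD = 0.8                 # assumed confidence cutoff

pseudo = []
for clip in unlabeled:
    hyp, conf = asr_model(clip)
    if conf >= THRESHOLD:       # discard low-confidence transcripts
        pseudo.append((clip, hyp))
finetune(pseudo)
```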
no code implementations • 16 May 2024 • Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses.
Automatic Speech Recognition (ASR) • +3
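A minimal sketch of N-best prompting for GER; the prompt wording and the stubbed `llm` call are placeholders rather than the paper's template:

```python
# Sketch: format an ASR N-best list into a generative error
# correction prompt and ask an LLM for the corrected transcription.

def llm(prompt):                # placeholder for a real LLM call
    return "i want to recognize speech"

def ger_prompt(nbest):
    lines = "\n".join(f"{i+1}. {h}" for i, h in enumerate(nbest))
    return ("Below are the N-best hypotheses from a speech recognizer.\n"
            "Infer the most likely true transcription.\n"
            f"{lines}\nTranscription:")

nbest = ["i want to wreck a nice beach",
         "i want to recognize speech",
         "i want to reckon eyes peach"]
print(llm(ger_prompt(nbest)))
```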
no code implementations • 16 May 2024 • Chen Chen, Ruizhe Li, Yuchen Hu, YuanYuan Chen, Chengwei Qin, Qiang Zhang
Experimental results show that HESIT effectively alleviates catastrophic forgetting by exemplar selection, and achieves state-of-the-art performance on the largest CL benchmark of ToDs in terms of all metrics.
no code implementations • 19 Apr 2024 • Chengwei Qin, Wenhan Xia, Tan Wang, Fangkai Jiao, Yuchen Hu, Bosheng Ding, Ruirui Chen, Shafiq Joty
One key finding in psychology is that compared with irrelevant past experiences, recalling relevant ones can help humans better handle new tasks.
1 code implementation • 10 Feb 2024 • Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng
Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result.
Ranked #1 on Machine Translation on FLoRes-200
1 code implementation • 8 Feb 2024 • Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng, Chao-Han Huck Yang
Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output.
Ranked #4 on Speech Recognition on WSJ eval92 (using extra training data)
Audio-Visual Speech Recognition • Automatic Speech Recognition • +3
1 code implementation • 19 Jan 2024 • Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, Eng Siong Chng
To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER.
Automatic Speech Recognition (ASR) • +6
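One cheap proxy for that idea: noisier speech tends to produce a more diverse N-best list, so list-level disagreement can act as a language-space noise signal. The Jaccard-based feature below is an illustrative stand-in for a learned embedding:

```python
# Sketch: summarize N-best disagreement as a scalar "noise" feature.
# High disagreement between hypotheses suggests noisier source speech.
from itertools import combinations

def disagreement(nbest):
    """Mean pairwise Jaccard distance between hypothesis word sets."""
    def jd(a, b):
        sa, sb = set(a.split()), set(b.split())
        return 1 - len(sa & sb) / len(sa | sb)
    pairs = list(combinations(nbest, 2))
    return sum(jd(a, b) for a, b in pairs) / len(pairs)

clean_list = ["turn on the light", "turn on the light", "turn on a light"]
noisy_list = ["turn on the light", "ten of the night", "burn all delight"]
print(disagreement(clean_list), disagreement(noisy_list))
```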
1 code implementation • 7 Jan 2024 • Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, LiRong Dai
Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose AV-wav2vec2, a multi-channel multi-modal speech self-supervised learning framework that utilizes video and multi-channel audio data as inputs.
Audio-Visual Speech Recognition • Automatic Speech Recognition • +7
no code implementations • 28 Dec 2023 • Chengwei Qin, Wenhan Xia, Fangkai Jiao, Chen Chen, Yuchen Hu, Bosheng Ding, Shafiq Joty
Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL).
no code implementations • 17 Oct 2023 • Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng
In this work, we propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
Automatic Speech Recognition (ASR) • +3
1 code implementation • NeurIPS 2023 • Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng
We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
Ranked #1 on Speech Recognition on TED-LIUM
Automatic Speech Recognition (ASR) • +3
no code implementations • 28 Aug 2023 • Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, LiRong Dai, Jie Zhang
Noise-robust TTS models are often trained on enhanced speech, which still suffers from speech distortion and residual background noise that degrade the quality of the synthesized speech.
1 code implementation • 16 Jul 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng
In this paper, we propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process in the diffusion model.
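A hedged sketch of conditioning a denoiser on a noise-type embedding (FiLM-style modulation in a toy network; the paper's actual conditioning mechanism may differ):

```python
# Sketch: a toy denoising module whose network is conditioned on a
# noise-type embedding, so the reverse process "knows" the noise.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, dim=80, n_noise_types=8):
        super().__init__()
        self.noise_emb = nn.Embedding(n_noise_types, dim)
        self.film = nn.Linear(dim, 2 * dim)     # scale and shift
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, x, noise_type):
        scale, shift = self.film(self.noise_emb(noise_type)).chunk(2, -1)
        return self.net(x) * (1 + scale) + shift   # FiLM conditioning

model = ConditionedDenoiser()
x = torch.randn(4, 80)                 # noisy feature frames
noise_type = torch.tensor([0, 1, 2, 3])
print(model(x, noise_type).shape)      # torch.Size([4, 80])
```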
1 code implementation • 18 Jun 2023 • Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng
In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any testing noise without depending on noisy training data, a.k.a. unsupervised noise adaptation.
1 code implementation • 18 Jun 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng
In this paper, we aim to learn the shared representations across modalities to bridge their gap.
1 code implementation • 26 May 2023 • Chen Chen, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng
In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM).
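The primitive underlying such layers is a discrete linear state-space recurrence; a minimal numpy version with a random, roughly stable transition matrix (not S4M's parameterization):

```python
# Minimal discrete state-space recurrence: x_k = A x_{k-1} + B u_k,
# y_k = C x_k. SSM layers like S4 learn structured versions of A, B, C.
import numpy as np

rng = np.random.default_rng(0)
state, inp = 16, 1
A = 0.9 * np.eye(state) + 0.01 * rng.normal(size=(state, state))  # stable-ish
B = rng.normal(size=(state, inp))
C = rng.normal(size=(1, state))

u = rng.normal(size=(100, inp))        # input sequence (e.g., audio frames)
x = np.zeros(state)
ys = []
for u_k in u:                           # O(L) sequential scan
    x = A @ x + B @ u_k
    ys.append(C @ x)
print(np.array(ys).shape)               # (100, 1)
```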
no code implementations • 23 May 2023 • Qiushi Zhu, Xiaoying Zhao, Jie Zhang, Yu Gu, Chao Weng, Yuchen Hu
Recently, many efforts have been made to explore how the brain processes speech using electroencephalographic (EEG) signals, where deep learning-based approaches were shown to be applicable in this field.
1 code implementation • 16 May 2023 • Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, Eng Siong Chng
Multimodal learning aims to imitate human beings to acquire complementary information from multiple modalities for various downstream tasks.
1 code implementation • 16 May 2023 • Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng
However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task.
Audio-Visual Speech Recognition • Automatic Speech Recognition • +3
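A hedged sketch contrasting plain concatenation with one explicit cross-modal interaction (a single cross-attention block in PyTorch; the sizes and residual fusion are arbitrary choices, not the paper's exact module):

```python
# Sketch: fuse audio and visual features by letting audio queries
# attend over visual keys/values, instead of plain concatenation.
import torch
import torch.nn as nn

dim, T = 64, 50
audio = torch.randn(1, T, dim)          # audio frame features
visual = torch.randn(1, T, dim)         # lip-region features

# Baseline: concatenation, no interaction between modalities.
concat_fused = torch.cat([audio, visual], dim=-1)   # (1, T, 2*dim)

# Explicit interaction: audio attends to visual evidence.
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
cross, _ = attn(query=audio, key=visual, value=visual)
fused = audio + cross                   # residual cross-modal fusion
print(concat_fused.shape, fused.shape)
```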
no code implementations • 11 Apr 2023 • Yuchen Hu, Chen Chen, Qiushi Zhu, Eng Siong Chng
Second, during fine-tuning we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations, which enables discovery and restoration of high-quality clean representations with reduced distortions.
Automatic Speech Recognition (ASR) • +3
no code implementations • 23 Feb 2023 • Chen Chen, Yuchen Hu, Weiwei Weng, Eng Siong Chng
Deep neural network based speech enhancement techniques focus on learning a noisy-to-clean transformation supervised by paired training data.
no code implementations • 23 Feb 2023 • Chen Chen, Yuchen Hu, Heqing Zou, Linhui Sun, Eng Siong Chng
Deep neural network based speech enhancement approaches aim to learn a noisy-to-clean transformation using a supervised learning paradigm.
1 code implementation • 22 Feb 2023 • Yuchen Hu, Chen Chen, Heqing Zou, Xionghu Zhong, Eng Siong Chng
To alleviate this problem, we propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
1 code implementation • 22 Feb 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng
In this paper, we propose a simple yet effective approach called gradient remedy (GR) to solve interference between task gradients in noise-robust speech recognition, from the perspectives of both angle and magnitude.
Automatic Speech Recognition (ASR) • +4
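For intuition, a PCGrad-style sketch that resolves conflict by angle (project out the conflicting component) and magnitude (rescale the auxiliary gradient); this is a generic recipe, not necessarily GR's exact formulation:

```python
# Sketch: if two task gradients conflict (negative dot product), project
# one onto the normal plane of the other, then rescale magnitudes.
import numpy as np

def remedy(g_main, g_aux):
    dot = g_main @ g_aux
    if dot < 0:                          # angle: remove conflicting part
        g_aux = g_aux - dot / (g_main @ g_main) * g_main
    norm = np.linalg.norm
    if norm(g_aux) > norm(g_main):       # magnitude: don't let aux dominate
        g_aux = g_aux * norm(g_main) / norm(g_aux)
    return g_main + g_aux

g_asr = np.array([1.0, 0.0])             # main-task gradient
g_enh = np.array([-0.5, 2.0])            # conflicting auxiliary gradient
print(remedy(g_asr, g_enh))              # -> [1. 1.]
```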
no code implementations • 10 Dec 2022 • Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng
Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition.
no code implementations • 24 Jun 2022 • Leilei Cao, Zhuang Li, Bo Yan, Feng Zhang, Fengliang Qi, Yuchen Hu, Hongbin Wang
The referring video object segmentation (RVOS) task aims to segment, across all video frames, the object instances in a given video that are referred to by a language expression.
no code implementations • 13 Apr 2022 • Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng
Although the automatic speech recognition (ASR) task has achieved remarkable success with sequence-to-sequence models, there are two main mismatches between its training and testing that might lead to performance degradation: 1) the typically used cross-entropy criterion aims to maximize the log-likelihood of the training data, while performance is evaluated by word error rate (WER), not log-likelihood; 2) the teacher-forcing method creates a dependence on ground truth during training, which means the model has never been exposed to its own predictions before testing.
Automatic Speech Recognition (ASR) • +2
no code implementations • 29 Mar 2022 • Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng
Automated audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio.
no code implementations • 29 Mar 2022 • Chen Chen, Nana Hou, Yuchen Hu, Shashank Shirol, Eng Siong Chng
Noise-robust speech recognition systems require large amounts of training data, including noisy speech data and corresponding transcripts, to achieve state-of-the-art performance in the face of various practical environments.
1 code implementation • 28 Mar 2022 • Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng
Then, we propose style learning to map the fused feature close to the clean feature, in order to learn latent speech information from the latter, i.e., the clean "speech style".
Automatic Speech Recognition (ASR) • +3
no code implementations • 24 Oct 2021 • Yuchen Hu, Stefan Wager
We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP).
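As background, the standard sequential importance-sampling estimator that OPE work of this kind builds on (a two-step toy with known behavior and target policies; the POMDP-specific corrections from the paper are not shown):

```python
# Toy off-policy evaluation: reweight logged trajectories by the
# product of per-step probability ratios (target / behavior).
import numpy as np

rng = np.random.default_rng(0)
n, horizon = 5000, 2
p_behavior, p_target = 0.5, 0.8       # P(action = 1) at every step

values = []
for _ in range(n):
    ratio, reward = 1.0, 0.0
    for t in range(horizon):
        a = rng.random() < p_behavior           # logged action
        ratio *= ((p_target if a else 1 - p_target)
                  / (p_behavior if a else 1 - p_behavior))
        reward += float(a)                      # reward favors action 1
    values.append(ratio * reward)

print("IS estimate:", np.mean(values))          # ~ horizon * p_target = 1.6
```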
2 code implementations • 11 Oct 2021 • Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng
Speech enhancement (SE) aims to suppress the additive noise from a noisy speech signal to improve the speech's perceptual quality and intelligibility.
Automatic Speech Recognition (ASR) • +3
no code implementations • ACL (IWSLT) 2021 • Dan Liu, Mengge Du, Xiaoxi Li, Yuchen Hu, LiRong Dai
This paper describes USTC-NELSLIP's submissions to the IWSLT2021 Simultaneous Speech Translation task.