Search Results for author: Yuchen Hu

Found 44 papers, 17 papers with code

GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling

no code implementations • 5 Feb 2025 • Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, EngSiong Chng, Lei Xie

To improve the stability of language model predictions, we propose a hierarchical modeling method that decouples the generation of clean semantic tokens and clean acoustic tokens into two distinct stages.

Language Modeling Language Modelling +1
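
A minimal sketch of the two-stage hierarchy described in the abstract, assuming toy GRU token models in place of the paper's language models; the vocabularies, shapes, and greedy decoding loop below are illustrative only, not GenSE's actual architecture:

```python
# Hypothetical sketch of GenSE's two-stage hierarchy: stage 1 maps noisy
# semantic tokens to clean semantic tokens, stage 2 maps clean semantic
# tokens to clean acoustic tokens. All model details are placeholders.
import torch
import torch.nn as nn

class ToyTokenLM(nn.Module):
    """A tiny conditional token language model (placeholder architecture)."""
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Summarize the conditioning sequence, then decode autoregressively.
        _, h = self.rnn(self.src_emb(src))
        out, _ = self.rnn(self.tgt_emb(tgt_in), h)
        return self.head(out)

    @torch.no_grad()
    def generate(self, src, bos, steps):
        tokens = torch.full((src.size(0), 1), bos, dtype=torch.long)
        for _ in range(steps):
            logits = self.forward(src, tokens)
            tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
        return tokens[:, 1:]

SEM_VOCAB, AC_VOCAB, BOS = 100, 500, 0
stage1 = ToyTokenLM(SEM_VOCAB, SEM_VOCAB)   # noisy semantic -> clean semantic
stage2 = ToyTokenLM(SEM_VOCAB, AC_VOCAB)    # clean semantic -> clean acoustic

noisy_semantic = torch.randint(1, SEM_VOCAB, (2, 20))
clean_semantic = stage1.generate(noisy_semantic, BOS, steps=20)  # stage 1
clean_acoustic = stage2.generate(clean_semantic, BOS, steps=40)  # stage 2
print(clean_acoustic.shape)  # (2, 40) tokens for a codec decoder
```

Decoupling the stages means the semantic stage can stabilize content before the acoustic stage commits to waveform detail, which is the stability argument the abstract makes.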

A Bias-Correction Decentralized Stochastic Gradient Algorithm with Momentum Acceleration

no code implementations • 31 Jan 2025 • Yuchen Hu, Xi Chen, Weidong Liu, Xiaojun Mao

Distributed stochastic optimization algorithms can simultaneously process large-scale datasets, significantly accelerating model training.

Stochastic Optimization

Audio Large Language Models Can Be Descriptive Speech Quality Evaluators

no code implementations • 27 Jan 2025 • Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, Eng Siong Chng

Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks.

Descriptive

An Investigation on the Potential of KAN in Speech Enhancement

no code implementations • 23 Dec 2024 • Haoyang Li, Yuchen Hu, Chen Chen, Eng Siong Chng

High-fidelity speech enhancement often requires sophisticated modeling to capture intricate, multiscale patterns.

Kolmogorov-Arnold Networks Speech Enhancement

Video-to-Audio Generation with Fine-grained Temporal Semantics

no code implementations • 23 Sep 2024 • Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu

With recent advances in AIGC, video generation has gained a surge of research interest in both academia and industry (e.g., Sora).

Audio Generation Video Generation

SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

1 code implementation • 11 Sep 2024 • Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu

In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis.

Decoder Speech Synthesis +2

MaxMind: A Memory Loop Network to Enhance Software Productivity based on Large Language Models

no code implementations • 7 Aug 2024 • Yuchen Dong, XiaoXiang Fang, Yuchen Hu, Renshuang Jiang, Zhe Jiang

Comparative experiments with SheetCopilot have demonstrated that the accumulation and recycling of task memories lead to a steady enhancement in task success rate, with an improvement rate of approximately 3%-6% per round in this implementation example.

Memorization Philosophy +1

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

no code implementations • 2 Jul 2024 • Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang

By using reverse inference as the criterion for selecting RLHF exemplars from speech samples generated by the TTS system itself, RIO steers the subsequent optimization toward improved TTS robustness.

Inference Optimization Speech Synthesis +2

Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach

no code implementations • 20 Jun 2024 • Ruohan Zhan, Shichao Han, Yuchen Hu, Zhenling Jiang

We show that the proposed estimator yields results comparable to the benchmark, whereas the standard difference-in-means estimator can exhibit significant bias and even produce reversed signs.

Recommendation Systems
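
To see why interference can bias the standard estimator, the following toy simulation (not the paper's setup, and with made-up numbers) lets items compete for a fixed slot budget, so the difference-in-means estimate is positive even though the true global effect is exactly zero:

```python
# Toy illustration of recommender interference: items compete for a fixed
# number of recommendation slots, so treating half the items steals exposure
# from the control half, inflating the difference-in-means estimate.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_slots = 1000, 100
quality = rng.normal(size=n_items)
treated = rng.random(n_items) < 0.5      # item-level randomization
boost = 0.5                              # treatment inflates the ranking score

def exposure(scores):
    # Top-k recommendation: an item is exposed iff it ranks in the top slots.
    top = np.argsort(-scores)[:n_slots]
    out = np.zeros(n_items)
    out[top] = 1.0
    return out

# Experiment: treated items get boosted scores and crowd out control items.
exp_mixed = exposure(quality + boost * treated)
dim = exp_mixed[treated].mean() - exp_mixed[~treated].mean()

# Ground truth: treat everyone vs. no one. The slot budget is fixed, so
# average exposure is identical and the true global effect is 0.
gte = exposure(quality + boost).mean() - exposure(quality).mean()
print(f"difference-in-means: {dim:.3f}, true global effect: {gte:.3f}")
```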

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

no code implementations • 2 Jun 2024 • Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang

In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers.

Speech Synthesis Text to Speech +1

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

1 code implementation • 23 May 2024 • Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
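
A generic pseudo-labeling loop in the spirit of unsupervised adaptation: decode unlabeled target-domain audio, keep only confident hypotheses, and finetune on them. The `Hypothesis` container, the mean-confidence rule, and the toy data are stand-ins; STAR's actual token-level quality indicator is more elaborate than an average score:

```python
# Hypothetical self-training selection step for ASR domain adaptation.
from dataclasses import dataclass
from typing import List

@dataclass
class Hypothesis:
    text: str
    token_confidences: List[float]  # per-token posterior-style scores

def select_pseudo_labels(hyps, threshold=0.8):
    """Keep utterances whose average token confidence clears the threshold."""
    selected = []
    for audio_id, hyp in hyps:
        conf = sum(hyp.token_confidences) / len(hyp.token_confidences)
        if conf >= threshold:
            selected.append((audio_id, hyp.text))
    return selected

# Toy decoded hypotheses standing in for real ASR output on unlabeled audio.
decoded = [
    ("utt1", Hypothesis("turn left ahead", [0.95, 0.91, 0.88])),
    ("utt2", Hypothesis("tern lift a head", [0.52, 0.61, 0.48, 0.55])),
]
pseudo_labeled = select_pseudo_labels(decoded)
print(pseudo_labeled)  # only utt1 survives; it would be used for finetuning
```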

Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

no code implementations • 16 May 2024 • Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
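
A minimal sketch of the GER input format, with a hypothetical `build_ger_prompt` helper and made-up wording; the paper finetunes the LLM on N-best-to-transcription pairs rather than relying on a fixed prompt:

```python
# Pack an ASR decoder's N-best list into a prompt for an LLM to correct.
def build_ger_prompt(nbest):
    lines = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return (
        "Below are the N-best hypotheses from a speech recognizer.\n"
        f"{lines}\n"
        "Report the most likely true transcription."
    )

nbest = [
    "i red the book on the plain",
    "i read the book on the plane",
    "i read the book on the plain",
]
prompt = build_ger_prompt(nbest)
print(prompt)  # hand this to the LLM of your choice
```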

Overcoming Catastrophic Forgetting by Exemplar Selection in Task-oriented Dialogue System

no code implementations • 16 May 2024 • Chen Chen, Ruizhe Li, Yuchen Hu, YuanYuan Chen, Chengwei Qin, Qiang Zhang

Experimental results show that HESIT effectively alleviates catastrophic forgetting by exemplar selection, and achieves state-of-the-art performance on the largest CL benchmark of ToDs in terms of all metrics.

Continual Learning Task-Oriented Dialogue Systems

Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?

no code implementations • 19 Apr 2024 • Chengwei Qin, Wenhan Xia, Tan Wang, Fangkai Jiao, Yuchen Hu, Bosheng Ding, Ruirui Chen, Shafiq Joty

One key finding in psychology is that compared with irrelevant past experiences, recalling relevant ones can help humans better handle new tasks.

GSM8K

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

1 code implementation • 10 Feb 2024 • Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result.

Machine Translation Speech-to-Speech Translation +1

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

1 code implementation • 8 Feb 2024 • Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, EnSiong Chng, Chao-Han Huck Yang

Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output.

Ranked #4 on Speech Recognition on WSJ eval92 (using extra training data)

Audio-Visual Speech Recognition Automatic Speech Recognition +3

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

1 code implementation • 19 Jan 2024 • Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +6
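
As a crude proxy for the intuition that noisier audio yields a more disagreeing N-best list, the sketch below scores hypothesis disagreement with pairwise word-level edit distances; the paper instead learns the language-space embedding and refines it with audio information, so this scalar is illustration only:

```python
# Disagreement of an N-best list as a cheap, hypothetical "noise" feature.
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wa != wb))
    return dp[-1]

def nbest_disagreement(nbest):
    """Mean pairwise edit distance across the N-best hypotheses."""
    pairs = [(i, j) for i in range(len(nbest))
             for j in range(i + 1, len(nbest))]
    return sum(edit_distance(nbest[i], nbest[j]) for i, j in pairs) / len(pairs)

clean = ["turn left ahead"] * 3 + ["turn left ahead now"]
noisy = ["turn left ahead", "tern lift a head", "turn lift ahead", "burn left bed"]
print(nbest_disagreement(clean), nbest_disagreement(noisy))  # noisy disagrees more
```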

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

1 code implementation • 7 Jan 2024 • Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, LiRong Dai

Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs.

Audio-Visual Speech Recognition Automatic Speech Recognition +7

Improving In-context Learning via Bidirectional Alignment

no code implementations • 28 Dec 2023 • Chengwei Qin, Wenhan Xia, Fangkai Jiao, Chen Chen, Yuchen Hu, Bosheng Ding, Shafiq Joty

Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL).

In-Context Learning

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

no code implementations • 28 Aug 2023 • Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, LiRong Dai, Jie Zhang

Noise-robust TTS models are often trained on enhanced speech and thus suffer from the speech distortion and residual background noise that degrade the quality of the synthesized speech.

Speech Enhancement Text to Speech

Noise-aware Speech Enhancement using Diffusion Probabilistic Model

1 code implementation • 16 Jul 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

In this paper, we propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process of the diffusion model.

Denoising model +3
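
A schematic, hypothetical rendering of the conditioning idea: a noise-type embedding (in the paper, extracted by a noise classifier) is injected into every reverse diffusion step. The tiny score network, step count, and update rule below are toy stand-ins, not the paper's model:

```python
# Toy noise-conditioned reverse diffusion loop (all shapes illustrative).
import torch
import torch.nn as nn

class ConditionedScoreNet(nn.Module):
    def __init__(self, feat_dim=80, cond_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x, t, noise_emb):
        # Concatenate state, timestep, and the noise-type embedding.
        t_feat = t.expand(x.size(0), 1)
        return self.net(torch.cat([x, t_feat, noise_emb], dim=-1))

score_net = ConditionedScoreNet()
noise_emb = torch.randn(4, 16)          # from a pretrained noise classifier
x = torch.randn(4, 80)                  # start the reverse process from noise
for step in reversed(range(1, 11)):     # toy 10-step reverse process
    t = torch.tensor([[step / 10.0]])
    eps = score_net(x, t, noise_emb)    # predicted noise, guided by noise_emb
    x = x - 0.1 * eps                   # schematic denoising update
print(x.shape)
```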

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

1 code implementation • 18 Jun 2023 • Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng

In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any testing noise without depending on noisy training data, a.k.a. unsupervised noise adaptation.

Audio-Visual Speech Recognition speech-recognition +1

A Neural State-Space Model Approach to Efficient Speech Separation

1 code implementation • 26 May 2023 • Chen Chen, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng

In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM).

Representation Learning Speech Separation +1
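
For readers unfamiliar with SSMs, here is the bare linear recurrence such layers are built on; real S4-style layers use structured transition matrices and convolutional evaluation for efficiency, so this naive loop is only meant to show the recurrence itself:

```python
# Toy discrete state-space filter: x_k = A x_{k-1} + B u_k, y_k = C x_k.
import torch

def ssm_scan(u, A, B, C):
    """u: (T, in_dim) input sequence; returns y: (T,) filtered output."""
    x = torch.zeros(A.size(0))
    ys = []
    for u_k in u:                       # linear recurrence over time
        x = A @ x + B @ u_k
        ys.append(C @ x)
    return torch.stack(ys)

T, state, in_dim = 100, 8, 1
A = 0.9 * torch.eye(state)              # stable toy transition matrix
B = torch.randn(state, in_dim)
C = torch.randn(state)
u = torch.randn(T, in_dim)
print(ssm_scan(u, A, B, C).shape)       # (100,) filtered sequence
```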

Eeg2vec: Self-Supervised Electroencephalographic Representation Learning

no code implementations • 23 May 2023 • Qiushi Zhu, Xiaoying Zhao, Jie Zhang, Yu Gu, Chao Weng, Yuchen Hu

Recently, many efforts have been made to explore how the brain processes speech using electroencephalographic (EEG) signals, where deep learning-based approaches were shown to be applicable in this field.

EEG Representation Learning

UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

1 code implementation • 16 May 2023 • Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, Eng Siong Chng

Multimodal learning aims to imitate how humans acquire complementary information from multiple modalities for various downstream tasks.

Contrastive Learning Image-text Classification +2

Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

1 code implementation • 16 May 2023 • Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng

However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task.

Audio-Visual Speech Recognition Automatic Speech Recognition +3
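
A minimal contrast between the concatenation fusion criticized above and an explicit cross-modal interaction, here a single audio-to-visual cross-attention layer; the dimensions are illustrative, and the paper's global-interaction and local-alignment modules are richer than this:

```python
# Concatenation fusion vs. cross-attention interaction for AVSR features.
import torch
import torch.nn as nn

B, T, D = 2, 50, 256
audio = torch.randn(B, T, D)            # frame-level audio features
visual = torch.randn(B, T, D)           # frame-level lip-region features

# Baseline: fuse by concatenation, no interaction between the modalities.
fused_concat = torch.cat([audio, visual], dim=-1)         # (B, T, 2D)

# Interaction: audio attends to visual, so each audio frame can pull in the
# visual evidence most relevant to it.
cross_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
attended, _ = cross_attn(query=audio, key=visual, value=visual)
fused_interactive = audio + attended                      # (B, T, D)
print(fused_concat.shape, fused_interactive.shape)
```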

Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR

no code implementations • 11 Apr 2023 • Yuchen Hu, Chen Chen, Qiushi Zhu, Eng Siong Chng

Second, during finetuning we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations, which enables discovery and restoration of high-quality clean representations with reduced distortions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
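
A toy rendering of the codebook-lookup step, assuming a random codebook and a linear predictor in place of the paper's learned clean-speech prior and Transformer code predictor:

```python
# Map noisy frame features to clean-codebook indices, then look them up.
import torch
import torch.nn as nn

D, K = 256, 512
codebook = nn.Embedding(K, D)           # clean-speech codebook (learned prior)
predictor = nn.Linear(D, K)             # stand-in for the Transformer predictor

noisy_feats = torch.randn(2, 100, D)    # (batch, frames, dim) noisy features
logits = predictor(noisy_feats)         # score each codebook entry per frame
codes = logits.argmax(dim=-1)           # discrete clean-code prediction
restored = codebook(codes)              # lookup: reduced-distortion features
print(restored.shape)                   # (2, 100, 256), fed to the ASR model
```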

Metric-oriented Speech Enhancement using Diffusion Probabilistic Model

no code implementations • 23 Feb 2023 • Chen Chen, Yuchen Hu, Weiwei Weng, Eng Siong Chng

Deep neural network based speech enhancement techniques focus on learning a noisy-to-clean transformation supervised by paired training data.

model Speech Enhancement

Unsupervised Noise adaptation using Data Simulation

no code implementations • 23 Feb 2023 • Chen Chen, Yuchen Hu, Heqing Zou, Linhui Sun, Eng Siong Chng

Deep neural network based speech enhancement approaches aim to learn a noisy-to-clean transformation using a supervised learning paradigm.

Domain Adaptation Generative Adversarial Network +1

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

1 code implementation • 22 Feb 2023 • Yuchen Hu, Chen Chen, Heqing Zou, Xionghu Zhong, Eng Siong Chng

To alleviate this problem, we propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.

Multi-Task Learning Speech Enhancement +2

Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition

1 code implementation • 22 Feb 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

In this paper, we propose a simple yet effective approach called gradient remedy (GR) to solve interference between task gradients in noise-robust speech recognition, from perspectives of both angle and magnitude.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
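
A schematic version of resolving task-gradient interference from both the angle and magnitude perspectives; the projection and rescaling rules below follow the generic recipe (e.g., a PCGrad-style projection), not GR's exact formulation, and `g_asr` / `g_se` stand for flattened ASR and enhancement gradients:

```python
# Remove angle conflict by projection, then rein in the magnitude.
import torch

def remedy(g_asr, g_se, eps=1e-8):
    cos = torch.dot(g_asr, g_se) / (g_asr.norm() * g_se.norm() + eps)
    if cos < 0:
        # Angle: project the auxiliary gradient onto the normal plane of the
        # main-task gradient so it no longer points against it.
        g_se = g_se - torch.dot(g_se, g_asr) / (g_asr.norm() ** 2 + eps) * g_asr
    # Magnitude: rescale the auxiliary gradient so it cannot dominate.
    g_se = g_se * (g_asr.norm() / (g_se.norm() + eps))
    return g_asr + g_se

g_asr = torch.tensor([1.0, 0.0])
g_se = torch.tensor([-1.0, 4.0])        # conflicts with g_asr (negative cosine)
print(remedy(g_asr, g_se))              # conflict removed before the update
```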

The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation

no code implementations • 24 Jun 2022 • Leilei Cao, Zhuang Li, Bo Yan, Feng Zhang, Fengliang Qi, Yuchen Hu, Hongbin Wang

The referring video object segmentation task (RVOS) aims to segment object instances in a given video referred by a language expression in all video frames.

Object object-detection +6

Self-critical Sequence Training for Automatic Speech Recognition

no code implementations • 13 Apr 2022 • Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng

Although the automatic speech recognition (ASR) task has achieved remarkable success with sequence-to-sequence models, two main mismatches between training and testing can degrade performance: 1) the commonly used cross-entropy criterion maximizes the log-likelihood of the training data, while performance is evaluated by word error rate (WER), not log-likelihood; 2) teacher forcing makes the model depend on the ground truth during training, so the model is never exposed to its own predictions before testing.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
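
A compact sketch of the self-critical idea applied to ASR, assuming negative WER as the reward and mocked decoding; in practice the log-probability comes from the seq2seq model's sampled hypothesis, and the greedy decode serves as the reward baseline:

```python
# REINFORCE with a self-critical baseline: reward = -WER, baseline = greedy.
import torch

def wer(ref, hyp):
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    dp = list(range(len(h) + 1))
    for i, wr in enumerate(r, 1):
        prev, dp[0] = dp[0], i
        for j, wh in enumerate(h, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wr != wh))
    return dp[-1] / max(len(r), 1)

reference = "the cat sat on the mat"
greedy_hyp = "the cat sat on a mat"      # baseline decode
sampled_hyp = "the cat sat on the mat"   # sampled decode
log_prob = torch.tensor(-4.2, requires_grad=True)  # model's log p(sampled)

advantage = wer(reference, greedy_hyp) - wer(reference, sampled_hyp)
loss = -advantage * log_prob             # sampled hypothesis beat the baseline
loss.backward()                          # gradient pushes up p(sampled)
print(advantage, log_prob.grad)
```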

Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning

no code implementations • 29 Mar 2022 • Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng

Automated Audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio.

Audio captioning Contrastive Learning

Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data

no code implementations • 29 Mar 2022 • Chen Chen, Nana Hou, Yuchen Hu, Shashank Shirol, Eng Siong Chng

Noise-robust speech recognition systems require large amounts of training data, including noisy speech and corresponding transcripts, to achieve state-of-the-art performance in the face of various practical environments.

Generative Adversarial Network Robust Speech Recognition +1

Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition

1 code implementation • 28 Mar 2022 • Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng

Then, we propose style learning to map the fused feature close to the clean feature, in order to learn latent speech information from the latter, i.e., the clean "speech style".

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Off-Policy Evaluation in Partially Observed Markov Decision Processes under Sequential Ignorability

no code implementations • 24 Oct 2021 • Yuchen Hu, Stefan Wager

We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP).

Off-policy evaluation

Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition

2 code implementations • 11 Oct 2021 • Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng

Speech enhancement (SE) aims to suppress the additive noise from a noisy speech signal to improve the speech's perceptual quality and intelligibility.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
