Search Results for author: Qiushi Zhu

Found 9 papers, 5 papers with code

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

1 code implementation • 7 Jan 2024 • Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, LiRong Dai

Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs.

Tasks: Audio-Visual Speech Recognition, Automatic Speech Recognition, +7
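
The released code contains the full pre-training recipe; purely as an illustration of the input side, the minimal PyTorch sketch below shows one way to fuse multichannel audio with a video stream before a shared wav2vec2-style Transformer context network. All module names, feature dimensions, and the naive frame-rate alignment are assumptions made for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultichannelAVFusion(nn.Module):
    """Sketch: fuse C audio channels plus a video stream into one sequence
    for a wav2vec2-style Transformer context network (hypothetical dims)."""

    def __init__(self, n_channels=4, feat_dim=256, d_model=512):
        super().__init__()
        # shared 1-D conv feature encoder, applied to every audio channel
        self.audio_enc = nn.Conv1d(1, feat_dim, kernel_size=10, stride=5)
        self.video_enc = nn.Linear(96, feat_dim)  # assumes precomputed lip features
        self.proj = nn.Linear(feat_dim * (n_channels + 1), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio, video):
        # audio: (B, C, T_wav); video: (B, T_frames, 96)
        B, C, _ = audio.shape
        feats = [self.audio_enc(audio[:, c:c + 1]) for c in range(C)]  # (B, F, T') each
        a = torch.cat(feats, dim=1).transpose(1, 2)                    # (B, T', C*F)
        v = self.video_enc(video).transpose(1, 2)                      # (B, F, T_frames)
        # naively resample video to the audio frame rate (a real system aligns properly)
        v = nn.functional.interpolate(v, size=a.size(1)).transpose(1, 2)
        x = self.proj(torch.cat([a, v], dim=-1))                       # (B, T', d_model)
        return self.context(x)  # contextual features for masked prediction

model = MultichannelAVFusion()
out = model(torch.randn(2, 4, 3200), torch.randn(2, 25, 96))
print(out.shape)  # torch.Size([2, 639, 512])
```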

Rep2wav: Noise Robust Text-to-Speech Using Self-Supervised Representations

no code implementations • 28 Aug 2023 • Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, LiRong Dai, Jie Zhang

Noise-robust TTS models are often trained on enhanced speech, which still suffers from residual distortion and background noise that degrade the quality of the synthesized speech.

Tasks: Speech Enhancement

Noise-aware Speech Enhancement using Diffusion Probabilistic Model

1 code implementation • 16 Jul 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

Specifically, we design a noise classification (NC) model to produce acoustic embedding as a noise conditioner for guiding the reverse denoising process.

Tasks: Denoising, Multi-Task Learning, +2
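
As a rough sketch of that conditioning idea only (not the paper's architecture; every module name, dimension, and the FiLM-style injection below are assumptions), a noise classifier's embedding can be fed to the denoiser at each reverse step:

```python
import torch
import torch.nn as nn

class NoiseClassifier(nn.Module):
    """Sketch: classify the noise type and expose its embedding as a conditioner."""

    def __init__(self, n_mels=80, emb_dim=128, n_noise_types=10):
        super().__init__()
        self.backbone = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.head = nn.Linear(emb_dim, n_noise_types)

    def forward(self, noisy_mel):            # (B, T, n_mels)
        _, h = self.backbone(noisy_mel)      # h: (1, B, emb_dim)
        emb = h.squeeze(0)                   # acoustic noise embedding
        return self.head(emb), emb

class ConditionedDenoiser(nn.Module):
    """Sketch: one reverse denoising step, conditioned on the noise embedding
    via a FiLM-style scale/shift (an assumed mechanism, not the paper's)."""

    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.film = nn.Linear(emb_dim, 2 * n_mels)
        self.net = nn.Sequential(nn.Linear(n_mels, 256), nn.SiLU(), nn.Linear(256, n_mels))

    def forward(self, x_t, noise_emb):       # x_t: (B, T, n_mels)
        scale, shift = self.film(noise_emb).chunk(2, dim=-1)
        x = x_t * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return self.net(x)                   # e.g. predicted noise at this step

nc, denoiser = NoiseClassifier(), ConditionedDenoiser()
x_t = torch.randn(2, 100, 80)               # noisy mel at diffusion step t
_, emb = nc(x_t)
eps_hat = denoiser(x_t, emb)
```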

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

1 code implementation • 18 Jun 2023 • Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng

In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any testing noise without depending on noisy training data, a.k.a. unsupervised noise adaptation.

Tasks: Audio-Visual Speech Recognition, speech-recognition, +1

Eeg2vec: Self-Supervised Electroencephalographic Representation Learning

no code implementations • 23 May 2023 • Qiushi Zhu, Xiaoying Zhao, Jie Zhang, Yu Gu, Chao Weng, Yuchen Hu

Recently, many efforts have been made to explore how the brain processes speech using electroencephalographic (EEG) signals, and deep learning-based approaches have been shown to be applicable in this field.

Tasks: EEG, Representation Learning

Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

1 code implementation • 16 May 2023 • Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng

However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task.

Tasks: Audio-Visual Speech Recognition, Automatic Speech Recognition, +3
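
The snippet below makes the contrast concrete: a plain concatenation baseline versus a cross-attention step in which audio queries attend to visual keys and values. It is a generic sketch of cross-modal interaction, not the paper's actual module; all dimensions are assumed.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: audio attends to video before fusion, rather than plain concatenation."""

    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, audio, video):          # both (B, T, d_model)
        # concatenation baseline would be: torch.cat([audio, video], dim=-1)
        av, _ = self.cross_attn(query=audio, key=video, value=video)
        return self.out(torch.cat([audio, av], dim=-1))  # interaction-aware fusion

fuse = CrossModalFusion()
a, v = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
print(fuse(a, v).shape)  # torch.Size([2, 50, 256])
```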

Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR

no code implementations • 11 Apr 2023 • Yuchen Hu, Chen Chen, Qiushi Zhu, Eng Siong Chng

Second, during finetuning we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations, which enables discovery and restoration of high-quality clean representations with reduced distortions.

Tasks: Automatic Speech Recognition, Automatic Speech Recognition (ASR), +3
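
A minimal sketch of the "predict codes, then look them up" idea follows; the codebook size, dimensions, and module layout are assumptions, and a greedy argmax stands in for the paper's actual prediction scheme.

```python
import torch
import torch.nn as nn

class CodePredictor(nn.Module):
    """Sketch: predict discrete clean-speech codes from noisy representations,
    then restore features by looking them up in a learned codebook."""

    def __init__(self, d_model=256, codebook_size=1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, d_model)  # learned in pre-training
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # global dependency
        self.logits = nn.Linear(d_model, codebook_size)

    def forward(self, noisy_feats):            # (B, T, d_model)
        h = self.encoder(noisy_feats)
        codes = self.logits(h).argmax(dim=-1)  # (B, T) predicted clean code ids
        return self.codebook(codes)            # (B, T, d_model) restored features

restorer = CodePredictor()
clean_feats = restorer(torch.randn(2, 100, 256))
```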

Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition

1 code implementation • 22 Feb 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

In this paper, we propose a simple yet effective approach called gradient remedy (GR) to resolve interference between task gradients in noise-robust speech recognition, from the perspectives of both angle and magnitude.

Tasks: Automatic Speech Recognition, Automatic Speech Recognition (ASR), +4
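
To illustrate the general idea (not GR's exact rules), the sketch below resolves a two-task gradient conflict with a PCGrad-style projection for the angle and a simple norm cap for the magnitude, operating on flattened gradient vectors; both rules are assumptions chosen for the illustration.

```python
import torch

def remedy_gradients(g_main, g_aux, eps=1e-8):
    """Sketch: resolve conflict between a main-task and an auxiliary-task gradient.
    Angle: if g_aux points against g_main, project it onto g_main's normal plane.
    Magnitude: cap g_aux so it cannot dominate the main-task gradient."""
    dot = torch.dot(g_aux, g_main)
    if dot < 0:  # obtuse angle -> conflicting directions
        g_aux = g_aux - dot / (g_main.norm() ** 2 + eps) * g_main
    scale = min(1.0, (g_main.norm() / (g_aux.norm() + eps)).item())
    return g_main + scale * g_aux

g_asr = torch.randn(1000)  # flattened gradient of the ASR loss
g_enh = torch.randn(1000)  # flattened gradient of an enhancement loss
g = remedy_gradients(g_asr, g_enh)
```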

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

no code implementations • 21 Nov 2022 • Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, LiRong Dai, Daxin Jiang, Jinyu Li, Furu Wei

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision and text.

Tasks: Audio-Visual Speech Recognition, Language Modelling, +3
