no code implementations • 10 Mar 2025 • Kuo-Hsuan Hung, Xugang Lu, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Yi Lin, Chii-Wann Lin, Yu Tsao
In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference.
no code implementations • 7 Jan 2025 • Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Chao-Han Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-Yi Lee, Szu-Wei Fu
Neural speech editing advancements have raised concerns about their misuse in spoofing attacks.
no code implementations • 8 Nov 2024 • Yen-Ting Lin, Chao-Han Huck Yang, Zhehuai Chen, Piotr Zelasko, Xuesong Yang, Zih-Ching Chen, Krishna C Puvvada, Szu-Wei Fu, Ke Hu, Jun Wei Chiu, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang
On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reduction on the Hyporadise benchmark.
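The figures above are relative (not absolute) WER reductions. As a minimal sketch of how such a number is computed (the formula is standard; the example values are illustrative, not from the paper):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent, of new_wer versus baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# e.g. a baseline WER of 20.0% lowered to 16.9% is a 15.5% relative reduction
print(relative_wer_reduction(20.0, 16.9))
```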
1 code implementation • 29 Oct 2024 • Pin-Yen Huang, Szu-Wei Fu, Yu Tsao
In this paper, we present RankUp, a simple yet effective approach that adapts existing semi-supervised classification techniques to enhance the performance of regression tasks.
no code implementations • 30 Sep 2024 • Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-Yi Lee
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models.
1 code implementation • 24 Sep 2024 • Pin-Jui Ku, Alexander H. Liu, Roman Korostik, Sung-Feng Huang, Szu-Wei Fu, Ante Jukić
The proposed method is evaluated on multiple speech restoration tasks, including speech denoising, bandwidth extension, codec artifact removal, and target speaker extraction.
no code implementations • 27 Jun 2024 • Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-Yi Lee
Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities of large language models (LLMs).
1 code implementation • 10 May 2024 • Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao
This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task.
Ranked #4 on Speech Enhancement on VoiceBank + DEMAND
1 code implementation • 26 Feb 2024 • Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang
To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced.
no code implementations • 15 Nov 2023 • Hsin-Tien Chiang, Szu-Wei Fu, Hsin-Min Wang, Yu Tsao, John H. L. Hansen
Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to OOD datasets.
1 code implementation • 22 Sep 2023 • Ryandhimas E. Zezario, Yu-Wen Chen, Szu-Wei Fu, Yu Tsao, Hsin-Min Wang, Chiou-Shann Fuh
This research introduces MOSA-Net+, an enhanced version of the multi-objective speech assessment model MOSA-Net, by leveraging acoustic features from Whisper, a large-scale weakly supervised model.
no code implementations • 10 Jul 2023 • Hsin-Tien Chiang, Kuo-Hsuan Hung, Szu-Wei Fu, Heng-Cheng Kuo, Ming-Hsueh Tsai, Yu Tsao
Moreover, new objective measures are proposed that combine current objective measures using deep learning techniques to predict subjective quality and intelligibility.
1 code implementation • 30 Jun 2023 • Hsi-Che Lin, Chien-Yi Wang, Min-Hung Chen, Szu-Wei Fu, Yu-Chiang Frank Wang
This technical report describes our QuAVF@NTU-NVIDIA submission to the Ego4D Talking to Me (TTM) Challenge 2023.
no code implementations • 2 Apr 2023 • Szu-Wei Fu, Yaran Fan, Yasaman Hosseinkashi, Jayant Gupchup, Ross Cutler
To drive adoption and improve inclusiveness (and participation), we present a machine learning-based system that predicts when a meeting participant attempts to take the floor but fails to interrupt (termed a 'failed interruption').
no code implementations • 24 Oct 2022 • Quchen Fu, Szu-Wei Fu, Yaran Fan, Yu Wu, Zhuo Chen, Jayant Gupchup, Ross Cutler
Meetings are an essential form of communication for all types of organizations, and remote collaboration systems have been much more widely used since the COVID-19 pandemic.
1 code implementation • 7 Apr 2022 • Kuo-Hsuan Hung, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin
We further study the relationship between the noise robustness of SSL representation via clean-noisy distance (CN distance) and the layer importance for SE.
Ranked #20 on Speech Enhancement on VoiceBank + DEMAND
no code implementations • 7 Apr 2022 • Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention.
1 code implementation • 31 Mar 2022 • Rong Chao, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao
Specifically, the contrast of target features is stretched based on perceptual importance, thereby improving the overall SE performance.
Ranked #14 on Speech Enhancement on VoiceBank + DEMAND
no code implementations • 10 Nov 2021 • Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao, Mirco Ravanelli
Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers.
no code implementations • 8 Nov 2021 • Yu-Chen Lin, Cheng Yu, Yi-Te Hsu, Szu-Wei Fu, Yu Tsao, Tei-Wei Kuo
In this paper, a novel sign-exponent-only floating-point network (SEOFP-NET) technique is proposed to compress the model size and accelerate the inference time for speech enhancement, a regression task of speech signal processing.
1 code implementation • 3 Nov 2021 • Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Experimental results show that, in perceptual evaluation of speech quality (PESQ) prediction, MOSA-Net improves the linear correlation coefficient (LCC) by 0.026 (0.990 vs. 0.964 in seen noise environments) and 0.012 (0.969 vs. 0.957 in unseen noise environments) over Quality-Net, an existing single-task PESQ prediction model; in short-time objective intelligibility (STOI) prediction, it improves LCC by 0.021 (0.985 vs. 0.964 in seen noise environments) and 0.047 (0.836 vs. 0.789 in unseen noise environments) over STOI-Net (based on CRNN), an existing single-task STOI prediction model.
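The LCC reported above is the standard Pearson correlation between predicted and ground-truth scores. A minimal stdlib sketch of that metric (not the paper's evaluation code):

```python
import math

def lcc(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences,
    e.g. predicted vs. ground-truth PESQ scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```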
2 code implementations • 12 Oct 2021 • Szu-Wei Fu, Cheng Yu, Kuo-Hsuan Hung, Mirco Ravanelli, Yu Tsao
Most of the deep learning-based speech enhancement models are learned in a supervised manner, which implies that pairs of noisy and clean speech are required during training.
4 code implementations • 8 Jun 2021 • Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato de Mori, Yoshua Bengio
SpeechBrain is an open-source and all-in-one speech toolkit.
3 code implementations • 8 Apr 2021 • Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao
The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory.
Ranked #23 on Speech Enhancement on VoiceBank + DEMAND
1 code implementation • 9 Nov 2020 • Ryandhimas E. Zezario, Szu-Wei Fu, Chiou-Shann Fuh, Yu Tsao, Hsin-Min Wang
To overcome this limitation, we propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net.
1 code implementation • 28 Oct 2020 • Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao
Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g., phones and syllables.
Ranked #23 on Speech Enhancement on VoiceBank + DEMAND
1 code implementation • 21 Aug 2020 • Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang, Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing it to serve as a platform for utilizing and evaluating SE models and for flexibly extending them to various noise environments and users.
no code implementations • 18 Jun 2020 • Szu-Wei Fu, Chien-Feng Liao, Tsun-An Hsieh, Kuo-Hsuan Hung, Syu-Siang Wang, Cheng Yu, Heng-Cheng Kuo, Ryandhimas E. Zezario, You-Jin Li, Shang-Yi Chuang, Yen-Ju Lu, Yu Tsao
The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications.
1 code implementation • Interspeech 2020 • Haoyu Li, Szu-Wei Fu, Yu Tsao, Junichi Yamagishi
The intelligibility of natural speech is seriously degraded when exposed to adverse noisy environments.
Audio and Speech Processing • Sound
no code implementations • 22 Nov 2019 • Cheng Yu, Kuo-Hsuan Hung, Syu-Siang Wang, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung
Previous studies have proven that integrating video signals, as a complementary modality, can facilitate improved performance for speech enhancement (SE).
no code implementations • 26 Sep 2019 • Natalie Yu-Hsien Wang, Hsiao-Lan Sharon Wang, Tao-Wei Wang, Szu-Wei Fu, Xugang Lu, Yu Tsao, Hsin-Min Wang
Recently, a time-domain speech enhancement algorithm based on fully convolutional neural networks (FCN) with a short-time objective intelligibility (STOI)-based objective function (termed FCN(S) for short) has received increasing attention due to its simple structure and its effectiveness in restoring clean speech signals from noisy counterparts.
Denoising • Speech Enhancement • +1 • Sound • Audio and Speech Processing
no code implementations • 26 Sep 2019 • Rung-Yu Tseng, Tao-Wei Wang, Szu-Wei Fu, Yu Tsao, Chia-Ying Lee
Speech perception is a key to verbal communication.
Speech Enhancement • Sound • Audio and Speech Processing
no code implementations • 31 May 2019 • Jyun-Yi Wu, Cheng Yu, Szu-Wei Fu, Chih-Ting Liu, Shao-Yi Chien, Yu Tsao
In addition, a parameter quantization (PQ) technique was applied to reduce the size of a neural network by representing weights with fewer cluster centroids.
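Parameter quantization by cluster centroids amounts to weight sharing: each weight is replaced by the nearest of a small set of centroids, so only centroid values and per-weight indices need be stored. A minimal sketch of the assignment step (the centroids would normally come from k-means over the trained weights; here they are supplied directly, and this is an illustration, not the paper's implementation):

```python
def quantize_weights(weights, centroids):
    """Replace each weight with its nearest cluster centroid (weight sharing).
    After this step, the network only needs to store the centroid table plus
    one small index per weight."""
    return [min(centroids, key=lambda c: abs(w - c)) for w in weights]

# illustrative values: 3 centroids shared across all weights
print(quantize_weights([0.11, -0.52, 0.48], [-0.5, 0.0, 0.5]))
```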
5 code implementations • 13 May 2019 • Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, Shou-De Lin
Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores.
Ranked #37 on Speech Enhancement on VoiceBank + DEMAND
1 code implementation • 6 May 2019 • Szu-Wei Fu, Chien-Feng Liao, Yu Tsao
Utilizing a human-perception-related objective function to train a speech enhancement model has become a popular topic recently.
7 code implementations • 17 Apr 2019 • Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang
In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.
no code implementations • 17 Aug 2018 • Yi-Te Hsu, Yu-Chen Lin, Szu-Wei Fu, Yu Tsao, Tei-Wei Kuo
We evaluated the proposed EOFP quantization technique on two types of neural networks, namely, bidirectional long short-term memory (BLSTM) and fully convolutional neural network (FCN), on a speech enhancement task.
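The underlying idea of exponent-only floating-point (EOFP) style compression is to keep a float's sign and exponent while discarding mantissa bits. A rough stdlib sketch of mantissa truncation on IEEE-754 float32 values (a simplified illustration of the bit layout involved, not the authors' exact scheme):

```python
import struct

def truncate_mantissa(x: float, keep_mantissa_bits: int) -> float:
    """Zero out the low mantissa bits of a float32 value, keeping the sign bit,
    the 8 exponent bits, and only `keep_mantissa_bits` of the 23 mantissa bits.
    With keep_mantissa_bits=0, only sign and exponent survive."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    drop = 23 - keep_mantissa_bits
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(truncate_mantissa(3.14159, 3))   # coarse value on an 1/8-mantissa grid
print(truncate_mantissa(-3.14159, 0))  # sign and exponent only
```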
no code implementations • 16 Aug 2018 • Szu-Wei Fu, Yu Tsao, Hsin-Te Hwang, Hsin-Min Wang
The evaluation of utterance-level quality in Quality-Net is based on the frame-level assessment.
no code implementations • 12 Sep 2017 • Szu-Wei Fu, Tao-Wei Wang, Yu Tsao, Xugang Lu, Hisashi Kawai
For example, in measuring speech intelligibility, most evaluation metrics are based on the short-time objective intelligibility (STOI) measure, while the frame-based minimum mean square error (MMSE) between estimated and clean speech is widely used in optimizing the model.
Automatic Speech Recognition (ASR) • +3
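The frame-based MMSE objective contrasted with STOI above is just a mean squared error computed over short frames and averaged. A minimal sketch on raw sample lists (illustrative only; frame length is a made-up toy value, and real systems operate on windowed spectra or waveforms):

```python
def frame_mse(estimated, clean, frame_len=4):
    """Mean squared error computed per frame, then averaged over frames --
    the kind of frame-based MMSE training objective that does not directly
    optimize perceptual metrics such as STOI."""
    assert len(estimated) == len(clean)
    frame_errors = [
        sum((e - c) ** 2
            for e, c in zip(estimated[i:i + frame_len], clean[i:i + frame_len]))
        / len(clean[i:i + frame_len])
        for i in range(0, len(clean), frame_len)
    ]
    return sum(frame_errors) / len(frame_errors)
```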
no code implementations • 27 Apr 2017 • Szu-Wei Fu, Ting-yao Hu, Yu Tsao, Xugang Lu
This paper aims to address two issues existing in the current speech enhancement methods: 1) the difficulty of phase estimations; 2) a single objective function cannot consider multiple metrics simultaneously.
no code implementations • 7 Mar 2017 • Szu-Wei Fu, Yu Tsao, Xugang Lu, Hisashi Kawai
Because the fully connected layers, which are involved in deep neural networks (DNN) and convolutional neural networks (CNN), may not accurately characterize the local information of speech signals, particularly with high frequency components, we employed fully convolutional layers to model the waveform.