no code implementations • 2 Jul 2024 • Syu-Siang Wang, Jia-Yang Chen, Bo-Ren Bai, Shih-Hau Fang, Yu Tsao
The integration of the MaskQSS predictor enhanced the ability of HL-StarGAN to transform face-masked speech into high-quality speech; this improvement is evident in both objective and subjective tests, in which the system outperformed conventional StarGAN- and CycleGAN-based systems.
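The snippet does not spell out how the predictor is wired into training; a minimal sketch of one plausible mechanism is to treat a pretrained quality predictor as an auxiliary loss on the generator output. The names `mask_qss`, `generator`, and `discriminator` below are hypothetical placeholders, not the authors' code.

```python
import torch

# Hypothetical sketch: a pretrained quality predictor ("mask_qss") scores
# the converted speech, and its predicted quality is maximized alongside
# the usual adversarial term when updating the generator.
def generator_step(generator, discriminator, mask_qss, masked_speech, lam=1.0):
    converted = generator(masked_speech)          # mask-removed speech estimate
    adv_loss = -discriminator(converted).mean()   # non-saturating GAN term
    quality = mask_qss(converted).mean()          # predicted quality score
    return adv_loss - lam * quality               # higher quality -> lower loss
```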
no code implementations • 9 Jul 2023 • Wen-Yuan Ting, Syu-Siang Wang, Yu Tsao, Borching Su
Beamforming techniques are popular in speech-related applications due to their effective spatial filtering capabilities.
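To make the spatial-filtering idea concrete, here is a minimal delay-and-sum beamformer in NumPy (a classical baseline, not the method of the listed paper), assuming integer steering delays toward the target direction:

```python
import numpy as np

def delay_and_sum(mic_signals, delays_samples):
    """Align each microphone channel by its steering delay and average.

    mic_signals: (n_mics, n_samples) array.
    delays_samples: integer delay of each channel toward the target.
    """
    aligned = [np.roll(x, -d) for x, d in zip(mic_signals, delays_samples)]
    return np.mean(aligned, axis=0)

# Example: two channels, the second lagging the first by 3 samples.
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)
mics = np.stack([clean, np.roll(clean, 3)])
enhanced = delay_and_sum(mics, [0, 3])
```

Signals aligned on the target add coherently, while noise arriving from other directions adds incoherently and is attenuated.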
no code implementations • 22 Feb 2022 • Syu-Siang Wang, Chi-Te Wang, Chih-Chung Lai, Yu Tsao, Shih-Hau Fang
The experiments were conducted on a large-scale database comprising 1,045 continuous speech recordings collected by the speech clinic of a hospital between 2012 and 2019.
no code implementations • 19 Oct 2021 • Wen-Yuan Ting, Syu-Siang Wang, Hsin-Li Chang, Borching Su, Yu Tsao
Herein, we investigate a potential limitation of the clean-to-noisy conversion part and propose a novel noise-informed training (NIT) approach to improve the performance of the original CycleGAN SE system.
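The abstract does not detail the NIT mechanism; one hedged reading is that the clean-to-noisy generator is conditioned on noise information, for example a noise-type embedding, so the synthesized noisy speech matches a known noise category. The sketch below illustrates that idea only; the class and layer sizes are invented for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sketch: a clean-to-noisy generator made "noise informed"
# by concatenating a learned noise-type embedding to every input frame.
class NoiseInformedGenerator(nn.Module):
    def __init__(self, feat_dim=257, n_noise_types=10, emb_dim=16):
        super().__init__()
        self.noise_emb = nn.Embedding(n_noise_types, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, clean_frames, noise_id):
        # clean_frames: (batch, frames, feat_dim); noise_id: (batch,)
        emb = self.noise_emb(noise_id)                         # (batch, emb_dim)
        emb = emb[:, None, :].expand(-1, clean_frames.size(1), -1)
        return self.net(torch.cat([clean_frames, emb], dim=-1))
```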
no code implementations • 19 Oct 2021 • Yun-Ju Chan, Chiang-Jen Peng, Syu-Siang Wang, Hsin-Min Wang, Yu Tsao, Tai-Shih Chi
Numerous voice conversion (VC) techniques have been proposed for the conversion of voices among different speakers.
1 code implementation • 7 Jan 2021 • Chiang-Jen Peng, Yun-Ju Chan, Cheng Yu, Syu-Siang Wang, Yu Tsao, Tai-Shih Chi
In this study, we propose an attention-based multi-task learning (ATM) approach that integrates MTL and an attention-weighting mechanism to realize a multi-model learning structure that simultaneously performs speech enhancement (SE) and speaker identification (SI).
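A rough sketch of such a structure, assuming a shared encoder whose frame-level attention weights pool features for the SI branch while the SE branch estimates a spectral mask (the layer choices are illustrative, not the paper's exact ATM network):

```python
import torch
import torch.nn as nn

# Illustrative shared-encoder MTL sketch: SE head predicts a per-frame
# mask; SI head classifies the speaker from attention-pooled features.
class SharedAttentionMTL(nn.Module):
    def __init__(self, feat_dim=257, hidden=256, n_speakers=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.se_head = nn.Linear(hidden, feat_dim)
        self.si_head = nn.Linear(hidden, n_speakers)

    def forward(self, noisy_spec):                     # (B, T, feat_dim)
        h, _ = self.encoder(noisy_spec)                # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)         # attention over frames
        mask = torch.sigmoid(self.se_head(h))          # SE: spectral mask
        speaker_logits = self.si_head((w * h).sum(1))  # SI: pooled features
        return mask * noisy_spec, speaker_logits
```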
1 code implementation • 21 Aug 2020 • Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang, Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC). These allow CITISEN to serve as a platform for deploying and evaluating SE models and for flexibly extending them to various noise environments and users.
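The BNC function can be pictured as a two-step pipeline: enhance first, then remix with a chosen noise at a target SNR. The sketch below assumes a hypothetical SE callable `enhance`; it is a schematic of the concept, not the app's implementation.

```python
import numpy as np

# Sketch of background noise conversion: strip the original background
# with an SE model, then add a chosen target noise at a given SNR.
def convert_background(noisy, target_noise, enhance, snr_db=5.0):
    speech = enhance(noisy)                          # clean-speech estimate
    noise = target_noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```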
no code implementations • 18 Jun 2020 • Szu-Wei Fu, Chien-Feng Liao, Tsun-An Hsieh, Kuo-Hsuan Hung, Syu-Siang Wang, Cheng Yu, Heng-Cheng Kuo, Ryandhimas E. Zezario, You-Jin Li, Shang-Yi Chuang, Yen-Ju Lu, Yu Tsao
The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications.
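As a concrete illustration of applying a Transformer to SE, here is a minimal spectrogram-masking encoder; the dimensions and masking formulation are assumptions for the sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Minimal Transformer-based SE sketch: encode noisy magnitude frames,
# predict a sigmoid mask, and apply it to the input spectrogram.
class TransformerSE(nn.Module):
    def __init__(self, feat_dim=257, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj_out = nn.Linear(d_model, feat_dim)

    def forward(self, noisy_spec):                    # (B, T, feat_dim)
        h = self.encoder(self.proj_in(noisy_spec))
        return torch.sigmoid(self.proj_out(h)) * noisy_spec
```

Unlike an RNN, the self-attention layers let every frame attend to the whole utterance in one step, which is the advantage the abstract alludes to.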
no code implementations • 24 May 2020 • You-Jin Li, Syu-Siang Wang, Yu Tsao, Borching Su
For speech-related applications in IoT environments, identifying effective methods to handle interference noise and to compress the transmitted data is essential for achieving high-quality services.
no code implementations • 22 Nov 2019 • Cheng Yu, Kuo-Hsuan Hung, Syu-Siang Wang, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung
Previous studies have shown that integrating video signals, as a complementary modality, can improve the performance of speech enhancement (SE).
no code implementations • 19 Nov 2019 • Syu-Siang Wang, Yu-You Liang, Jeih-weih Hung, Yu Tsao, Hsin-Min Wang, Shih-Hau Fang
Speech-related applications deliver inferior performance in complex noise environments.
1 code implementation • 8 Nov 2018 • Shih-kuang Lee, Syu-Siang Wang, Yu Tsao, Jeih-weih Hung
The presented DWT-based SE method, with various scaling factors applied to the detail part, is evaluated on a subset of the Aurora-2 database, and the PESQ metric is used to assess the quality of the processed speech signals.
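The core operation described here is simple enough to sketch directly with PyWavelets: decompose, scale the detail coefficients (where much of the noise energy tends to sit), and reconstruct. The wavelet choice and scaling factor below are illustrative, not the paper's tuned values.

```python
import numpy as np
import pywt

# Sketch of DWT-based SE: single-level decomposition, detail scaling,
# inverse transform.
def dwt_scale_enhance(noisy, wavelet="db4", detail_scale=0.5):
    approx, detail = pywt.dwt(noisy, wavelet)
    return pywt.idwt(approx, detail_scale * detail, wavelet)

fs = 16000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)
enhanced = dwt_scale_enhance(noisy)
```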
no code implementations • 1 Sep 2017 • Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang
More precisely, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed by individual CNNs and then fused in a joint network that generates enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer.
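A schematic sketch of that layout follows; the layer shapes and pooling choices are assumptions made for brevity and do not reproduce the paper's exact AVDCNN configuration.

```python
import torch
import torch.nn as nn

# Schematic AVDCNN-style sketch: separate audio/visual CNN encoders,
# a fused joint layer, and two decoders (speech primary, image secondary).
class AVDCNNSketch(nn.Module):
    def __init__(self, audio_dim=257, img_ch=1):
        super().__init__()
        self.audio_enc = nn.Sequential(
            nn.Conv1d(audio_dim, 256, 3, padding=1), nn.ReLU())
        self.visual_enc = nn.Sequential(
            nn.Conv2d(img_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())           # -> (B, 16)
        self.fusion = nn.Linear(256 + 16, 256)
        self.speech_dec = nn.Linear(256, audio_dim)          # primary task
        self.image_dec = nn.Linear(256, img_ch * 32 * 32)    # secondary task

    def forward(self, audio, image):
        a = self.audio_enc(audio.transpose(1, 2)).mean(dim=2)  # (B, 256)
        v = self.visual_enc(image)                             # (B, 16)
        z = torch.relu(self.fusion(torch.cat([a, v], dim=1)))
        return self.speech_dec(z), self.image_dec(z)
```

Training both decoders jointly forces the fused representation to carry information from both modalities, which is the multi-task rationale the abstract describes.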