no code implementations • 3 Sep 2024 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +3
1 code implementation • 16 Jun 2024 • Wenhao Yang, Jianguo Wei, Wenhuan Lu, Lei LI, Xugang Lu
To address this issue, we present a Channel Robust Speaker Learning (CRSL) framework that enhances the robustness of the current speaker verification pipeline, considering data source, data augmentation, and the efficiency of model transfer processes.
no code implementations • 8 Feb 2024 • Cho-Yuan Lee, Kuan-Chen Wang, Kai-Chun Liu, Yu-Te Wang, Xugang Lu, Ping-Cheng Yeh, Yu Tsao
In practical scenarios involving the measurement of surface electromyography (sEMG) in muscles, particularly those areas near the heart, one of the primary sources of contamination is the presence of electrocardiogram (ECG) signals.
no code implementations • 18 Dec 2023 • Peng Shen, Xugang Lu, Hisashi Kawai
Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization to be addressed.
no code implementations • 20 Oct 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Our previous study discovered that completely aligning the distributions between the source and target domains can introduce negative transfer, where samples from irrelevant classes in the source domain are mapped to a different class in the target domain during distribution alignment.
no code implementations • 28 Sep 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task.
Automatic Speech Recognition (ASR) +3
no code implementations • 24 Sep 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Since the PLM is built from text while the acoustic model is trained on speech, a cross-modal alignment is required to transfer the context-dependent linguistic knowledge from the PLM to acoustic encoding (see the sketch below).
Automatic Speech Recognition (ASR) +4
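A minimal sketch of the cross-modal alignment idea in the entry above, assuming acoustic frames have already been mapped to PLM token positions (that alignment step is the hard part and is abstracted away here). The module name, dimensions, and the MSE objective are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    def __init__(self, acoustic_dim=256, plm_dim=768):
        super().__init__()
        # Project acoustic frames into the PLM embedding space.
        self.proj = nn.Linear(acoustic_dim, plm_dim)

    def forward(self, acoustic_feats, plm_embeds):
        # acoustic_feats: (batch, tokens, acoustic_dim)
        # plm_embeds: (batch, tokens, plm_dim) from a frozen PLM
        projected = self.proj(acoustic_feats)
        return nn.functional.mse_loss(projected, plm_embeds)

# Toy usage with random stand-in features:
loss = CrossModalAligner()(torch.randn(2, 10, 256), torch.randn(2, 10, 768))
```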
no code implementations • 29 Jul 2022 • Peng Shen, Xugang Lu, Hisashi Kawai
For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, compared to character-based modeling units, pronunciation-based modeling units can improve the sharing of modeling units in model training but suffer from homophone problems.
Automatic Speech Recognition (ASR) +1
no code implementations • 8 Apr 2022 • Peng Shen, Xugang Lu, Hisashi Kawai
The acoustic and linguistic features are important cues for the spoken language identification (LID) task.
1 code implementation • 31 Mar 2022 • Rong Chao, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao
Specifically, the contrast of target features is stretched based on perceptual importance, thereby improving the overall SE performance (see the sketch below).
Ranked #8 on Speech Enhancement on VoiceBank + DEMAND
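A minimal sketch of contrast stretching on spectral target features: band-dependent exponents expand perceptually important bands of a compressed spectrogram. The actual method derives its band weights from perceptual importance; the values below are arbitrary placeholders.

```python
import numpy as np

def contrast_stretch(log_power_spec, gamma):
    # log_power_spec: (freq_bins, frames), non-negative compressed spectrum
    # gamma: (freq_bins, 1) per-band stretching exponents (>1 stretches)
    return log_power_spec ** gamma

spec = np.abs(np.random.randn(257, 100))      # stand-in spectrogram
gamma = np.linspace(1.0, 1.4, 257)[:, None]   # placeholder band weights
stretched = contrast_stretch(spec, gamma)
```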
no code implementations • 31 Mar 2022 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
To reduce the domain discrepancy and improve the performance of a cross-domain spoken language identification (SLID) system, we have proposed a joint distribution alignment (JDA) model based on optimal transport (OT) as an unsupervised domain adaptation (UDA) method.
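A sketch of OT-based distribution alignment with the POT library (https://pythonot.github.io). This shows plain entropic Sinkhorn transport between source and target embeddings; the paper's joint distribution alignment additionally couples label and prediction information, which is omitted here.

```python
import numpy as np
import ot  # pip install pot

Xs = np.random.randn(100, 64)              # source-domain embeddings
Xt = np.random.randn(120, 64)              # target-domain embeddings
a, b = ot.unif(len(Xs)), ot.unif(len(Xt))  # uniform sample weights
M = ot.dist(Xs, Xt)                        # pairwise squared-Euclidean cost
G = ot.sinkhorn(a, b, M, reg=1.0)          # entropic OT coupling
transport_cost = np.sum(G * M)             # alignment loss to minimize
```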
no code implementations • 17 Mar 2022 • Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin, Junhai Xu, Lin Zhang, Yantao Ji, Jianwu Dang
Therefore, in most current state-of-the-art network architectures, only a few branches corresponding to a limited number of temporal scales can be designed for speaker embeddings.
no code implementations • 24 Jan 2022 • Tassadaq Hussain, Wei-Chien Wang, Mandar Gogate, Kia Dashtipour, Yu Tsao, Xugang Lu, Adeel Ahsan, Amir Hussain
To address this problem, we propose to integrate a novel temporal attentive-pooling (TAP) mechanism into a conventional convolutional recurrent neural network, termed TAP-CRNN.
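A minimal temporal attentive-pooling layer in PyTorch: frame-level features are weighted by learned attention scores and summed over time. Layer sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TemporalAttentivePooling(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.Tanh(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, x):
        # x: (batch, frames, feat_dim)
        weights = torch.softmax(self.attention(x), dim=1)  # (B, T, 1)
        return (weights * x).sum(dim=1)                    # (B, feat_dim)

pooled = TemporalAttentivePooling()(torch.randn(4, 50, 128))
```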
1 code implementation • NeurIPS 2021 • Hsin-Yi Lin, Huan-Hsin Tseng, Xugang Lu, Yu Tsao
This paper presents a novel discriminator-constrained optimal transport network (DOTN) that performs unsupervised domain adaptation for speech enhancement (SE), which is an essential regression task in speech processing.
1 code implementation • 26 Oct 2021 • Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Lin Zhang, Yantao Ji, Junhai Xu, Xugang Lu
Automatic speaker verification (ASV) systems, which determine whether two speech utterances are from the same speaker, mainly focus on verification accuracy while ignoring inference speed.
3 code implementations • 8 Apr 2021 • Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao
The discrepancy between the cost function used for training a speech enhancement model and human auditory perception usually makes the quality of enhanced speech unsatisfactory (see the sketch below).
Ranked #16 on Speech Enhancement on VoiceBank + DEMAND
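A generic sketch of the metric-surrogate idea behind this line of work: a discriminator is trained to mimic a non-differentiable perceptual metric (e.g., PESQ scaled to [0, 1]), and the enhancement model is then optimized to drive the predicted score toward the maximum. Network shapes and pooling are illustrative assumptions; the actual MetricGAN+ training recipe differs in detail.

```python
import torch
import torch.nn as nn

metric_net = nn.Sequential(          # predicts a normalized quality score
    nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid()
)

def discriminator_loss(enhanced_spec, true_metric_score):
    # enhanced_spec: (batch, frames, 257); true_metric_score: (batch, 1)
    # Teach the surrogate to match the real metric measured offline.
    pred = metric_net(enhanced_spec.mean(dim=1))  # pool over frames
    return nn.functional.mse_loss(pred, true_metric_score)

def generator_loss(enhanced_spec):
    # Push the predicted metric score toward its maximum value of 1.
    pred = metric_net(enhanced_spec.mean(dim=1))
    return nn.functional.mse_loss(pred, torch.ones_like(pred))
```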
no code implementations • 7 Apr 2021 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
However, in most of the discriminative training for SiamNN, only the distribution of pairwise sample distances is considered, and the additional discriminative information in the joint distribution of samples is ignored.
no code implementations • 7 Feb 2021 • Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman, Wen-Chin Huang, Xugang Lu, Yu Tsao
Synthesized speech from articulatory movements can have real-world use for patients with vocal cord disorders, situations requiring silent speech, or in high-noise environments.
no code implementations • 9 Jan 2021 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
By initializing the two-branch neural network with the generatively learned model parameters of the JB model, we train the model parameters with the pairwise samples as a binary discrimination task.
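A structural sketch of the two-branch (Siamese) setup trained as binary pair discrimination, as described above. Initializing the branches from the generatively learned joint Bayesian (JB) parameters is paper-specific and only noted in a comment; here the branch is a plain linear layer and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class SiamNN(nn.Module):
    def __init__(self, feat_dim=256, embed_dim=128):
        super().__init__()
        # In the paper, these weights would be initialized from the
        # JB model's generatively learned parameters.
        self.branch = nn.Linear(feat_dim, embed_dim)
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, x1, x2):
        e1, e2 = self.branch(x1), self.branch(x2)
        return self.scorer(e1 * e2).squeeze(-1)  # pair similarity logit

model = SiamNN()
logits = model(torch.randn(8, 256), torch.randn(8, 256))
labels = torch.randint(0, 2, (8,)).float()      # same/different speaker
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
```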
no code implementations • 24 Dec 2020 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
By jointly minimizing the classification loss on the training data set and the adaptation loss on both the training and testing data sets, the statistical distribution difference between the training and testing domains is reduced.
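A sketch of combining a supervised classification loss with an unsupervised adaptation loss over training- and testing-domain features. A simple linear-kernel maximum mean discrepancy (MMD) stands in for whichever adaptation loss the paper uses, and the trade-off weight is an assumption.

```python
import torch
import torch.nn.functional as F

def mmd_linear(source_feats, target_feats):
    # Linear-kernel MMD: squared distance between domain mean embeddings.
    delta = source_feats.mean(dim=0) - target_feats.mean(dim=0)
    return (delta * delta).sum()

def total_loss(logits, labels, source_feats, target_feats, lambda_adapt=0.1):
    return F.cross_entropy(logits, labels) + \
        lambda_adapt * mmd_linear(source_feats, target_feats)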
no code implementations • 3 Nov 2020 • Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman, Xugang Lu, Yu Tsao
Although deep learning algorithms are widely used for improving speech enhancement (SE) performance, the performance remains limited under highly challenging conditions, such as unseen noise or noise signals having low signal-to-noise ratios (SNRs).
1 code implementation • 28 Oct 2020 • Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao
Speech enhancement (SE) aims to improve speech quality and intelligibility, which are both related to a smooth transition in speech segments that may carry linguistic information, e.g., phones and syllables (see the sketch below).
Ranked #16 on Speech Enhancement on VoiceBank + DEMAND
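Illustration only: one generic way to encourage smooth transitions across speech segments is a first-order temporal-difference penalty on the enhanced spectrogram. This is not the specific loss proposed in the paper above, just a minimal example of a smoothness regularizer.

```python
import torch

def temporal_smoothness(spec):
    # spec: (batch, frames, freq_bins) enhanced spectral features
    diff = spec[:, 1:, :] - spec[:, :-1, :]
    return diff.abs().mean()
```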
no code implementations • 13 Aug 2020 • Yen-Ju Lu, Chien-Feng Liao, Xugang Lu, Jeih-weih Hung, Yu Tsao
In noisy conditions, knowing the speech content helps listeners more effectively suppress background noise components and retrieve pure speech signals.
1 code implementation • 6 Apr 2020 • Tsun-An Hsieh, Hsin-Min Wang, Xugang Lu, Yu Tsao
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU).
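A structural sketch of the convolutional-recurrent idea in WaveCRN: a 1-D convolution captures local waveform structure, and a recurrent layer models its temporal evolution. The paper stacks simple recurrent units (SRUs, from the `sru` package); nn.GRU is used here as a stand-in, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyWaveCRN(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel_size=16, stride=8)
        self.rnn = nn.GRU(channels, channels, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.decoder = nn.ConvTranspose1d(2 * channels, 1,
                                          kernel_size=16, stride=8)

    def forward(self, wav):
        # wav: (batch, 1, samples)
        feats = self.encoder(wav)                  # (B, C, frames)
        out, _ = self.rnn(feats.transpose(1, 2))   # (B, frames, 2C)
        return self.decoder(out.transpose(1, 2))   # (B, 1, ~samples)

enhanced = TinyWaveCRN()(torch.randn(2, 1, 16000))
```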
no code implementations • COLING 2020 • Haipeng Sun, Rui Wang, Kehai Chen, Xugang Lu, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
Unsupervised neural machine translation (UNMT) has recently attracted great interest in the machine translation community.
1 code implementation • 6 Jan 2020 • Cheng Yu, Ryandhimas E. Zezario, Jonathan Sherman, Yi-Yen Hsieh, Xugang Lu, Hsin-Min Wang, Yu Tsao
The DSDT is built based on prior knowledge of speech and noise conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along a branch of the DSDT.
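A minimal sketch of a multi-branched encoder: each branch specializes in one region of the tree, and their outputs are merged by a shared decoder into a clean-speech estimate. Branch count and layer sizes are illustrative assumptions, and the DSDT routing itself is omitted.

```python
import torch
import torch.nn as nn

class MultiBranchEncoder(nn.Module):
    def __init__(self, feat_dim=257, hidden=128, num_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
            for _ in range(num_branches)
        )
        self.decoder = nn.Linear(num_branches * hidden, feat_dim)

    def forward(self, noisy):
        # noisy: (batch, frames, feat_dim) noisy spectral features
        merged = torch.cat([b(noisy) for b in self.branches], dim=-1)
        return self.decoder(merged)  # clean-speech estimate
```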
no code implementations • 27 Dec 2019 • Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai
However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model the short-range dependency) are smoothed out in the final representation.
no code implementations • 30 Apr 2019 • Chien-Feng Liao, Yu Tsao, Xugang Lu, Hisashi Kawai
In this study, the symbolic sequences for acoustic signals are obtained as discrete representations with a Vector Quantized Variational Autoencoder algorithm.
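A minimal vector-quantization step of the kind used in a VQ-VAE: each continuous frame embedding is replaced by its nearest codebook vector, with a straight-through gradient so the encoder remains trainable. Codebook size and dimension are assumptions.

```python
import torch

codebook = torch.randn(512, 64)                 # (codes, dim)

def quantize(z):
    # z: (batch, frames, 64) continuous encoder outputs
    dists = torch.cdist(z, codebook.expand(z.size(0), -1, -1))
    indices = dists.argmin(dim=-1)              # discrete symbol sequence
    z_q = codebook[indices]                     # (batch, frames, 64)
    return z + (z_q - z).detach(), indices      # straight-through gradient

quantized, symbols = quantize(torch.randn(2, 100, 64))
```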
no code implementations • 12 Sep 2017 • Szu-Wei Fu, Tao-Wei Wang, Yu Tsao, Xugang Lu, Hisashi Kawai
For example, in measuring speech intelligibility, most evaluation metrics are based on the short-time objective intelligibility (STOI) measure, while the frame-based minimum mean square error (MMSE) between estimated and clean speech is widely used in optimizing the model (see the sketch below).
Automatic Speech Recognition (ASR) +3
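The mismatch described above can be made concrete with the pystoi package: STOI is computed on whole utterances and is not differentiable, whereas the sample-wise squared error is what such models typically optimize. The signals below are random stand-ins.

```python
import numpy as np
from pystoi import stoi  # pip install pystoi

fs = 16000
clean = np.random.randn(fs * 2)                 # stand-in clean utterance
enhanced = clean + 0.1 * np.random.randn(fs * 2)

score = stoi(clean, enhanced, fs, extended=False)  # roughly in [0, 1]
mse = np.mean((clean - enhanced) ** 2)             # optimization target
print(f"STOI={score:.3f}, MSE={mse:.4f}")
```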
no code implementations • 27 Apr 2017 • Szu-Wei Fu, Ting-yao Hu, Yu Tsao, Xugang Lu
This paper aims to address two issues in current speech enhancement methods: 1) the difficulty of phase estimation; 2) that a single objective function cannot consider multiple metrics simultaneously.
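An illustration of a composite objective that touches both issues: operating on raw waveforms sidesteps explicit phase estimation, and combining a time-domain term with an STFT-magnitude term considers more than one criterion at once. Weights and STFT settings are assumptions, not the paper's configuration.

```python
import torch

def multi_term_loss(est_wav, ref_wav, alpha=0.5, n_fft=512):
    # est_wav, ref_wav: (batch, samples) waveforms
    time_loss = (est_wav - ref_wav).abs().mean()
    win = torch.hann_window(n_fft, device=est_wav.device)
    est_mag = torch.stft(est_wav, n_fft, window=win,
                         return_complex=True).abs()
    ref_mag = torch.stft(ref_wav, n_fft, window=win,
                         return_complex=True).abs()
    spec_loss = (est_mag - ref_mag).abs().mean()
    return alpha * time_loss + (1 - alpha) * spec_loss
```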
no code implementations • 7 Mar 2017 • Szu-Wei Fu, Yu Tsao, Xugang Lu, Hisashi Kawai
Because the fully connected layers involved in deep neural networks (DNNs) and convolutional neural networks (CNNs) may not accurately characterize the local information of speech signals, particularly their high-frequency components, we employed fully convolutional layers to model the waveform.
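A minimal fully convolutional waveform model in the spirit of the entry above: every layer is a convolution, so no fully connected layer discards the local (and high-frequency) structure of the signal. Depth, widths, and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=55, padding=27), nn.LeakyReLU(),
    nn.Conv1d(32, 32, kernel_size=55, padding=27), nn.LeakyReLU(),
    nn.Conv1d(32, 1, kernel_size=55, padding=27), nn.Tanh(),
)

noisy = torch.randn(4, 1, 16000)   # (batch, 1, samples)
enhanced = fcn(noisy)              # same length as the input
```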