no code implementations • ROCLING 2021 • Cheng-Chung Fan, Chia-Chih Kuo, Shang-Bao Luo, Pei-Jun Liao, Kuang-Yu Chang, Chiao-Wei Hsu, Meng-Tse Wu, Shih-Hong Tsai, Tzu-Man Wu, Aleksandra Smolka, Chao-Chun Liang, Hsin-Min Wang, Kuan-Yu Chen, Yu Tsao, Keh-Yih Su
Only a few of them adopt multiple answer-generation modules to provide different solving mechanisms; however, they either lack an aggregation mechanism to merge the answers from the various modules or are too complicated to implement with neural networks.
no code implementations • ROCLING 2022 • Aleksandra Smolka, Hsin-Min Wang, Jason S. Chang, Keh-Yih Su
This paper studies whether the character trigram is still a suitable similarity measure for the task of aligning sentences in a paragraph paraphrasing corpus.
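As a rough illustration of the measure discussed in this entry, here is one common definition of character-trigram similarity, the Dice coefficient over trigram sets; this is a generic sketch, not necessarily the paper's exact formulation:

```python
# Minimal sketch: character-trigram Dice similarity between two sentences.
def char_trigrams(s: str) -> set:
    s = f"  {s.lower()}  "  # pad so edge characters also form trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))  # Dice coefficient

print(trigram_similarity("The cat sat on the mat.", "A cat sat on a mat."))
```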
no code implementations • ROCLING 2022 • Shang-Bao Luo, Cheng-Chung Fan, Kuan-Yu Chen, Yu Tsao, Hsin-Min Wang, Keh-Yih Su
This paper also provides a baseline system and shows its performance on this dataset.
no code implementations • ROCLING 2021 • Shih-hung Tsai, Chao-Chun Liang, Hsin-Min Wang, Keh-Yih Su
We construct two math datasets and show that our algorithms can effectively retrieve the knowledge required for problem solving.
no code implementations • 15 Feb 2025 • Shafique Ahmed, Ryandhimas E. Zezario, Hui-Guan Yuan, Amir Hussain, Hsin-Min Wang, Wei-Ho Chung, Yu Tsao
To address this challenge, we present NeuroAMP, a novel deep neural network designed for end-to-end, personalized amplification in hearing aids.
no code implementations • 6 Dec 2024 • Jie Lin, I Chiu, Kuan-Chen Wang, Kai-Chun Liu, Hsin-Min Wang, Ping-Cheng Yeh, Yu Tsao
Electrocardiogram (ECG) signals play a crucial role in diagnosing cardiovascular diseases.
no code implementations • 14 Nov 2024 • Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang
Extensive experiments are conducted on videos from a benchmark multimodal deepfake dataset to evaluate the detection performance of ChatGPT and compare it with the detection capabilities of state-of-the-art multimodal forensic models and humans.
no code implementations • 12 Nov 2024 • Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang
Deep Learning has been successfully applied in diverse fields, and its impact on deepfake detection is no exception.
no code implementations • 22 Sep 2024 • Wenze Ren, Kuo-Hsuan Hung, Rong Chao, You-Jin Li, Hsin-Min Wang, Yu Tsao
This paper addresses the prevalent issue of incorrect speech output in audio-visual speech enhancement (AVSE) systems, which is often caused by poor video quality and mismatched training and test data.
1 code implementation • 19 Sep 2024 • Chien-Chun Wang, Li-Wei Chen, Cheng-Kang Chou, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang
To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training.
Automatic Speech Recognition (ASR) • +3 more tasks
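A minimal sketch of the general channel-simulation idea behind the entry above: convolving clean speech with an impulse response and adding noise at a target SNR. The paper's actual method is more elaborate; the signals here are random stand-ins:

```python
# Illustrative sketch only: simulate channel effects on clean speech.
import numpy as np

def simulate_channel(clean: np.ndarray, rir: np.ndarray, noise: np.ndarray,
                     snr_db: float) -> np.ndarray:
    reverbed = np.convolve(clean, rir)[: len(clean)]   # channel/room effect
    noise = noise[: len(reverbed)]
    # scale the noise to reach the desired signal-to-noise ratio
    sig_pow = np.mean(reverbed ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverbed + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz speech
rir = rng.standard_normal(512) * np.exp(-np.arange(512) / 100)  # toy decaying IR
noisy = simulate_channel(clean, rir, rng.standard_normal(16000), snr_db=10.0)
```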
no code implementations • 16 Sep 2024 • Ryandhimas E. Zezario, Sabato M. Siniscalchi, Hsin-Min Wang, Yu Tsao
We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlation with human-based quality and intelligibility assessments and the character error rate (CER) of automatic speech recognition.
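For reference, the character error rate (CER) mentioned above is the Levenshtein distance between the reference and hypothesis strings divided by the reference length; a minimal implementation:

```python
# Minimal sketch of character error rate (CER) via row-based edit distance.
def cer(ref: str, hyp: str) -> float:
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return d[n] / max(m, 1)

print(cer("speech assessment", "speech asesment"))  # 2 edits / 17 chars
```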
no code implementations • 16 Sep 2024 • Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao
In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction.
no code implementations • 13 Sep 2024 • Yao-Fei Cheng, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang
This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq.
Automatic Speech Recognition (ASR) • +5 more tasks
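Purely as an illustration of one standard ASR augmentation (the paper's actual pipeline may differ), here is SpecAugment-style time and frequency masking on a log-mel spectrogram:

```python
# Hedged illustration: SpecAugment-style masking on a (freq, time) spectrogram.
import numpy as np

def spec_augment(mel: np.ndarray, f_mask=8, t_mask=20, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    mel = mel.copy()                       # (freq_bins, time_frames)
    f0 = rng.integers(0, mel.shape[0] - f_mask)
    t0 = rng.integers(0, mel.shape[1] - t_mask)
    mel[f0:f0 + f_mask, :] = 0.0           # frequency mask
    mel[:, t0:t0 + t_mask] = 0.0           # time mask
    return mel

mel = np.random.default_rng(0).standard_normal((80, 300))  # toy log-mel
print(spec_augment(mel).shape)
```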
1 code implementation • 3 Sep 2024 • Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions.
no code implementations • 12 Jun 2024 • Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang
In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity.
no code implementations • 7 May 2024 • Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang
Can we humans correctly perceive the authenticity of the content of the videos we watch?
1 code implementation • 10 Feb 2024 • Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-Yi Lee, Hsin-Min Wang, David Harwath
Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.
1 code implementation • 2 Jan 2024 • Dyah A. M. G. Wisnu, Stefano Rini, Ryandhimas E. Zezario, Hsin-Min Wang, Yu Tsao
Experimental results demonstrate HAAQI-Net's effectiveness, achieving a Linear Correlation Coefficient (LCC) of 0.9368, a Spearman's Rank Correlation Coefficient (SRCC) of 0.9486, and a Mean Squared Error (MSE) of 0.0064, while reducing inference time from 62.52 to 2.54 seconds.
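These are standard evaluation metrics; a minimal sketch of how LCC, SRCC, and MSE are typically computed between predicted and ground-truth scores (illustrative values, not the paper's data):

```python
# Minimal sketch: LCC, SRCC, and MSE between predictions and ground truth.
import numpy as np
from scipy.stats import pearsonr, spearmanr

y_true = np.array([0.81, 0.52, 0.93, 0.40, 0.67])   # ground-truth scores
y_pred = np.array([0.78, 0.55, 0.90, 0.45, 0.66])   # model predictions

lcc, _ = pearsonr(y_true, y_pred)    # linear correlation coefficient
srcc, _ = spearmanr(y_true, y_pred)  # Spearman's rank correlation
mse = np.mean((y_true - y_pred) ** 2)
print(f"LCC={lcc:.4f}  SRCC={srcc:.4f}  MSE={mse:.4f}")
```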
no code implementations • 28 Nov 2023 • Chi-Chang Lee, Hong-Wei Chen, Chu-Song Chen, Hsin-Min Wang, Tsung-Te Liu, Yu Tsao
The performance of speaker verification (SV) models may drop dramatically in noisy environments.
1 code implementation • 28 Nov 2023 • Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen
To our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems.
Automatic Speech Recognition (ASR) • +4 more tasks
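A rough PyTorch sketch of such a combined regression-plus-classification objective, with an L1 denoising term and a CTC ASR term; the loss choices and weighting here are assumptions, not the paper's exact formulation:

```python
# Hedged sketch: joint denoising (regression) + ASR (classification) loss.
import torch
import torch.nn.functional as F

def joint_loss(enhanced, clean, asr_logits, targets, target_lens, alpha=0.5):
    # regression objective: bring the enhanced waveform close to the clean one
    denoise_loss = F.l1_loss(enhanced, clean)
    # classification-style objective: CTC loss from an ASR head
    log_probs = asr_logits.log_softmax(dim=-1)          # (T, B, vocab)
    input_lens = torch.full((asr_logits.size(1),), asr_logits.size(0),
                            dtype=torch.long)
    asr_loss = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    return alpha * denoise_loss + (1 - alpha) * asr_loss

# toy usage with random tensors
enh, cln = torch.randn(2, 16000), torch.randn(2, 16000)
logits = torch.randn(50, 2, 30)                  # (time, batch, vocab)
targets = torch.randint(1, 30, (2, 10))
loss = joint_loss(enh, cln, logits, targets, torch.tensor([10, 10]))
```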
no code implementations • 15 Nov 2023 • Hsin-Tien Chiang, Szu-Wei Fu, Hsin-Min Wang, Yu Tsao, John H. L. Hansen
Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to out-of-domain (OOD) datasets.
no code implementations • 5 Nov 2023 • Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang
This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multi-modal video forgery detection.
Ranked #1 on DeepFake Detection on FakeAVCeleb (Accuracy (%) metric)
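A toy sketch of the underlying idea of scoring audio-visual inconsistency, here as one minus the cosine similarity between per-segment audio and visual embeddings; the embeddings are random stand-ins for the SSL features, and this is not the authors' actual scoring:

```python
# Hypothetical sketch: audio-visual inconsistency from segment embeddings.
import numpy as np

def inconsistency_score(audio_emb: np.ndarray, visual_emb: np.ndarray) -> float:
    # embeddings: (num_segments, dim); a higher score suggests forgery
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    cos = np.sum(a * v, axis=1)          # per-segment cosine similarity
    return float(np.mean(1.0 - cos))     # average inconsistency over segments

rng = np.random.default_rng(0)
print(inconsistency_score(rng.standard_normal((20, 256)),
                          rng.standard_normal((20, 256))))
```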
no code implementations • 19 Oct 2023 • Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang
For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset.
Ranked #2 on DeepFake Detection on FakeAVCeleb (Accuracy (%) metric)
no code implementations • 4 Oct 2023 • Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech.
1 code implementation • 22 Sep 2023 • Ryandhimas E. Zezario, Yu-Wen Chen, Szu-Wei Fu, Yu Tsao, Hsin-Min Wang, Chiou-Shann Fuh
This research introduces an enhanced version of the multi-objective speech assessment model, MOSA-Net+, by leveraging the acoustic features from Whisper, a large-scale weakly supervised model.
no code implementations • 20 Sep 2023 • Shafique Ahmed, Chia-Wei Chen, Wenze Ren, Chin-Jou Li, Ernie Chu, Jun-Cheng Chen, Amir Hussain, Hsin-Min Wang, Yu Tsao, Jen-Cheng Hou
Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems.
no code implementations • 18 Sep 2023 • Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Automated speech intelligibility assessment is pivotal for hearing aid (HA) development.
no code implementations • 18 Aug 2023 • Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
1 code implementation • 11 Dec 2022 • Yu-Wen Chen, Hsin-Min Wang, Yu Tsao
We converted the script into a speech corpus using two text-to-speech systems.
Automatic Speech Recognition (ASR) • +4 more tasks
1 code implementation • Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2022 • Ammarah Hashmi, Sahibzada Adil Shahzad, Wasim Ahmad, Chia Wen Lin, Yu Tsao, Hsin-Min Wang
The recent rapid revolution in Artificial Intelligence (AI) technology has enabled the creation of hyper-realistic deepfakes, and detecting deepfake videos (also known as AI-synthesized videos) has become a critical task.
Ranked #1 on Multimodal Forgery Detection on FakeAVCeleb (using extra training data)
1 code implementation • APSIPA ASC 2022 2022 • Sahibzada Adil Shahzad, Ammarah Hashmi, Sarwar Khan, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang
Deepfake technology has advanced considerably, but it is a double-edged sword for the community.
Ranked #3 on DeepFake Detection on FakeAVCeleb (Accuracy (%) metric)
2 code implementations • 27 Oct 2022 • Li-Wei Chen, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
The lack of clean speech is a practical challenge to the development of speech enhancement systems, which means that there is an inevitable mismatch between their training criterion and evaluation metric.
1 code implementation • 27 Oct 2022 • Fan-Lin Wang, Yao-Fei Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
In this study, inheriting the use of our previously constructed TAT-2mix corpus, we address the channel mismatch problem by proposing a channel-aware audio separation network (CasNet), a deep learning framework for end-to-end time-domain speech separation.
no code implementations • 21 Sep 2022 • Yin-Ping Cho, Yu Tsao, Hsin-Min Wang, Yi-Wen Liu
Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores.
no code implementations • 18 Jun 2022 • Chi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao
NASTAR uses a feedback mechanism to simulate adaptive training data via a noise extractor and a retrieval model.
1 code implementation • 9 Apr 2022 • Shih-kuang Lee, Yu Tsao, Hsin-Min Wang
This study investigated the cepstrogram properties and demonstrated their effectiveness as powerful countermeasures against replay attacks.
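For context, a cepstrogram can be computed as the per-frame real cepstrum, i.e., the inverse FFT of the log magnitude spectrum of each short-time frame; a minimal sketch under that assumed definition:

```python
# Minimal sketch of a cepstrogram: per-frame real cepstra of a waveform.
import numpy as np

def cepstrogram(x: np.ndarray, frame_len=512, hop=128) -> np.ndarray:
    frames = [x[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(x) - frame_len, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))   # magnitude spectra
    return np.fft.irfft(np.log(spec + 1e-10), axis=1)      # per-frame cepstra

x = np.random.default_rng(0).standard_normal(16000)  # stand-in for speech
print(cepstrogram(x).shape)  # (num_frames, frame_len)
```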
no code implementations • 7 Apr 2022 • Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention.
no code implementations • 7 Apr 2022 • Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
In this study, we propose a multi-branched speech intelligibility prediction model (MBI-Net), for predicting the subjective intelligibility scores of HA users.
no code implementations • 1 Apr 2022 • Chiang-Lin Tai, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
Children's speech recognition is indispensable but challenging due to the diversity of children's speech.
1 code implementation • 30 Mar 2022 • Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
However, domain mismatch between training/test situations due to factors such as speaker, content, channel, and environment remains a severe problem for speech separation.
no code implementations • 30 Mar 2022 • Yu-Huai Peng, Hung-Shin Lee, Pin-Tuan Huang, Hsin-Min Wang
In traditional speaker diarization systems, a well-trained speaker model is a key component to extract representations from consecutive and partially overlapping segments in a long speech session.
no code implementations • 28 Mar 2022 • Hung-Shin Lee, Yu Tsao, Shyh-Kang Jeng, Hsin-Min Wang
Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution over phone events.
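A toy sketch of one such phonotactic representation, here a normalized phone-bigram count vector; this is an illustrative choice, not the paper's exact model:

```python
# Illustrative sketch: a multinomial over phone-bigram events for an utterance.
from collections import Counter

def phonotactic_vector(phones: list[str]) -> dict[str, float]:
    bigrams = Counter(zip(phones, phones[1:]))
    total = sum(bigrams.values())
    return {f"{a}-{b}": c / total for (a, b), c in bigrams.items()}

# toy decoded phone sequence for a short utterance
print(phonotactic_vector(["sil", "h", "e", "l", "o", "sil"]))
```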
1 code implementation • 25 Mar 2022 • Hung-Shin Lee, Pin-Yuan Chen, Yao-Fei Cheng, Yu Tsao, Hsin-Min Wang
In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
Automatic Speech Recognition (ASR) • +3 more tasks
no code implementations • 25 Mar 2022 • Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang
For application to robust speech recognition, we further extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE.
no code implementations • 14 Feb 2022 • Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, Helen Meng
ADD 2022 is also the first challenge to propose the partially fake audio detection task.
no code implementations • 14 Feb 2022 • Kuan-Chen Wang, Kai-Chun Liu, Hsin-Min Wang, Yu Tsao
Multimodal learning has been proven to be an effective method to improve speech enhancement (SE) performance, especially in challenging situations such as low signal-to-noise ratios, speech noise, or unseen noise types.
no code implementations • 10 Nov 2021 • Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, Yu Tsao
Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations.
1 code implementation • 3 Nov 2021 • Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Experimental results show that, in perceptual evaluation of speech quality (PESQ) prediction, MOSA-Net improves the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) compared to Quality-Net, an existing single-task model for PESQ prediction; in short-time objective intelligibility (STOI) prediction, it improves the LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction.
no code implementations • 19 Oct 2021 • Yun-Ju Chan, Chiang-Jen Peng, Syu-Siang Wang, Hsin-Min Wang, Yu Tsao, Tai-Shih Chi
Numerous voice conversion (VC) techniques have been proposed for the conversion of voices among different speakers.
no code implementations • 8 Sep 2021 • Yi-Syuan Liou, Wen-Chin Huang, Ming-Chi Yen, Shu-Wei Tsai, Yu-Huai Peng, Tomoki Toda, Yu Tsao, Hsin-Min Wang
Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device.
1 code implementation • 20 Jul 2021 • Cheng-Hung Hu, Yu-Huai Peng, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang
Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention.
no code implementations • 14 Jun 2021 • Fan-Lin Wang, Yu-Huai Peng, Hung-Shin Lee, Hsin-Min Wang
DPFN is composed of two parts: the speaker module and the separation module.
no code implementations • 10 Jun 2021 • Yi-Chiao Wu, Cheng-Hung Hu, Hung-Shin Lee, Yu-Huai Peng, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda
Nowadays, neural vocoders can generate very high-fidelity speech when a large amount of training data is available.
no code implementations • 2 Jun 2021 • Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda
First, a powerful parallel sequence-to-sequence model converts the input dysarthric speech into the normal speech of a reference speaker as an intermediate product. Then, a nonparallel, frame-wise VC model realized with a variational autoencoder converts the speaker identity of the reference speech back to that of the patient, and is assumed to be capable of preserving the enhanced quality.
1 code implementation • ACL 2021 • Shih-hung Tsai, Chao-Chun Liang, Hsin-Min Wang, Keh-Yih Su
With the recent advancements in deep learning, neural solvers have gained promising results in solving math word problems.
1 code implementation • 1 May 2021 • Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang
In this study, we survey methods to improve ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer.
no code implementations • 7 Apr 2021 • Cheng-Hung Hu, Yi-Chiao Wu, Wen-Chin Huang, Yu-Huai Peng, Yu-Wen Chen, Pin-Jui Ku, Tomoki Toda, Yu Tsao, Hsin-Min Wang
The first track focuses on voice cloning from a relatively small set of 100 target utterances, while the second track focuses on voice cloning from only 5 target utterances.
no code implementations • 30 Jan 2021 • Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda
We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations.
Automatic Speech Recognition (ASR) • +3 more tasks
1 code implementation • 17 Dec 2020 • Ryandhimas E. Zezario, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Experimental results confirmed that the proposed ZMOS approach can achieve better performance in both seen and unseen noise types compared to the baseline systems and other model selection systems, which indicates the effectiveness of the proposed approach in providing robust SE performance.
1 code implementation • 9 Nov 2020 • Ryandhimas E. Zezario, Szu-Wei Fu, Chiou-Shann Fuh, Yu Tsao, Hsin-Min Wang
To overcome this limitation, we propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net.
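Non-intrusive models like STOI-Net are typically trained to predict scores produced by an intrusive metric; a sketch of computing STOI training labels with the third-party pystoi package (an assumption for illustration, not the authors' stated tooling):

```python
# Hedged sketch: intrusive STOI scores as training labels for a
# non-intrusive predictor. Requires: pip install pystoi
import numpy as np
from pystoi import stoi

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(3 * fs)              # stand-in for clean speech
noisy = clean + 0.5 * rng.standard_normal(3 * fs)

label = stoi(clean, noisy, fs, extended=False)   # scalar intelligibility score
print(f"STOI label: {label:.3f}")
```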
no code implementations • 6 Oct 2020 • Yu-Huai Peng, Cheng-Hung Hu, Alexander Kang, Hung-Shin Lee, Pin-Yuan Chen, Yu Tsao, Hsin-Min Wang
This paper describes the Academia Sinica systems for the two tasks of Voice Conversion Challenge 2020, namely voice conversion within the same language (Task 1) and cross-lingual voice conversion (Task 2).
1 code implementation • 30 Aug 2020 • Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao
Experimental results confirm that compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and can improve enhancement performance.
1 code implementation • 24 May 2020 • Shang-Yi Chuang, Yu Tsao, Chen-Chou Lo, Hsin-Min Wang
Previous studies have confirmed the effectiveness of incorporating visual information into speech enhancement (SE) systems.
1 code implementation • 24 May 2020 • Chi-Chang Lee, Yu-Chen Lin, Hsuan-Tien Lin, Hsin-Min Wang, Yu Tsao
The results verify that the SERIL model can effectively adjust itself to new noise environments while overcoming the catastrophic forgetting issue.
5 code implementations • 6 Apr 2020 • Tsun-An Hsieh, Hsin-Min Wang, Xugang Lu, Yu Tsao
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU).
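A rough PyTorch sketch of this locality-plus-recurrence design, with plain LSTMs standing in for the paper's simple recurrent units (SRUs); layer sizes are arbitrary assumptions:

```python
# Hedged sketch of the WaveCRN idea: Conv1d for local waveform structure,
# a stacked recurrent network for temporal dynamics, and a learned mask.
import torch
import torch.nn as nn

class WaveCRNSketch(nn.Module):
    def __init__(self, channels=64, kernel=32, stride=16):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride)       # locality
        self.rnn = nn.LSTM(channels, channels, num_layers=2,
                           batch_first=True, bidirectional=True)    # temporal
        self.mask = nn.Linear(2 * channels, channels)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride)

    def forward(self, wav):                 # wav: (batch, 1, samples)
        feat = self.encoder(wav)            # (B, C, T)
        h, _ = self.rnn(feat.transpose(1, 2))
        m = torch.sigmoid(self.mask(h)).transpose(1, 2)
        return self.decoder(feat * m)       # masked features back to waveform

out = WaveCRNSketch()(torch.randn(2, 1, 16000))
print(out.shape)  # (2, 1, 16000)
```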
1 code implementation • 22 Jan 2020 • Wen-Chin Huang, Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-Huai Peng, Yu Tsao, Hsin-Min Wang
In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech.
1 code implementation • 6 Jan 2020 • Cheng Yu, Ryandhimas E. Zezario, Jonathan Sherman, Yi-Yen Hsieh, Xugang Lu, Hsin-Min Wang, Yu Tsao
The DSDT is built based on a prior knowledge of speech and noisy conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along the branch in the DSDT.
no code implementations • 19 Nov 2019 • Syu-Siang Wang, Yu-You Liang, Jeih-weih Hung, Yu Tsao, Hsin-Min Wang, Shih-Hau Fang
Speech-related applications deliver inferior performance in complex noise environments.
no code implementations • 5 Nov 2019 • Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, Zhen-Hua Ling
Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques.
no code implementations • 26 Sep 2019 • Chang-Le Liu, Sze-Wei Fu, You-Jin Li, Jen-Wei Huang, Hsin-Min Wang, Yu Tsao
We also propose an extended version of SDFCN, termed the residual SDFCN (rSDFCN).
no code implementations • 26 Sep 2019 • Natalie Yu-Hsien Wang, Hsiao-Lan Sharon Wang, Tao-Wei Wang, Szu-Wei Fu, Xugang Lu, Yu Tsao, Hsin-Min Wang
Recently, a time-domain speech enhancement algorithm based on the fully convolutional neural networks (FCN) with a short-time objective intelligibility (STOI)-based objective function (termed FCN(S) in short) has received increasing attention due to its simple structure and effectiveness of restoring clean speech signals from noisy counterparts.
Denoising • Speech Enhancement • +1 more task
Sound
Audio and Speech Processing
1 code implementation • 26 Aug 2019 • Hsiao-Tzu Hung, Chung-Yang Wang, Yi-Hsuan Yang, Hsin-Min Wang
In this paper, we tackle the problem of transfer learning for Jazz automatic generation.
1 code implementation • 2 May 2019 • Wen-Chin Huang, Yi-Chiao Wu, Chen-Chou Lo, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang
Such hypothesis implies that during the conversion phase, the latent codes and the converted features in VAE based VC are in fact source F0 dependent.
7 code implementations • 17 Apr 2019 • Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang
In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.
no code implementations • 27 Nov 2018 • Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, Hsin-Min Wang
Conventional WaveNet vocoders are trained with natural acoustic features but conditioned on the converted features in the conversion stage for VC, and such a mismatch often causes significant quality and similarity degradation.
1 code implementation • 29 Aug 2018 • Wen-Chin Huang, Hsin-Te Hwang, Yu-Huai Peng, Yu Tsao, Hsin-Min Wang
An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational autoencoders (VAEs), to model the latent structure of speech in an unsupervised manner.
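A minimal PyTorch sketch of a VAE over per-frame spectral features with a speaker code at the decoder, the general shape of this line of work; dimensions and layers are arbitrary assumptions, not the paper's architecture:

```python
# Hedged sketch: frame-wise spectral VAE with a speaker embedding for decoding.
import torch
import torch.nn as nn

class SpectralVAE(nn.Module):
    def __init__(self, feat_dim=513, latent_dim=64, n_speakers=10):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, 256)
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.spk_emb = nn.Embedding(n_speakers, 32)  # speaker code for decoding
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 32, 256), nn.ReLU(),
            nn.Linear(256, feat_dim))

    def forward(self, x, spk_id):
        h = torch.relu(self.encoder(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        x_hat = self.decoder(torch.cat([z, self.spk_emb(spk_id)], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
        return x_hat, kl

# conversion idea: encode source frames, decode with the target speaker's code
vae = SpectralVAE()
x = torch.randn(100, 513)                   # 100 frames of spectral features
x_hat, kl = vae(x, torch.full((100,), 3))   # decode as speaker 3
```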
no code implementations • 16 Aug 2018 • Szu-Wei Fu, Yu Tsao, Hsin-Te Hwang, Hsin-Min Wang
The evaluation of utterance-level quality in Quality-Net is based on the frame-level assessment.
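A toy sketch of this frame-to-utterance idea: a model predicts a quality score per frame, and the utterance-level score is their average; this is a simplification for illustration, not Quality-Net's exact architecture:

```python
# Hedged sketch: frame-level quality scores averaged to an utterance score.
import torch
import torch.nn as nn

class FrameQualityNet(nn.Module):
    def __init__(self, feat_dim=257, hidden=100):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.frame_head = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                           # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        frame_scores = self.frame_head(h).squeeze(-1)   # (B, T)
        return frame_scores.mean(dim=1), frame_scores   # utterance, frame

utt, frames = FrameQualityNet()(torch.randn(2, 120, 257))
print(utt.shape, frames.shape)
```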
1 code implementation • 19 Jul 2018 • Chien-Feng Liao, Yu Tsao, Hung-Yi Lee, Hsin-Min Wang
The proposed noise adaptive SE system contains an encoder-decoder-based enhancement model and a domain discriminator model.
Sound
Audio and Speech Processing
no code implementations • 1 Sep 2017 • Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang
Precisely speaking, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer.
1 code implementation • 4 Apr 2017 • Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang
Building a voice conversion (VC) system from non-parallel speech corpora is challenging but highly valuable in real application scenarios.
no code implementations • COLING 2016 • Kuan-Yu Chen, Shih-Hung Liu, Berlin Chen, Hsin-Min Wang
The D-EV model not only inherits the advantages of the EV model but also can infer a more robust representation for a given spoken paragraph against imperfect speech recognition.
5 code implementations • 13 Oct 2016 • Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang
We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora.
no code implementations • 13 Oct 2016 • Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang
In this paper, we propose a dictionary update method for Nonnegative Matrix Factorization (NMF) with high dimensional data in a spectral conversion (SC) task.
no code implementations • 22 Jul 2016 • Kuan-Yu Chen, Shih-Hung Liu, Berlin Chen, Hsin-Min Wang, Hsin-Hsi Chen
Word embedding methods revolve around learning continuous distributed vector representations of words with neural networks, which can capture semantic and/or syntactic cues, and in turn be used to induce similarity measures among words, sentences and documents in context.
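A toy sketch of inducing a sentence-level similarity from word embeddings by comparing averaged word vectors; the embeddings below are random stand-ins, not trained vectors:

```python
# Illustrative sketch: sentence similarity via averaged word embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(50) for w in
         "the cat sat on mat a dog lay rug".split()}

def sent_vec(sent: str) -> np.ndarray:
    vecs = [vocab[w] for w in sent.split() if w in vocab]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(sent_vec("the cat sat on the mat"),
             sent_vec("a dog lay on a rug")))
```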
no code implementations • 20 Jan 2016 • Kuan-Yu Chen, Shih-Hung Liu, Berlin Chen, Hsin-Min Wang
Apart from MMR, there is a dearth of research concentrating on reducing redundancy or increasing diversity for the spoken document summarization task, as far as we are aware.
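For reference, the classic MMR criterion referred to above selects, at each step, the sentence that balances relevance to the document against similarity to the sentences already selected; a minimal sketch with a crude word-overlap similarity:

```python
# Minimal sketch of maximal marginal relevance (MMR) sentence selection.
def mmr_select(sentences, doc, sim, k=3, lam=0.7):
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * sim(s, doc) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# toy usage with a crude word-overlap similarity
def sim(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

doc = "speech summarization selects representative sentences"
sents = ["summarization selects sentences",
         "selects representative sentences",
         "the weather is nice"]
print(mmr_select(sents, doc, sim, k=2))
```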
no code implementations • 14 Jun 2015 • Kuan-Yu Chen, Shih-Hung Liu, Hsin-Min Wang, Berlin Chen, Hsin-Hsi Chen
Owing to the rapidly growing multimedia content available on the Internet, extractive spoken document summarization, with the purpose of automatically selecting a set of representative sentences from a spoken document to concisely express the most important theme of the document, has been an active area of research and experimentation.