no code implementations • 6 Feb 2024 • Liang-Hsuan Tseng, En-Pei Hu, Cheng-Han Chiang, Yuan Tseng, Hung-Yi Lee, Lin-shan Lee, Shao-Hua Sun
A word or phoneme in a speech signal corresponds to a segment of variable length with unknown boundaries, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data.
Automatic Speech Recognition (ASR) +2
no code implementations • 24 Jan 2024 • Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee
However, the real-world problem of open-domain SQA (openSQA), in which the machine must additionally first retrieve passages possibly containing the answer from a spoken archive, had never been considered.
1 code implementation • 9 Mar 2022 • Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee
We empirically show that DUAL yields results comparable to those obtained by cascading an ASR system with a text QA model, and that it is robust to real-world data.
Automatic Speech Recognition (ASR) +2
no code implementations • 4 Apr 2021 • Heng-Jui Chang, Hung-Yi Lee, Lin-shan Lee
We can collect new data describing the new environment and fine-tune the system, but this naturally leads to higher error rates on the earlier datasets, a phenomenon referred to as catastrophic forgetting.
Automatic Speech Recognition (ASR) +1
2 code implementations • 27 Oct 2020 • Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-Yi Lee, Lin-shan Lee
Any-to-any voice conversion aims to convert between arbitrary speakers, even those unseen during training; this is much more challenging than the one-to-one or many-to-many settings, but far more attractive in real-world scenarios.
1 code implementation • 18 May 2020 • Chien-yu Huang, Yist Y. Lin, Hung-Yi Lee, Lin-shan Lee
We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended.
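The paper's actual attack is more elaborate, but the core idea can be sketched as a single FGSM-style step against a speaker encoder, pushing the embedding of the perturbed utterance away from the speaker's own; `speaker_encoder` and the step size are assumptions for illustration, not the paper's method:

```python
import torch

def defend_utterance(speaker_encoder, waveform, epsilon=0.002):
    # One FGSM-style step: add a small L_inf-bounded perturbation that
    # moves the speaker embedding away from the original. A hedged
    # sketch of adversarial identity masking, not the paper's exact method.
    waveform = waveform.clone().detach().requires_grad_(True)
    with torch.no_grad():
        original = speaker_encoder(waveform)      # fixed reference embedding
    loss = -torch.nn.functional.cosine_similarity(
        speaker_encoder(waveform), original, dim=-1).mean()
    loss.backward()
    with torch.no_grad():
        adv = waveform + epsilon * waveform.grad.sign()
    return adv.detach()
```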
no code implementations • 5 May 2020 • Heng-Jui Chang, Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee
Whispering is an important mode of human speech, but no end-to-end recognition results for it had been reported, probably due to the scarcity of whispered speech data.
no code implementations • 28 Oct 2019 • Alexander H. Liu, Tao Tu, Hung-Yi Lee, Lin-shan Lee
In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to phoneme sequences of speech utterances.
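The paper's model is a full sequence autoencoder; the quantization step at its heart can be sketched as nearest-codeword lookup with a straight-through gradient (the codebook losses and temporal segmentation are omitted here):

```python
import torch

def quantize(frames, codebook):
    # frames: (T, D) encoder outputs; codebook: (V, D) phoneme-like codewords.
    dists = torch.cdist(frames, codebook)        # (T, V) pairwise L2 distances
    codes = dists.argmin(dim=-1)                 # discrete token per frame
    quantized = codebook[codes]                  # (T, D) nearest codewords
    # Straight-through estimator: forward uses the codewords, backward
    # passes gradients to the continuous encoder outputs.
    quantized = frames + (quantized - frames).detach()
    return quantized, codes
```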
1 code implementation • 28 Oct 2019 • Gene-Ping Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-Yi Lee, Lin-shan Lee
Permutation Invariant Training (PIT) has long been a stepping-stone method for training speech separation models, handling the label-ambiguity problem (see the sketch below).
Ranked #22 on Speech Separation on WSJ0-2mix
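In essence, PIT evaluates the training loss under every assignment of estimated sources to reference speakers and keeps the best one; a minimal sketch with an MSE loss (any per-source loss would do):

```python
import itertools
import torch

def pit_loss(estimates, targets):
    # estimates, targets: (batch, n_src, time). Try every speaker
    # permutation and keep the lowest loss per example, so training is
    # invariant to the arbitrary ordering of the reference sources.
    n_src = estimates.shape[1]
    losses = [((estimates - targets[:, list(p), :]) ** 2).mean(dim=(1, 2))
              for p in itertools.permutations(range(n_src))]
    return torch.stack(losses).min(dim=0).values.mean()
```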
1 code implementation • 28 Oct 2019 • Alexander H. Liu, Tzu-Wei Sung, Shun-Po Chuang, Hung-Yi Lee, Lin-shan Lee
This allows the decoder to consider semantic consistency during decoding by absorbing the information carried by the transformed decoder feature, which is learned to be close to the target word embedding (see the sketch below).
Automatic Speech Recognition (ASR) +1
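One way to read this is as an auxiliary loss pulling a projection of each decoder state toward the pretrained embedding of the target word; the projection, the cosine distance, and the unweighted sum below are illustrative assumptions, not the authors' exact objective:

```python
import torch.nn.functional as F

def embedding_regularized_loss(decoder_states, target_ids, word_emb, proj, ce_loss):
    # decoder_states: (B, T, D); target_ids: (B, T)
    # word_emb: frozen nn.Embedding holding pretrained word vectors
    # proj: nn.Linear mapping decoder states into the embedding space
    transformed = proj(decoder_states)                    # (B, T, E)
    targets = word_emb(target_ids)                        # (B, T, E)
    cos_dist = 1 - F.cosine_similarity(transformed, targets, dim=-1)
    return ce_loss + cos_dist.mean()                      # joint objective
```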
no code implementations • 25 Oct 2019 • Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee, Lin-shan Lee
In addition to its potential for end-to-end SQA, SpeechBERT can also be applied to many other spoken language understanding tasks, just as BERT is to many text processing tasks.
Ranked #3 on Spoken Language Understanding on Spoken-SQuAD
1 code implementation • 16 Apr 2019 • Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, Lin-shan Lee
Substantial effort has been reported on approaches over the spectrogram, well known as the standard time-frequency cross-domain representation for speech signals (computed as sketched below).
Ranked #24 on Speech Separation on WSJ0-2mix
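For context, the spectrogram in question is just the magnitude of a short-time Fourier transform; time-domain separation models like this one operate on the raw waveform and skip this step entirely (the window and hop sizes below are typical values, not the paper's):

```python
import torch

def magnitude_spectrogram(waveform, n_fft=512, hop=128):
    # Standard time-frequency representation: magnitude of the STFT.
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()                                 # (freq_bins, frames)
```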
no code implementations • 10 Apr 2019 • Yi-Chen Chen, Sung-Feng Huang, Hung-Yi Lee, Lin-shan Lee
However, we note that human babies start to learn language from the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" such knowledge to other words without hearing a large amount of data.
no code implementations • 8 Apr 2019 • Kuan-Yu Chen, Che-Ping Tsai, Da-Rong Liu, Hung-Yi Lee, Lin-shan Lee
Producing a large annotated speech corpus for training ASR systems remains difficult for the more than 95% of the world's languages that are low-resourced, but collecting a relatively large unlabeled dataset for such languages is more achievable.
no code implementations • 7 Nov 2018 • Sung-Feng Huang, Yi-Chen Chen, Hung-Yi Lee, Lin-shan Lee
Embedding audio signal segments into vectors of fixed dimensionality is attractive because all subsequent processing, such as modeling, classification, or indexing, becomes easier and more efficient.
no code implementations • 2 Nov 2018 • Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee
In this paper we propose a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM); see the sketch below.
Automatic Speech Recognition (ASR) +2
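The abstract suggests a GAN-style setup in which the CLM scores whether a token sequence looks like real text; one plausible (Wasserstein-style) formulation, with all names and the weighting being assumptions rather than the paper's objective:

```python
import torch

def clm_adversarial_losses(asr_logits, real_text, clm, supervised_loss, lam=0.1):
    # asr_logits: (B, T, V) recognizer outputs; real_text: (B, T, V) one-hot text
    # clm: network mapping a (B, T, V) sequence to a realness score per example
    fake = torch.softmax(asr_logits, dim=-1)
    # CLM (discriminator): real text should score high, ASR outputs low.
    d_loss = clm(fake.detach()).mean() - clm(real_text).mean()
    # Recognizer (generator): usual supervised loss plus a term for
    # making its hypotheses look like real language to the CLM.
    g_loss = supervised_loss - lam * clm(fake).mean()
    return d_loss, g_loss
```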
no code implementations • 30 Oct 2018 • Yi-Chen Chen, Chia-Hao Shen, Sung-Feng Huang, Hung-Yi Lee, Lin-shan Lee
This can be learned by aligning a small number of spoken words with the corresponding text words in the embedding spaces.
1 code implementation • 9 Aug 2018 • Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-Yi Lee, Lin-shan Lee
In this way, the length constraint mentioned above is removed to offer rhythm-flexible voice conversion without requiring parallel data.
Sound • Audio and Speech Processing
no code implementations • 7 Aug 2018 • Yu-Hsuan Wang, Hung-Yi Lee, Lin-shan Lee
In this paper, we extend audio Word2Vec from the word level to the utterance level by proposing a new segmental audio Word2Vec, in which unsupervised spoken-word boundary segmentation and audio Word2Vec are jointly learned and mutually enhanced, so that an utterance can be directly represented as a sequence of vectors carrying phonetic structure information.
no code implementations • 21 Jul 2018 • Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-Yi Lee, Lin-shan Lee
Stage 1 performs phonetic embedding with speaker characteristics disentangled.
no code implementations • 15 Apr 2018 • Che-Ping Tsai, Yi-Lin Tuan, Lin-shan Lee
Spoken content processing (such as retrieval and browsing) is maturing, but singing content is still almost completely left out.
4 code implementations • 9 Apr 2018 • Ju-chieh Chou, Cheng-chieh Yeh, Hung-Yi Lee, Lin-shan Lee
The decoder then takes the speaker-independent latent representation and the target speaker embedding as input to generate the voice of the target speaker with the linguistic content of the source utterance.
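A minimal sketch of that decoding step, assuming a GRU decoder producing mel-spectrogram frames (the layer types and sizes are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

class VCDecoder(nn.Module):
    def __init__(self, content_dim=128, spk_dim=64, out_dim=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + spk_dim, 256, batch_first=True)
        self.out = nn.Linear(256, out_dim)        # e.g. mel-spectrogram frames

    def forward(self, content, spk_emb):
        # content: (B, T, content_dim) speaker-independent code
        # spk_emb: (B, spk_dim) target speaker embedding, tiled over time
        spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, spk], dim=-1))
        return self.out(h)
```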
no code implementations • 7 Apr 2018 • Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee
Conventional seq2seq chatbot models only try to find the sentences with the highest probabilities conditioned on the input sequences, without considering the sentiment of the output sentences.
no code implementations • 1 Apr 2018 • Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee
Unsupervised discovery of acoustic tokens from audio corpora without annotation and learning vector representations for these tokens have been widely studied.
no code implementations • 28 Nov 2017 • Cheng-Tao Chung, Lin-shan Lee
In this paper, we compare two paradigms for unsupervised discovery of structured acoustic tokens directly from speech corpora without any human annotation.
no code implementations • 29 Oct 2017 • Zih-Wei Lin, Tzu-Wei Sung, Hung-Yi Lee, Lin-shan Lee
In this framework, universal background word vectors are first learned from the background corpora, and then adapted with the personalized corpus of each individual user to learn the personalized word vectors.
no code implementations • 16 Sep 2017 • Bo-Ru Lu, Frank Shyu, Yun-Nung Chen, Hung-Yi Lee, Lin-shan Lee
Connectionist temporal classification (CTC) is a powerful approach for sequence-to-sequence learning, and has been widely used in speech recognition.
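For reference, standard CTC training with PyTorch's built-in loss (a generic usage example, not code from the paper):

```python
import torch
import torch.nn as nn

T, B, C = 50, 4, 20                                  # frames, batch, classes (0 = blank)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, C, (B, 12))               # label sequences, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

# CTC marginalizes over all alignments between the T frames and the labels.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```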
no code implementations • 17 Jul 2017 • Cheng-Tao Chung, Cheng-Yu Tsai, Chia-Hsiang Liu, Lin-shan Lee
A Multi-granular Acoustic Tokenizer (MAT) was proposed for automatic discovery of multiple sets of acoustic tokens from the given corpus.
no code implementations • 26 Dec 2016 • Lang-Chi Yu, Hung-Yi Lee, Lin-shan Lee
In this way, the model for abstractive headline generation for spoken content can be learned from abundant text data and the ASR data for some recognizers.
no code implementations • 16 Sep 2016 • Yen-chen Wu, Tzu-Hsiang Lin, Yang-De Chen, Hung-Yi Lee, Lin-shan Lee
In our previous work, some hand-crafted states estimated from the present retrieval results were used to determine the proper actions.
no code implementations • 28 Aug 2016 • Wei Fang, Jui-Yang Hsu, Hung-Yi Lee, Lin-shan Lee
Multimedia or spoken content presents more attractive information than plain text content, but the former is more difficult to display on a screen and for a user to select.
no code implementations • 23 Aug 2016 • Bo-Hsiang Tseng, Sheng-syun Shen, Hung-Yi Lee, Lin-shan Lee
Multimedia or spoken content presents more attractive information than plain text content, but it is more difficult to display on a screen and for a user to select.
1 code implementation • 3 Mar 2016 • Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, Lin-shan Lee
The vector representations of fixed dimensionality for words (in text) offered by Word2Vec have been shown to be very useful in many application scenarios, in particular due to the semantic information they carry.
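The audio analogue summarizes a variable-length acoustic feature sequence into one fixed-dimensional vector; a sketch of such an encoder (in the paper it is trained inside a sequence-to-sequence autoencoder; the dimensions here are illustrative):

```python
import torch.nn as nn

class AudioEncoder(nn.Module):
    # Summarize a variable-length feature sequence (e.g. MFCCs) into a
    # single fixed-dimensional vector via the final GRU state.
    def __init__(self, feat_dim=39, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, features):                  # (B, T, feat_dim)
        _, h = self.rnn(features)
        return h[-1]                              # (B, emb_dim) audio embedding
```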
no code implementations • 1 Feb 2016 • Cheng-Tao Chung, Cheng-Yu Tsai, Hsiang-Hung Lu, Chia-Hsiang Liu, Hung-Yi Lee, Lin-shan Lee
The multiple sets of token labels are then used as the targets of a Multi-target Deep Neural Network (MDNN) trained on low-level acoustic features.
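A hedged sketch of that multi-target arrangement: a shared trunk over the acoustic features with one classification head per discovered token set (layer sizes are illustrative):

```python
import torch.nn as nn

class MultiTargetDNN(nn.Module):
    def __init__(self, feat_dim, token_set_sizes):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        # One classification head per set of discovered token labels.
        self.heads = nn.ModuleList(nn.Linear(512, n) for n in token_set_sizes)

    def forward(self, x):
        h = self.shared(x)
        return [head(h) for head in self.heads]   # one logit set per target
```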
no code implementations • 8 Nov 2015 • Yi-Hsiu Liao, Hung-Yi Lee, Lin-shan Lee
In this paper we propose the Structured Deep Neural Network (structured DNN) as a structured and deep learning framework.
Automatic Speech Recognition (ASR) +1
no code implementations • 7 Sep 2015 • Cheng-Tao Chung, Chun-an Chan, Lin-shan Lee
This paper presents a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus.
no code implementations • 7 Sep 2015 • Cheng-Tao Chung, Chun-an Chan, Lin-shan Lee
This linguistic structure includes two-level (subword-like and word-like) acoustic patterns, the lexicon of word-like patterns in terms of subword-like patterns and the N-gram language model based on word-like patterns.
no code implementations • 7 Sep 2015 • Cheng-Tao Chung, Wei-Ning Hsu, Cheng-Yi Lee, Lin-shan Lee
This paper presents a novel approach for enhancing the multiple sets of acoustic patterns automatically discovered from a given corpus.
no code implementations • 7 Jun 2015 • Cheng-Tao Chung, Cheng-Yu Tsai, Hsiang-Hung Lu, Yuan-ming Liou, Yen-chen Wu, Yen-Ju Lu, Hung-Yi Lee, Lin-shan Lee
The Multi-layered Acoustic Tokenizer (MAT) proposed in this work automatically discovers multiple sets of acoustic tokens from the given corpus.
no code implementations • 3 Jun 2015 • Yi-Hsiu Liao, Hung-Yi Lee, Lin-shan Lee
In this paper we propose the Structured Deep Neural Network (Structured DNN) as a structured and deep learning algorithm, learning to find the best structured object (such as a label sequence) given a structured input (such as a vector sequence) by globally considering the mapping relationships between the structures rather than item by item (see the sketch below).
Automatic Speech Recognition (ASR) +1
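One way to realize this idea is a network that assigns a single global score to an (input sequence, label sequence) pair, with inference picking the best-scoring candidate structure; everything below is an illustrative assumption, not the paper's exact model:

```python
import torch
import torch.nn as nn

class StructuredScorer(nn.Module):
    def __init__(self, in_dim, n_labels, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(n_labels, 16)
        self.score = nn.Sequential(nn.Linear(in_dim + 16, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, inputs, labels):            # (T, in_dim), (T,) long
        pair = torch.cat([inputs, self.emb(labels)], dim=-1)
        return self.score(pair).sum()             # one global score per pair

def decode(model, inputs, candidates):
    # Inference: choose the candidate label sequence scoring highest globally.
    return max(candidates, key=lambda y: model(inputs, y).item())
```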
no code implementations • 3 Jun 2015 • Bo-Hsiang Tseng, Hung-Yi Lee, Lin-shan Lee
With the popularity of mobile devices, personalized speech recognizers have become more realizable and highly attractive.