Search Results for author: Lin-shan Lee

Found 41 papers, 9 papers with code

REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

no code implementations6 Feb 2024 Liang-Hsuan Tseng, En-Pei Hu, Cheng-Han Chiang, Yuan Tseng, Hung-Yi Lee, Lin-shan Lee, Shao-Hua Sun

A word/phoneme in the speech signal is represented by a segment of the signal with variable length and unknown boundaries, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data.

Automatic Speech Recognition (ASR) +2

SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering

no code implementations24 Jan 2024 Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee

However, the real-world problem of Open-domain SQA (openSQA), in which the machine must additionally first retrieve from a spoken archive the passages that possibly contain the answer, had never been considered.

Passage Retrieval, Question Answering +4

Towards Lifelong Learning of End-to-end ASR

no code implementations4 Apr 2021 Heng-Jui Chang, Hung-Yi Lee, Lin-shan Lee

We can collect new data describing the new environment and fine-tune the system, but this naturally leads to higher error rates on the earlier datasets, a phenomenon referred to as catastrophic forgetting.

Automatic Speech Recognition (ASR) +1
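
As a rough illustration of one classic countermeasure to the catastrophic forgetting described above, the sketch below applies an elastic-weight-consolidation (EWC) penalty; EWC is a standard lifelong-learning baseline, not necessarily the method of this paper, and `old_params`, `fisher`, and the model are hypothetical placeholders.

```python
import torch.nn as nn

def ewc_penalty(model: nn.Module, old_params: dict, fisher: dict, lam: float = 1.0):
    """Elastic Weight Consolidation penalty: parameters that mattered for the
    earlier dataset (large Fisher values) are kept close to their old values.
    `old_params` and `fisher` map parameter names to tensors saved after
    training on the earlier task (hypothetical placeholders here)."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * penalty

# Usage sketch: total_loss = asr_loss + ewc_penalty(model, old_params, fisher)
```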

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

2 code implementations27 Oct 2020 Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-Yi Lee, Lin-shan Lee

Any-to-any voice conversion aims to convert voices between arbitrary speakers, even those unseen during training, which is much more challenging than the one-to-one or many-to-many tasks, but much more attractive in real-world scenarios.

Disentanglement, Speaker Verification +1
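
A minimal sketch of the general idea the title describes, fusing fine-grained fragments of the target speaker's utterances into source content features with attention; the feature dimensions and extractors are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Source content features attend over frames of the target speaker's
# utterances, so each output frame is fused from target "fragments".
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

src_content = torch.randn(1, 120, 256)   # content features of the source speech (dummy)
tgt_frames  = torch.randn(1, 400, 256)   # encoded frames of target-speaker speech (dummy)

fused, weights = attn(query=src_content, key=tgt_frames, value=tgt_frames)
# `fused` keeps the source's length and content; each frame is a weighted blend
# of target-speaker fragments, which a decoder would turn into the converted voice.
```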

Defending Your Voice: Adversarial Attack on Voice Conversion

1 code implementation18 May 2020 Chien-yu Huang, Yist Y. Lin, Hung-Yi Lee, Lin-shan Lee

We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended.

Adversarial Attack, Voice Conversion
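
A minimal sketch of how such a perturbation can be computed, assuming an FGSM-style step against a speaker encoder; `speaker_encoder` and `true_emb` are hypothetical stand-ins, not the paper's actual components.

```python
import torch
import torch.nn.functional as F

def defend_utterance(wave, speaker_encoder, true_emb, eps=1e-3):
    """One FGSM step: perturb the waveform so its speaker embedding moves
    away from the speaker's true embedding, while keeping the change small
    (and ideally imperceptible). `wave` is a 1-D waveform tensor and
    `true_emb` a 1-D embedding; both models/tensors are placeholders."""
    wave = wave.clone().requires_grad_(True)
    sim = F.cosine_similarity(speaker_encoder(wave), true_emb, dim=-1)
    sim.backward()                                   # gradient of the similarity
    return (wave - eps * wave.grad.sign()).detach()  # step that lowers similarity
```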

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

no code implementations5 May 2020 Heng-Jui Chang, Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee

Whispering is an important mode of human speech, but no end-to-end recognition results for it have been reported yet, probably due to the scarcity of available whispered speech data.

Speech Recognition +1

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

no code implementations28 Oct 2019 Alexander H. Liu, Tao Tu, Hung-Yi Lee, Lin-shan Lee

In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to phoneme sequences of speech utterances.

Clustering, Quantization +4
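
A minimal sketch of the quantization step such a model relies on: nearest-codebook assignment with a straight-through gradient so the encoder remains trainable end to end; the codebook size and dimensions are assumptions.

```python
import torch

def quantize(z, codebook):
    """Nearest-neighbor vector quantization with a straight-through gradient,
    mapping continuous frame representations onto a small set of
    phoneme-like codes. z: (T, D) frame representations, codebook: (K, D)."""
    d = torch.cdist(z, codebook)          # (T, K) distances to every code
    idx = d.argmin(dim=1)                 # nearest code per frame
    q = codebook[idx]                     # quantized sequence, (T, D)
    q = z + (q - z).detach()              # straight-through: gradients flow to z
    return q, idx
```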

Interrupted and cascaded permutation invariant training for speech separation

1 code implementation28 Oct 2019 Gene-Ping Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-Yi Lee, Lin-shan Lee

Permutation Invariant Training (PIT) has long been a stepping-stone method for training speech separation models to handle the label ambiguity problem.

Speech Separation
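
For reference, a minimal PIT loss: every assignment of estimated sources to reference sources is scored and the best one is trained on, which is how PIT resolves label ambiguity. MSE over (sources, time) tensors is an assumption; the paper's interrupted/cascaded variants build on this base.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_loss(est, ref):
    """Permutation Invariant Training loss for one mixture.
    est, ref: (S, T) tensors of S estimated / reference sources."""
    S = est.shape[0]
    losses = [
        sum(F.mse_loss(est[i], ref[j]) for i, j in enumerate(perm)) / S
        for perm in itertools.permutations(range(S))
    ]
    return torch.stack(losses).min()      # train on the best assignment
```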

Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding

1 code implementation28 Oct 2019 Alexander H. Liu, Tzu-Wei Sung, Shun-Po Chuang, Hung-Yi Lee, Lin-shan Lee

This allows the decoder to consider the semantic consistency during decoding by absorbing the information carried by the transformed decoder feature, which is learned to be close to the target word embedding.

Automatic Speech Recognition (ASR) +1
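
A minimal sketch of the regularization idea described above, assuming a cosine penalty that pulls transformed decoder features toward pretrained embeddings of the target words; the shapes and weighting are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def embedding_reg_loss(dec_feat, target_ids, word_emb):
    """Pull decoder features toward the target word embeddings so decoding
    can also exploit the semantic information they carry.
    dec_feat: (B, L, D) decoder features, target_ids: (B, L) word ids,
    word_emb: (V, D) pretrained embedding table (assumed fixed)."""
    tgt = word_emb[target_ids]                         # (B, L, D)
    cos = F.cosine_similarity(dec_feat, tgt, dim=-1)   # (B, L)
    return (1.0 - cos).mean()

# Usage sketch: loss = ce_loss + alpha * embedding_reg_loss(feat, ids, emb)
```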

SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering

no code implementations25 Oct 2019 Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee, Lin-shan Lee

In addition to the potential of end-to-end SQA, SpeechBERT can also be applied to many other spoken language understanding tasks, just as BERT is to many text processing tasks.

Language Modelling, Question Answering +2

Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering

1 code implementation16 Apr 2019 Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, Lin-shan Lee

Substantial effort has been reported on approaches based on the spectrogram, the well-known standard time-and-frequency cross-domain representation for speech signals.

Clustering, Speech Separation
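
For reference, computing the magnitude spectrogram the abstract refers to; the window and hop sizes are typical values, not the paper's settings.

```python
import torch

# One second of 16 kHz audio (dummy waveform).
wave = torch.randn(16000)

# Short-time Fourier transform -> complex spectrum -> magnitude spectrogram,
# i.e. the standard time-and-frequency representation of the signal.
stft = torch.stft(wave, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
spec = stft.abs()                       # (freq_bins, frames) = (257, ~126)
```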

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

no code implementations10 Apr 2019 Yi-Chen Chen, Sung-Feng Huang, Hung-Yi Lee, Lin-shan Lee

However, we note that human babies start to learn language from the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" such knowledge to other words without hearing a large amount of data.

Speech Recognition +1

Completely Unsupervised Speech Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models

no code implementations8 Apr 2019 Kuan-Yu Chen, Che-Ping Tsai, Da-Rong Liu, Hung-Yi Lee, Lin-shan Lee

Producing a large annotated speech corpus for training ASR systems remains difficult for the more than 95% of the world's languages that are low-resourced, but collecting a relatively large unlabeled data set for such languages is more achievable.

Generative Adversarial Network, Speech Recognition +2

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

no code implementations7 Nov 2018 Sung-Feng Huang, Yi-Chen Chen, Hung-Yi Lee, Lin-shan Lee

Embedding audio signal segments into vectors of fixed dimensionality is attractive because all subsequent processing, for example modeling, classification, or indexing, becomes easier and more efficient.

Clustering

Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model

no code implementations2 Nov 2018 Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee

In this paper we propose a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM).

Automatic Speech Recognition (ASR) +2

Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences

1 code implementation9 Aug 2018 Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-Yi Lee, Lin-shan Lee

In this way, the length constraint mentioned above is removed to offer rhythm-flexible voice conversion without requiring parallel data.

Sound, Audio and Speech Processing

Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection

no code implementations7 Aug 2018 Yu-Hsuan Wang, Hung-Yi Lee, Lin-shan Lee

In this paper, we extend audio Word2Vec from word-level to utterance-level by proposing a new segmental audio Word2Vec, in which unsupervised spoken word boundary segmentation and audio Word2Vec are jointly learned and mutually enhanced, so an utterance can be directly represented as a sequence of vectors carrying phonetic structure information.

Segmentation

Transcribing Lyrics From Commercial Song Audio: The First Step Towards Singing Content Processing

no code implementations15 Apr 2018 Che-Ping Tsai, Yi-Lin Tuan, Lin-shan Lee

Spoken content processing (such as retrieval and browsing) is maturing, but singing content is still almost completely left out.

Data Augmentation, Retrieval

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

4 code implementations9 Apr 2018 Ju-chieh Chou, Cheng-chieh Yeh, Hung-Yi Lee, Lin-shan Lee

The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.

Voice Conversion
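
A minimal sketch of the decoder described above: it consumes a speaker-independent latent together with a learned target-speaker embedding and generates frames in the target voice; all layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VCDecoder(nn.Module):
    """Decoder sketch: speaker-independent latent + target speaker embedding
    in, converted acoustic frames out. Sizes are assumptions."""
    def __init__(self, latent_dim=128, n_speakers=20, spk_dim=64, out_dim=80):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)  # one vector per target speaker
        self.net = nn.GRU(latent_dim + spk_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, out_dim)               # e.g. mel-spectrogram frames

    def forward(self, latent, speaker_id):
        # latent: (B, T, latent_dim), the speaker-independent linguistic content
        spk = self.spk_emb(speaker_id)                    # (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, latent.size(1), -1)
        h, _ = self.net(torch.cat([latent, spk], dim=-1))
        return self.proj(h)                               # frames in the target voice
```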

Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis

no code implementations7 Apr 2018 Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee

Conventional seq2seq chatbot models only try to find the sentences with the highest probabilities conditioned on the input sequences, without considering the sentiment of the output sentences.

Chatbot, Reinforcement Learning +1

Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings

no code implementations1 Apr 2018 Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee

Unsupervised discovery of acoustic tokens from audio corpora without annotation and learning vector representations for these tokens have been widely studied.

Generative Adversarial Network

Unsupervised Discovery of Structured Acoustic Tokens with Applications to Spoken Term Detection

no code implementations28 Nov 2017 Cheng-Tao Chung, Lin-shan Lee

In this paper, we compare two paradigms for unsupervised discovery of structured acoustic tokens directly from speech corpora without any human annotation.

Personalized Word Representations Carrying Personalized Semantics Learned from Social Network Posts

no code implementations29 Oct 2017 Zih-Wei Lin, Tzu-Wei Sung, Hung-Yi Lee, Lin-shan Lee

In this framework, universal background word vectors are first learned from the background corpora, and then adapted by the personalized corpus for each individual user to learn the personalized word vectors.

Sentence, Sentence Completion
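
A minimal sketch of the two-stage recipe described above, assuming gensim's Word2Vec for both stages: universal background vectors are trained first, then adapted on one user's posts; the corpora here are dummies.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized corpora: a large background corpus and one user's posts.
background = [["universal", "background", "corpus"], ["more", "background", "text"]]
user_posts = [["this", "user", "writes", "like", "this"]]

model = Word2Vec(background, vector_size=100, min_count=1)  # background word vectors
model.build_vocab(user_posts, update=True)                  # add the user's vocabulary
model.train(user_posts,                                     # adapt -> personalized vectors
            total_examples=len(user_posts), epochs=5)
```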

Order-Preserving Abstractive Summarization for Spoken Content Based on Connectionist Temporal Classification

no code implementations16 Sep 2017 Bo-Ru Lu, Frank Shyu, Yun-Nung Chen, Hung-Yi Lee, Lin-shan Lee

Connectionist temporal classification (CTC) is a powerful approach for sequence-to-sequence learning, and has been popularly used in speech recognition.

Abstractive Text Summarization, General Classification +2
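
For reference, standard CTC loss usage in PyTorch, the mechanism the abstract builds on: CTC aligns a long input sequence to a shorter output sequence without explicit alignment. The dimensions and targets are dummies.

```python
import torch
import torch.nn as nn

T, B, C = 50, 2, 30                            # input steps, batch, classes (0 = blank)
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 12))         # output token ids, no blanks
lens_in = torch.full((B,), T, dtype=torch.long)
lens_out = torch.full((B,), 12, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, lens_in, lens_out)
```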

Unsupervised Iterative Deep Learning of Speech Features and Acoustic Tokens with Applications to Spoken Term Detection

no code implementations17 Jul 2017 Cheng-Tao Chung, Cheng-Yu Tsai, Chia-Hsiang Liu, Lin-shan Lee

A Multi-granular Acoustic Tokenizer (MAT) was proposed for automatic discovery of multiple sets of acoustic tokens from the given corpus.

Abstractive Headline Generation for Spoken Content by Attentive Recurrent Neural Networks with ASR Error Modeling

no code implementations26 Dec 2016 Lang-Chi Yu, Hung-Yi Lee, Lin-shan Lee

In this way, the model for abstractive headline generation for spoken content can be learned from abundant text data and the ASR data for some recognizers.

Abstractive Text Summarization, Document Summarization +1

Interactive Spoken Content Retrieval by Deep Reinforcement Learning

no code implementations16 Sep 2016 Yen-chen Wu, Tzu-Hsiang Lin, Yang-De Chen, Hung-Yi Lee, Lin-shan Lee

In our previous work, some hand-crafted states estimated from the present retrieval results are used to determine the proper actions.

Q-Learning, Reinforcement Learning +4

Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content

no code implementations28 Aug 2016 Wei Fang, Jui-Yang Hsu, Hung-Yi Lee, Lin-shan Lee

Multimedia or spoken content presents more attractive information than plain text content, but the former is more difficult to display on a screen and be selected by a user.

Reading Comprehension

Towards Machine Comprehension of Spoken Content: Initial TOEFL Listening Comprehension Test by Machine

no code implementations23 Aug 2016 Bo-Hsiang Tseng, Sheng-syun Shen, Hung-Yi Lee, Lin-shan Lee

Multimedia or spoken content presents more attractive information than plain text content, but it is more difficult to display on a screen and be selected by a user.

Reading Comprehension, Sentence

Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder

1 code implementation3 Mar 2016 Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, Lin-shan Lee

The vector representations of fixed dimensionality for words (in text) offered by Word2Vec have been shown to be very useful in many application scenarios, in particular due to the semantic information they carry.

Denoising, Dynamic Time Warping
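
A minimal sketch of a sequence-to-sequence autoencoder of the kind described: a GRU encoder compresses a variable-length acoustic segment into one fixed-dimension vector and a decoder reconstructs the input from it; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AudioWord2Vec(nn.Module):
    """Seq2seq autoencoder sketch: encode a segment to a fixed vector,
    decode back to the frame sequence. Sizes are assumptions."""
    def __init__(self, feat_dim=39, emb_dim=128):
        super().__init__()
        self.enc = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.dec = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, feat_dim)

    def forward(self, x):                 # x: (B, T, feat_dim), e.g. MFCC frames
        _, h = self.enc(x)                # h: (1, B, emb_dim), the audio vector
        y, _ = self.dec(x, h)             # teacher forcing, h initializes the decoder
        return self.out(y), h.squeeze(0)  # reconstruction and fixed-length embedding

# Training sketch: minimize MSE between x and the reconstruction; after
# training, the embedding serves as the segment's vector representation.
```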

An Iterative Deep Learning Framework for Unsupervised Discovery of Speech Features and Linguistic Units with Applications on Spoken Term Detection

no code implementations1 Feb 2016 Cheng-Tao Chung, Cheng-Yu Tsai, Hsiang-Hung Lu, Chia-Hsiang Liu, Hung-Yi Lee, Lin-shan Lee

The multiple sets of token labels are then used as the targets of a Multi-target Deep Neural Network (MDNN) trained on low-level acoustic features.
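
A minimal sketch of a multi-target DNN as described: a shared trunk over low-level acoustic features with one classification head per discovered token set; the feature dimension and token inventory sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MDNN(nn.Module):
    """Multi-target DNN sketch: shared layers, one softmax head per token set."""
    def __init__(self, in_dim=39, token_set_sizes=(50, 100, 300)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(256, k) for k in token_set_sizes)

    def forward(self, x):                         # x: (B, in_dim) acoustic features
        h = self.trunk(x)
        return [head(h) for head in self.heads]   # one set of logits per token set

# Training sketch: total loss = sum of cross-entropies, one per token label set.
```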

Towards Structured Deep Neural Network for Automatic Speech Recognition

no code implementations8 Nov 2015 Yi-Hsiu Liao, Hung-Yi Lee, Lin-shan Lee

In this paper we propose the Structured Deep Neural Network (structured DNN) as a structured and deep learning framework.

Automatic Speech Recognition (ASR) +1

Unsupervised Spoken Term Detection with Spoken Queries by Multi-level Acoustic Patterns with Varying Model Granularity

no code implementations7 Sep 2015 Cheng-Tao Chung, Chun-an Chan, Lin-shan Lee

This paper presents a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus.

Unsupervised Discovery of Linguistic Structure Including Two-level Acoustic Patterns Using Three Cascaded Stages of Iterative Optimization

no code implementations7 Sep 2015 Cheng-Tao Chung, Chun-an Chan, Lin-shan Lee

This linguistic structure includes two-level (subword-like and word-like) acoustic patterns, the lexicon of word-like patterns in terms of subword-like patterns and the N-gram language model based on word-like patterns.

Language Modelling
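
A minimal sketch of the last element of that structure, an N-gram (here bigram, unsmoothed) language model over word-like pattern IDs; the pattern sequences are dummies.

```python
from collections import Counter

# Hypothetical decoded sequences of word-like pattern IDs.
seqs = [["w3", "w7", "w3"], ["w7", "w3", "w1"]]

bigrams, contexts = Counter(), Counter()
for s in seqs:
    contexts.update(s[:-1])                 # count each pattern as a context
    bigrams.update(zip(s[:-1], s[1:]))      # count adjacent pattern pairs

def p(w2, w1):
    """Maximum-likelihood bigram probability P(w2 | w1), no smoothing."""
    return bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0
```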

Enhancing Automatically Discovered Multi-level Acoustic Patterns Considering Context Consistency With Applications in Spoken Term Detection

no code implementations7 Sep 2015 Cheng-Tao Chung, Wei-Ning Hsu, Cheng-Yi Lee, Lin-shan Lee

This paper presents a novel approach for enhancing the multiple sets of acoustic patterns automatically discovered from a given corpus.

Towards Structured Deep Neural Network for Automatic Speech Recognition

no code implementations3 Jun 2015 Yi-Hsiu Liao, Hung-Yi Lee, Lin-shan Lee

In this paper we propose the Structured Deep Neural Network (Structured DNN) as a structured and deep learning algorithm, learning to find the best structured object (such as a label sequence) given a structured input (such as a vector sequence) by globally considering the mapping relationships between the structures rather than item by item.

Automatic Speech Recognition (ASR) +1
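
A minimal sketch of what "globally considering the mapping relationships" can mean at decoding time: a Viterbi search over per-frame label scores plus label-transition scores, a generic structured decoder rather than the paper's exact formulation.

```python
import torch

def viterbi(emissions, transitions):
    """Pick the label sequence maximizing the total of per-frame scores plus
    transition scores, instead of deciding each frame independently.
    emissions: (T, K) per-frame label scores, transitions: (K, K)."""
    T, K = emissions.shape
    score = emissions[0]
    back = []
    for t in range(1, T):
        # total[i, j] = best score ending in label i, then moving to label j
        total = score.unsqueeze(1) + transitions + emissions[t]
        score, idx = total.max(dim=0)       # best previous label for each j
        back.append(idx)
    best = [int(score.argmax())]            # backtrace from the best final label
    for idx in reversed(back):
        best.append(int(idx[best[-1]]))
    return best[::-1]
```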
