Search Results for author: Hung-Yi Lee

Found 229 papers, 91 papers with code

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models’ Transferability

no code implementations Findings (EMNLP) 2021 Wei-Tsung Kao, Hung-Yi Lee

This paper investigates whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.

text-classification Text Classification

Self-supervised Representation Learning for Speech Processing

1 code implementation NAACL (ACL) 2022 Hung-Yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, Katrin Kirchhoff

Given the growing popularity of SSL, and the field's shared mission of bringing speech and language technologies to more use cases with better quality and of scaling them to under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievements in speech processing.

Representation Learning

Codec-SUPERB: An In-Depth Analysis of Sound Codec Models

1 code implementation 20 Feb 2024 Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-Yi Lee

The sound codec's dual roles of minimizing data-transmission latency and serving as a tokenizer underscore its critical importance.

Towards audio language modeling - an overview

no code implementations 20 Feb 2024 Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-Wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-Yi Lee

Neural audio codecs were initially introduced to compress audio data into compact codes to reduce transmission latency.

Language Modelling

Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations

no code implementations 20 Feb 2024 Guan-Ting Lin, Cheng-Han Chiang, Hung-Yi Lee

When text-only LLMs are used to model spoken dialogue, they cannot give different responses based on the speaking style of the current turn.

Sentence

Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations

1 code implementation 8 Feb 2024 Cheng-Han Chiang, Hung-Yi Lee

We show that LLMs can generate paragraphs that contain verifiable facts, but the facts are combined to form a non-factual paragraph due to entity ambiguity.

REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

no code implementations 6 Feb 2024 Liang-Hsuan Tseng, En-Pei Hu, Cheng-Han Chiang, Yuan Tseng, Hung-Yi Lee, Lin-shan Lee, Shao-Hua Sun

A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering

no code implementations 24 Jan 2024 Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee

However, the real-world problem of Open-domain SQA (openSQA), in which the machine must additionally first retrieve, from a spoken archive, passages that possibly contain the answer, had never been considered.

Passage Retrieval Question Answering +4

Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization

no code implementations 23 Jan 2024 Wei-Ping Huang, Sung-Feng Huang, Hung-Yi Lee

This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems, with a focus on achieving language adaptation using minimal labeled and unlabeled data.

Transfer Learning

Examining Forgetting in Continual Pre-training of Aligned Large Language Models

no code implementations 6 Jan 2024 Chen-An Li, Hung-Yi Lee

Recent advances in Large Language Models (LLMs) have exhibited remarkable proficiency across various tasks.

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

no code implementations 5 Jan 2024 Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke

In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text.

In-Context Learning intent-classification +6

PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques

no code implementations 4 Jan 2024 Tzu-Han Lin, How-Shing Wang, Hao-Yung Weng, Kuang-Chen Peng, Zih-Ching Chen, Hung-Yi Lee

Our study conducts extensive experiments to compare different PEFT methods and their layer-wise placement, adapting Differentiable Architecture Search (DARTS).

Ensemble Learning Self-Supervised Learning

Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision

no code implementations 30 Dec 2023 Chih-Kai Yang, Kuan-Po Huang, Ke-Han Lu, Chun-Yi Kuan, Chi-Yuan Hsiao, Hung-Yi Lee

This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora.

Speech-to-Text Translation

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

no code implementations 23 Dec 2023 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-Yi Lee, Ivan Bulyko

Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning.

Attribute Language Modelling +4
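
To illustrate the serialization order described above, here is a minimal, hypothetical Python sketch; the tag names and exact format are assumptions for illustration, not taken from the paper:

    def serialize_turn(history, cur_style, resp_style, resp_text):
        # Order follows the abstract: current-turn paralinguistic attribute,
        # then response attribute, then response text, in one sequence so the
        # model conditions autoregressively on the earlier fields.
        return (f"{history} [CUR_STYLE] {cur_style} "
                f"[RESP_STYLE] {resp_style} [RESPONSE] {resp_text}")

    print(serialize_turn("User: I lost my keys again.", "frustrated",
                         "sympathetic", "Oh no, let's retrace your steps."))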

An Exploration of In-Context Learning for Speech Language Model

no code implementations 19 Oct 2023 Ming-Hao Hsu, Kai-Wei Chang, Shang-Wen Li, Hung-Yi Lee

Despite the success of ICL in NLP, little work has explored the possibility of ICL in speech processing.

Few-Shot Learning In-Context Learning +1

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

no code implementations 17 Oct 2023 Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-Yi Lee

Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4.

In-Context Learning

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

no code implementations 9 Oct 2023 Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.

Language Identification speech-recognition +1

A Closer Look into Automatic Evaluation Using Large Language Models

1 code implementation 9 Oct 2023 Cheng-Han Chiang, Hung-Yi Lee

In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings.

Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages

no code implementations 7 Oct 2023 Shih-Cheng Huang, Pin-Zu Li, Yu-Chi Hsu, Kuang-Ming Chen, Yu Tung Lin, Shih-Kai Hsiao, Richard Tzong-Han Tsai, Hung-Yi Lee

By simply adding the chat vector to a continual pre-trained model's weights, we can endow the model with chat capabilities in new languages without the need for further training.

Instruction Following
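
A minimal sketch of the chat-vector arithmetic described above, assuming the vector is a simple element-wise difference between state dicts; the toy tensors below stand in for real checkpoints:

    import torch

    def apply_chat_vector(base, chat, target, alpha=1.0):
        # chat vector = chat-tuned weights minus base weights; adding it to a
        # continually pre-trained model transfers chat ability without training.
        return {k: target[k] + alpha * (chat[k] - base[k]) for k in target}

    names = ["layer.weight", "layer.bias"]
    base = {k: torch.randn(8) for k in names}
    chat = {k: base[k] + 0.1 * torch.randn(8) for k in names}    # stand-in for chat tuning
    target = {k: base[k] + 0.1 * torch.randn(8) for k in names}  # stand-in for continual pre-training
    merged = apply_chat_vector(base, chat, target)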

Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

1 code implementation 4 Oct 2023 Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, Hung-Yi Lee

We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders.

Language Modelling

Investigating Human-Identifiable Features Hidden in Adversarial Perturbations

no code implementations 28 Sep 2023 Dennis Y. Menn, Tzu-hsun Feng, Sriram Vishwanath, Hung-Yi Lee

Our study contributes to a deeper understanding of the underlying mechanisms behind adversarial attacks and offers insights for the development of more resilient defense strategies for neural networks.

Towards General-Purpose Text-Instruction-Guided Voice Conversion

no code implementations 25 Sep 2023 Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-Yiin Chang, Hung-Yi Lee

This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".

Language Modelling Specificity +1

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

1 code implementation 18 Sep 2023 Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee

To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark.

Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC

no code implementations 10 Jun 2023 Shen-sian Syu, Juncheng Xie, Hung-Yi Lee

In our experiments, our model outperforms the baseline autoregressive model (Transformer base) on multiple datasets, including WMT'14 DE↔EN, WMT'16 RO↔EN, and IWSLT'14 DE↔EN.

Language Modelling Pretrained Multilingual Language Models +1

Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS

no code implementations 8 Jun 2023 Cheng-Han Chiang, Yung-Sung Chuang, James Glass, Hung-Yi Lee

We also show that even if two SEs have similar performance on STS benchmarks, they can have very different behavior on HEROS.

Negation Sentence +1

Why We Should Report the Details in Subjective Evaluation of TTS More Rigorously

1 code implementation 3 Jun 2023 Cheng-Han Chiang, Wei-Ping Huang, Hung-Yi Lee

This paper emphasizes the importance of reporting experiment details in subjective evaluations and demonstrates how such details can significantly impact evaluation results in the field of speech synthesis.

Speech Synthesis

SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts

no code implementations 3 Jun 2023 Haibin Wu, Kai-Wei Chang, Yuan-Kuei Wu, Hung-Yi Lee

In this paper, we present pioneering research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen, with around 10M trainable parameters.

Open-Ended Question Answering

How to Estimate Model Transferability of Pre-Trained Speech Models?

1 code implementation 1 Jun 2023 Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-Yi Lee, Tara N. Sainath

In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks.

MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models

1 code implementation 30 May 2023 Yu-Hsiang Wang, Huang-Yu Chen, Kai-Wei Chang, Winston Hsu, Hung-Yi Lee

In this paper, we introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models, achieving results comparable to SUPERB at significantly lower computational cost.

Self-Supervised Learning

The defender's perspective on automatic speaker verification: An overview

no code implementations 22 May 2023 Haibin Wu, Jiawen Kang, Lingwei Meng, Helen Meng, Hung-Yi Lee

Automatic speaker verification (ASV) plays a critical role in security-sensitive environments.

Speaker Verification

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

no code implementations 18 May 2023 Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.

Automatic Speech Recognition Language Identification +3

Can Large Language Models Be an Alternative to Human Evaluations?

no code implementations 3 May 2023 Cheng-Han Chiang, Hung-Yi Lee

We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs.

Story Generation

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

1 code implementation 15 Mar 2023 Yuan Tseng, Cheng-I Lai, Hung-Yi Lee

The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks

no code implementations 1 Mar 2023 Kai-Wei Chang, Yu-Kai Wang, Hua Shen, Iu-thing Kang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee

For speech processing, SpeechPrompt shows its high parameter efficiency and competitive performance on a few speech classification tasks.

Ranked #17 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)

Classification Language Modelling +1

Ensemble knowledge distillation of self-supervised speech models

no code implementations 24 Feb 2023 Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen, Wei-Cheng Tseng, Kai-Wei Chang, Hung-Yi Lee

We applied two different aggregation techniques, layerwise-average and layerwise-concatenation, to the representations of different teacher models and found that the former was more effective.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
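
The two aggregation schemes compared above can be sketched in a few lines; this is an illustrative reading, with tensor shapes assumed to be (batch, time, dim):

    import torch

    def aggregate(teacher_layers, mode="average"):
        # teacher_layers: one list of per-layer (B, T, D) tensors per teacher.
        per_layer = zip(*teacher_layers)  # regroup: same layer across teachers
        if mode == "average":             # layerwise-average
            return [torch.stack(ts).mean(dim=0) for ts in per_layer]
        return [torch.cat(ts, dim=-1) for ts in per_layer]  # layerwise-concatenation

    t1 = [torch.randn(2, 50, 768) for _ in range(3)]
    t2 = [torch.randn(2, 50, 768) for _ in range(3)]
    targets = aggregate([t1, t2], mode="average")  # distillation targets for the student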

Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs

no code implementations 30 Jan 2023 Guan-Ting Liu, En-Pei Hu, Pu-Jen Cheng, Hung-Yi Lee, Shao-Hua Sun

Aiming to produce reinforcement learning (RL) policies that are human-interpretable and can generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space to continuously parameterize diverse programs from a pre-generated program dataset, and then searches for a task-solving program in the learned program embedding space when given a task.

reinforcement-learning Reinforcement Learning (RL)

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

no code implementations 20 Dec 2022 Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.

Dialog Act Classification Question Answering +4

General Framework for Self-Supervised Model Priming for Parameter-Efficient Fine-tuning

no code implementations 2 Dec 2022 Shih-Cheng Huang, Shih-Heng Wang, Min-Han Shih, Saurav Sahay, Hung-Yi Lee

To tackle these issues, we propose a general framework to enhance the few-shot adaptation and cross-domain generalization ability of parameter-efficient methods.

Domain Generalization

CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models

no code implementations 1 Dec 2022 Zih-Ching Chen, Yu-Shun Sung, Hung-Yi Lee

However, such efficient tuning techniques only provide adaptation at the Transformer layers and fail to adapt the feature extractor.

Self-Supervised Learning

EURO: ESPnet Unsupervised ASR Open-source Toolkit

1 code implementation 30 Nov 2022 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-Yi Lee, Shinji Watanabe, Sanjeev Khudanpur

This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Model Extraction Attack against Self-supervised Speech Models

no code implementations 29 Nov 2022 Tsu-Yuan Hsu, Chen-An Li, Tung-Yu Wu, Hung-Yi Lee

In the first stage, SSL is conducted on the large-scale unlabeled corpus to pre-train a small speech model.

Model extraction Self-Supervised Learning

MelHuBERT: A simplified HuBERT on Mel spectrograms

1 code implementation 17 Nov 2022 Tzu-Quan Lin, Hung-Yi Lee, Hao Tang

Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks.

Automatic Speech Recognition Self-Supervised Learning +3

Compressing Transformer-based self-supervised models for speech processing

1 code implementation 17 Nov 2022 Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-Yi Lee, Hao Tang

Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices.

Knowledge Distillation Model Compression +1

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

no code implementations 6 Nov 2022 Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-Yi Lee

To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Once-for-All Sequence Compression for Self-Supervised Speech Models

no code implementations 4 Nov 2022 Hsuan-Jui Chen, Yen Meng, Hung-Yi Lee

The sequence length along the time axis is often the dominant factor of the computation in speech processing.

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

no code implementations 2 Nov 2022 Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.

Image Retrieval Retrieval +1

T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

1 code implementation 1 Nov 2022 Chan-Jan Hsu, Ho-Lam Chung, Hung-Yi Lee, Yu Tsao

In spoken language understanding (SLU), a natural solution is concatenating pre-trained speech models (e.g. HuBERT) and pretrained language models (PLM, e.g. T5).

Language Modelling Question Answering +1

Multimodal Transformer Distillation for Audio-Visual Synchronization

2 code implementations 27 Oct 2022 Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-Yi Lee, Jyh-Shing Roger Jang

This paper proposes the MTDVocaLiST model, which is trained with our proposed multimodal Transformer distillation (MTD) loss.

Audio-Visual Synchronization

On Compressing Sequences for Self-Supervised Speech Models

no code implementations 13 Oct 2022 Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning

Are Synonym Substitution Attacks Really Synonym Substitution Attacks?

no code implementations 6 Oct 2022 Cheng-Han Chiang, Hung-Yi Lee

In this paper, we explore the following question: Are synonym substitution attacks really synonym substitution attacks (SSAs)?

Sentence

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

1 code implementation 3 Oct 2022 Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.

Language Modelling Retrieval +1

The Efficacy of Self-Supervised Speech Models for Audio Representations

1 code implementation 26 Sep 2022 Tung-Yu Wu, Chen-An Li, Tzu-Han Lin, Tsu-Yuan Hsu, Hung-Yi Lee

Extensive experiments on speech and non-speech audio datasets are conducted to investigate the representation abilities of our ensemble method and its single constituent model.

Pitch Classification Representation Learning +1

Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding

no code implementations 27 Jun 2022 Wei-Ping Huang, Po-Chun Chen, Sung-Feng Huang, Hung-Yi Lee

This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech (TTS) problem under the few-shot setting.

Few-Shot Learning Transfer Learning

Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion

no code implementations 18 Jun 2022 Haibin Wu, Jiawen Kang, Lingwei Meng, Yang Zhang, Xixin Wu, Zhiyong Wu, Hung-Yi Lee, Helen Meng

However, previous works show that state-of-the-art ASV models are seriously vulnerable to voice spoofing attacks, and the recently proposed high-performance spoofing countermeasure (CM) models focus solely on the standalone anti-spoofing task, ignoring the subsequent speaker verification process.

Open-Ended Question Answering Speaker Verification

Searching for the Essence of Adversarial Perturbations

no code implementations 30 May 2022 Dennis Y. Menn, Tzu-hsun Feng, Hung-Yi Lee

Neural networks have demonstrated state-of-the-art performance in various machine learning fields.

Autonomous Driving

Structured Prompt Tuning

no code implementations 24 May 2022 Chi-Liang Liu, Hung-Yi Lee, Wen-tau Yih

We propose structured prompt tuning, a simple and effective method to improve prompt tuning.

Self-Supervised Speech Representation Learning: A Review

no code implementations 21 May 2022 Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

no code implementations ACL 2022 Chan-Jan Hsu, Hung-Yi Lee, Yu Tsao

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks.

Natural Language Understanding

Re-Examining Human Annotations for Interpretable NLP

no code implementations 10 Apr 2022 Cheng-Han Chiang, Hung-Yi Lee

Our results reveal that the annotation quality is highly subject to the workers' qualification, and workers can be guided to provide certain annotations by the instructions.

Understanding, Detecting, and Separating Out-of-Distribution Samples and Adversarial Samples in Text Classification

no code implementations 9 Apr 2022 Cheng-Han Chiang, Hung-Yi Lee

Based on our observations, we propose a simple method to separate ID, OOD, and Adv samples using the hidden representations and output probabilities of the model.

text-classification Text Classification
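
One plausible reading of the separation recipe above, hedged as a generic sketch rather than the paper's method: score each input with the model's maximum softmax probability and with a distance in hidden space, then threshold the two scores jointly; the statistics and thresholds here are placeholders:

    import torch

    def detection_scores(logits, hidden, id_mean):
        # Output-probability cue: confident predictions tend to be in-distribution.
        msp = torch.softmax(logits, dim=-1).max(dim=-1).values
        # Hidden-representation cue: distance from the in-distribution training mean.
        dist = (hidden - id_mean).norm(dim=-1)
        return msp, dist  # threshold jointly to separate ID / OOD / adversarial

    msp, dist = detection_scores(torch.randn(4, 2), torch.randn(4, 768),
                                 torch.zeros(768))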

SpeechPrompt: An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks

1 code implementation 31 Mar 2022 Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee

We report in this paper the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).

Language Modelling Self-Supervised Learning

Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation

no code implementations 30 Mar 2022 Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-Yi Lee

Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.

Domain Adaptation

Spoofing-Aware Speaker Verification by Multi-Level Fusion

no code implementations 29 Mar 2022 Haibin Wu, Lingwei Meng, Jiawen Kang, Jinchao Li, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

In the second-level fusion, the CM score and the scores directly from the ASV systems are concatenated into a prediction block for the final decision.

Speaker Verification
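
A small sketch of the second-level (score) fusion described above, assuming one CM score and two ASV scores; the fusion network's shape is an arbitrary illustration:

    import torch
    import torch.nn as nn

    fusion = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))

    cm_score = torch.tensor([0.82])          # countermeasure (anti-spoofing) score
    asv_scores = torch.tensor([0.61, 0.74])  # scores from the ASV systems
    block = torch.cat([cm_score, asv_scores]).unsqueeze(0)  # the "prediction block"
    accept_prob = torch.sigmoid(fusion(block))              # final decision score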

Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition

2 code implementations 27 Mar 2022 Guan-Ting Lin, Shang-Wen Li, Hung-Yi Lee

Although deep learning-based end-to-end Automatic Speech Recognition (ASR) has shown remarkable performance in recent years, it suffers severe performance regression on test samples drawn from different data distributions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation

1 code implementation 22 Mar 2022 Chih-Chiang Chang, Hung-Yi Lee

Simultaneous speech translation (SimulST) is a challenging task aiming to translate streaming speech before the complete input is observed.

Translation

Anticipation-Free Training for Simultaneous Machine Translation

1 code implementation IWSLT (ACL) 2022 Chih-Chiang Chang, Shun-Po Chuang, Hung-Yi Lee

Existing methods increase latency or introduce adaptive read-write policies for SimulMT models to handle local reordering and improve translation quality.

Hallucination Machine Translation +2

Membership Inference Attacks Against Self-supervised Speech Models

1 code implementation 9 Nov 2021 Wei-Cheng Tseng, Wei-Tsung Kao, Hung-Yi Lee

Recently, adapting the idea of self-supervised learning (SSL) on continuous speech has started gaining attention.

Self-Supervised Learning

Characterizing the adversarial vulnerability of speech self-supervised learning

no code implementations 8 Nov 2021 Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

As the paradigm of a self-supervised learning upstream model followed by downstream tasks attracts more attention in the speech community, characterizing the adversarial robustness of such a paradigm is of high priority.

Adversarial Robustness Benchmarking +2

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

1 code implementation 7 Nov 2021 Sung-Feng Huang, Chyi-Jiunn Lin, Da-Rong Liu, Yi-Chen Chen, Hung-Yi Lee

On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples.

Meta-Learning Speech Synthesis

Don't speak too fast: The impact of data bias on self-supervised speech models

no code implementations 15 Oct 2021 Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-Yi Lee

Self-supervised Speech Models (S3Ms) have been proven successful in many speech downstream tasks, like ASR.

Toward Degradation-Robust Voice Conversion

no code implementations 14 Oct 2021 Chien-yu Huang, Kai-Wei Chang, Hung-Yi Lee

However, in real-world scenarios, it is difficult to collect clean utterances of a speaker, and they are usually degraded by noise or reverberation.

Denoising Speech Enhancement +1

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

2 code implementations 12 Oct 2021 Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting.

Benchmarking Voice Conversion

CheerBots: Chatbots toward Empathy and Emotion using Reinforcement Learning

no code implementations 8 Oct 2021 Jiun-Hao Jhan, Chao-Peng Liu, Shyh-Kang Jeng, Hung-Yi Lee

Apart from the coherence and fluency of responses, an empathetic chatbot emphasizes more on people's feelings.

Chatbot reinforcement-learning +2

Analyzing the Robustness of Unsupervised Speech Recognition

no code implementations 7 Oct 2021 Guan-Ting Lin, Chan-Jan Hsu, Da-Rong Liu, Hung-Yi Lee, Yu Tsao

In this work, we further analyze the training robustness of unsupervised ASR on the domain mismatch scenarios in which the domains of unpaired speech and text are different.

Generative Adversarial Network speech-recognition +2

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT

1 code implementation 5 Oct 2021 Heng-Jui Chang, Shu-wen Yang, Hung-Yi Lee

Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks.

Multi-Task Learning Representation Learning
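
The layer-wise distillation named in the title can be gestured at with a multi-head sketch: a shared student feature feeds one small prediction head per selected teacher layer. This is a simplified reading (the released recipe also combines L1 and cosine objectives); shapes and layer choices are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LayerwiseHeads(nn.Module):
        def __init__(self, d_student=768, d_teacher=768, n_layers=3):
            super().__init__()
            self.heads = nn.ModuleList(nn.Linear(d_student, d_teacher)
                                       for _ in range(n_layers))

        def forward(self, student_feat, teacher_feats):
            # One head per teacher layer; sum the per-layer regression losses.
            return sum(F.l1_loss(head(student_feat), t)
                       for head, t in zip(self.heads, teacher_feats))

    model = LayerwiseHeads()
    loss = model(torch.randn(2, 50, 768),
                 [torch.randn(2, 50, 768) for _ in range(3)])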

On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets

1 code implementation 8 Sep 2021 Cheng-Han Chiang, Hung-Yi Lee

In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks.

Parallelized Reverse Curriculum Generation

no code implementations 4 Aug 2021 Zih-Yun Chiu, Yi-Lin Tuan, Hung-Yi Lee, Li-Chen Fu

For reinforcement learning (RL), it is challenging for an agent to master a task that requires a specific series of actions due to sparse rewards.

Reinforcement Learning (RL)

Voting for the right answer: Adversarial defense for speaker verification

1 code implementation 15 Jun 2021 Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-Yi Lee

Automatic speaker verification (ASV) is a well-developed technology for biometric identification and has been ubiquitously implemented in security-critical applications, such as banking and access control.

Adversarial Defense Speaker Verification
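
As a rough illustration of the voting idea in the title, one can average ASV scores over randomly perturbed copies of the test input so that a single adversarial perturbation cannot dominate the decision; this sketch is an assumption about the mechanism, not the paper's exact procedure:

    import torch
    import torch.nn.functional as F

    def voted_score(score_fn, enroll, test, n_votes=8, sigma=0.01):
        # Each "voter" scores a slightly noised copy of the test representation.
        votes = [score_fn(enroll, test + sigma * torch.randn_like(test))
                 for _ in range(n_votes)]
        return torch.stack(votes).mean()

    cosine = lambda a, b: F.cosine_similarity(a, b, dim=0)
    score = voted_score(cosine, torch.randn(192), torch.randn(192))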

Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation

1 code implementation Findings (ACL) 2021 Shun-Po Chuang, Yung-Sung Chuang, Chih-Chiang Chang, Hung-Yi Lee

We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC), and use CTC-based automatic speech recognition as an auxiliary task to improve the performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Improving Cross-Lingual Reading Comprehension with Self-Training

no code implementations 8 May 2021 Wei-Cheng Huang, Chien-yu Huang, Hung-Yi Lee

Substantial improvements have been made in machine reading comprehension, where the machine answers questions based on a given context.

Machine Reading Comprehension

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

3 code implementations 7 Apr 2021 Jheng-Hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-Yi Lee

AUTOVC used d-vectors to extract speaker information, while FragmentVC uses self-supervised learning (SSL) features like wav2vec 2.0 to extract phonetic content information.

Self-Supervised Learning Voice Conversion

Towards Lifelong Learning of End-to-end ASR

no code implementations 4 Apr 2021 Heng-Jui Chang, Hung-Yi Lee, Lin-shan Lee

We can collect new data describing the new environment and fine-tune the system, but this naturally leads to higher error rates for the earlier datasets, referred to as catastrophic forgetting.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines

1 code implementation 31 Mar 2021 Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-Yi Lee, Lei Xie

The Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task.

AutoML BIG-bench Machine Learning +1

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

no code implementations 12 Mar 2021 Wei-Tsung Kao, Hung-Yi Lee

This paper investigates whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.

General Classification text-classification +1

Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

1 code implementation 6 Mar 2021 Chung-Ming Chien, Jheng-Hao Lin, Chien-yu Huang, Po-chun Hsu, Hung-Yi Lee

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples.

Voice Cloning Voice Conversion

Pre-Training a Language Model Without Human Language

no code implementations 22 Dec 2020 Cheng-Han Chiang, Hung-Yi Lee

In this paper, we study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.

Language Modelling

TaylorGAN: Neighbor-Augmented Policy Update Towards Sample-Efficient Natural Language Generation

1 code implementation NeurIPS 2020 Chun-Hsing Lin, Siang-Ruei Wu, Hung-Yi Lee, Yun-Nung Chen

Score function-based natural language generation (NLG) approaches such as REINFORCE, in general, suffer from low sample efficiency and training instability problems.

Text Generation

TaylorGAN: Neighbor-Augmented Policy Update for Sample-Efficient Natural Language Generation

1 code implementation 27 Nov 2020 Chun-Hsing Lin, Siang-Ruei Wu, Hung-Yi Lee, Yun-Nung Chen

Score function-based natural language generation (NLG) approaches such as REINFORCE, in general, suffer from low sample efficiency and training instability problems.

Text Generation

How Far Are We from Robust Voice Conversion: A Survey

no code implementations 24 Nov 2020 Tzu-Hsien Huang, Jheng-Hao Lin, Chien-yu Huang, Hung-Yi Lee

Voice conversion technologies have been greatly improved in recent years with the help of deep learning, but their capabilities of producing natural sounding utterances in different conditions remain unclear.

Speaker Identification Voice Conversion

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

1 code implementation 12 Nov 2020 Chung-Ming Chien, Hung-Yi Lee

Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks.

Speech Synthesis

AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization

1 code implementation 31 Oct 2020 Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, Hung-Yi Lee

With a proper activation as an information bottleneck on content embeddings, the trade-off between the synthesis quality and the speaker similarity of the converted speech is improved drastically.

Audio and Speech Processing Sound

Stabilizing Label Assignment for Speech Separation by Self-supervised Pre-training

1 code implementation 29 Oct 2020 Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-Yi Lee

Speech separation has been well developed with the very successful permutation invariant training (PIT) approach, although the frequent label-assignment switching that happens during PIT training remains a problem when better convergence speed and achievable performance are desired.

Ranked #6 on Speech Separation on Libri2Mix (using extra training data)

Speaker Separation Speech Enhancement +1

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

2 code implementations 27 Oct 2020 Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-Yi Lee, Lin-shan Lee

Any-to-any voice conversion aims to convert the voice from and to any speaker, even ones unseen during training, which is much more challenging than one-to-one or many-to-many tasks but much more attractive in real-world scenarios.

Disentanglement Speaker Verification +1

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

1 code implementation 26 Oct 2020 Cheng-I Lai, Yung-Sung Chuang, Hung-Yi Lee, Shang-Wen Li, James Glass

Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data.

Language Modelling Spoken Language Understanding

What makes multilingual BERT multilingual?

no code implementations 20 Oct 2020 Chi-Liang Liu, Tsung-Yuan Hsu, Yung-Sung Chuang, Hung-Yi Lee

Recently, multilingual BERT has worked remarkably well on cross-lingual transfer tasks, outperforming static non-contextualized word embeddings.

Cross-Lingual Transfer Word Embeddings

Pretrained Language Model Embryology: The Birth of ALBERT

1 code implementation EMNLP 2020 Cheng-Han Chiang, Sung-Feng Huang, Hung-Yi Lee

These findings suggest that knowledge of a pretrained model varies during pretraining, and having more pretrain steps does not necessarily provide a model with more comprehensive knowledge.

Language Modelling POS +1

Investigation of Sentiment Controllable Chatbot

no code implementations 11 Jul 2020 Hung-Yi Lee, Cheng-Hao Ho, Chien-Fu Lin, Chiung-Chih Chang, Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen

Conventional seq2seq chatbot models attempt only to find sentences with the highest probabilities conditioned on the input sequences, without considering the sentiment of the output sentences.

Chatbot reinforcement-learning +1

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

1 code implementation 7 Jun 2020 Da-Yi Wu, Yen-Hao Chen, Hung-Yi Lee

Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another one's while preserving the linguistic content.

Disentanglement Quantization +1

Understanding Self-Attention of Self-Supervised Audio Transformers

2 code implementations 5 Jun 2020 Shu-wen Yang, Andy T. Liu, Hung-Yi Lee

Self-supervised Audio Transformers (SAT) have enabled great success in many downstream speech applications like ASR, but how they work has not yet been widely explored.

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

5 code implementations 5 Jun 2020 Haibin Wu, Andy T. Liu, Hung-Yi Lee

To explore this issue, we proposed to employ Mockingjay, a self-supervised learning based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.

Self-Supervised Learning Speaker Verification +1

Defending Your Voice: Adversarial Attack on Voice Conversion

1 code implementation 18 May 2020 Chien-yu Huang, Yist Y. Lin, Hung-Yi Lee, Lin-shan Lee

We introduce human imperceptible noise into the utterances of a speaker whose voice is to be defended.

Adversarial Attack Voice Conversion

Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

no code implementations 16 May 2020 Tao Tu, Yuan-Jui Chen, Alexander H. Liu, Hung-Yi Lee

The experiment results demonstrate that with only an hour of paired speech data, whether the paired data is from multiple speakers or a single speaker, the proposed model can generate intelligible speech in different voices.

Speech Synthesis Text-To-Speech Synthesis

WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

1 code implementation 15 May 2020 Po-chun Hsu, Hung-Yi Lee

Because we design a heavily compressed flow-based model, the proposed model requires far fewer computational resources than other waveform-generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveform.

Speech Synthesis Text-To-Speech Synthesis Audio and Speech Processing Sound

DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation

no code implementations 13 May 2020 Yi-Chen Chen, Jui-Yang Hsu, Cheng-Kuang Lee, Hung-Yi Lee

In order to examine the generalizability of DARTS-ASR, we apply our approach not only on many languages to perform monolingual ASR, but also on a multilingual ASR setting.

speech-recognition Speech Recognition

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

no code implementations 5 May 2020 Heng-Jui Chang, Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee

Whispering is an important mode of human speech, but no end-to-end recognition results for it have been reported yet, probably due to the scarcity of available whispered speech data.

speech-recognition Speech Recognition +1

A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT

no code implementations 20 Apr 2020 Chi-Liang Liu, Tsung-Yuan Hsu, Yung-Sung Chuang, Hung-Yi Lee

Recently, multilingual BERT has worked remarkably well on cross-lingual transfer tasks, outperforming static non-contextualized word embeddings.

Cross-Lingual Transfer Translation +1

Defense against adversarial attacks on spoofing countermeasures of ASV

no code implementations 6 Mar 2020 Haibin Wu, Songxiang Liu, Helen Meng, Hung-Yi Lee

Various forefront countermeasure methods for automatic speaker verification (ASV) with considerable anti-spoofing performance were proposed in the ASVspoof 2019 challenge.

Speaker Verification

BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT

no code implementations 25 Jan 2020 Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, Hung-Yi Lee

Although Bidirectional Encoder Representations from Transformers (BERT) has achieved tremendous success in many natural language processing (NLP) tasks, it remains a black box.

Sentence

MITAS: A Compressed Time-Domain Audio Separation Network with Parameter Sharing

no code implementations 9 Dec 2019 Chao-I Tuan, Yuan-Kuei Wu, Hung-Yi Lee, Yu Tsao

Our experimental results first confirmed the robustness of our MiTAS on two types of perturbations in mixed audio.

Speech Separation

Towards Robust Neural Vocoding for Speech Generation: A Survey

no code implementations 5 Dec 2019 Po-chun Hsu, Chun-hsuan Wang, Andy T. Liu, Hung-Yi Lee

We found out that the speaker variety is much more important for achieving a universal vocoder than the language.

Speech Synthesis Voice Conversion

J-Net: Randomly weighted U-Net for audio source separation

1 code implementation 29 Nov 2019 Bo-Wen Chen, Yen-Min Hsu, Hung-Yi Lee

Based on these discoveries, we pose two questions: what is the value of randomly weighted networks in difficult generative audio tasks such as audio source separation, and does such a positive correlation still exist between large random networks and their trained counterparts?

Audio Source Separation

Training a code-switching language model with monolingual data

no code implementations 14 Nov 2019 Shun-Po Chuang, Tzu-Wei Sung, Hung-Yi Lee

A lack of code-switching data complicates the training of code-switching (CS) language models.

Language Modelling Translation +1

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

no code implementations 28 Oct 2019 Alexander H. Liu, Tao Tu, Hung-Yi Lee, Lin-shan Lee

In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to phoneme sequences of speech utterances.

Clustering Quantization +4

Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding

1 code implementation 28 Oct 2019 Alexander H. Liu, Tzu-Wei Sung, Shun-Po Chuang, Hung-Yi Lee, Lin-shan Lee

This allows the decoder to consider the semantic consistency during decoding by absorbing the information carried by the transformed decoder feature, which is learned to be close to the target word embedding.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
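
One way to read the regularization described above is as a loss that pulls a transformed decoder feature toward the embedding of the target word; this is a hedged sketch, with the projection and loss form assumed:

    import torch
    import torch.nn.functional as F

    def embedding_regularizer(dec_feat, target_ids, embedding):
        # dec_feat: (B, T, D) transformed decoder features; pull each one
        # toward the embedding of its target word via cosine distance.
        tgt = embedding(target_ids)  # (B, T, D)
        return (1 - F.cosine_similarity(dec_feat, tgt, dim=-1)).mean()

    emb = torch.nn.Embedding(1000, 256)
    loss = embedding_regularizer(torch.randn(2, 7, 256),
                                 torch.randint(0, 1000, (2, 7)), emb)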

Interrupted and cascaded permutation invariant training for speech separation

1 code implementation 28 Oct 2019 Gene-Ping Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-Yi Lee, Lin-shan Lee

Permutation Invariant Training (PIT) has long been a stepping stone method for training speech separation model in handling the label ambiguity problem.

Speech Separation
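
For reference, the core PIT objective that this work builds on can be written in a few lines; a minimal sketch, with MSE assumed as the per-source loss:

    import itertools
    import torch
    import torch.nn.functional as F

    def pit_loss(est, ref):
        # est, ref: (n_sources, T). Try every source permutation and keep the
        # best one -- this is what makes the training permutation-invariant.
        n = ref.size(0)
        losses = [
            sum(F.mse_loss(est[p[i]], ref[i]) for i in range(n)) / n
            for p in itertools.permutations(range(n))
        ]
        return min(losses)  # the label assignment is the argmin permutation

    loss = pit_loss(torch.randn(2, 16000), torch.randn(2, 16000))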

SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering

no code implementations 25 Oct 2019 Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee, Lin-shan Lee

In addition to the potential of end-to-end SQA, SpeechBERT can also be considered for many other spoken language understanding tasks, just as BERT is for many text processing tasks.

Language Modelling Question Answering +2

Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders

7 code implementations 25 Oct 2019 Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, Hung-Yi Lee

We present Mockingjay as a new speech representation learning approach, where bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech.

General Classification Representation Learning +3

Adversarial Attacks on Spoofing Countermeasures of automatic speaker verification

1 code implementation 19 Oct 2019 Songxiang Liu, Haibin Wu, Hung-Yi Lee, Helen Meng

High-performance spoofing countermeasure systems for automatic speaker verification (ASV) have been proposed in the ASVspoof 2019 challenge.

Speaker Verification

DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs

1 code implementation IJCNLP 2019 Yi-Lin Tuan, Yun-Nung Chen, Hung-Yi Lee

This paper proposes a new task of applying dynamic knowledge graphs in neural conversation models and presents a novel TV-series conversation corpus (DyKgChat) for the task.

Benchmarking Dialogue Generation +1

Tree Transformer: Integrating Tree Structures into Self-Attention

3 code implementations IJCNLP 2019 Yau-Shian Wang, Hung-Yi Lee, Yun-Nung Chen

This paper proposes Tree Transformer, which adds an extra constraint to attention heads of the bidirectional Transformer encoder in order to encourage the attention heads to follow tree structures.

Language Modelling

Order-free Learning Alleviating Exposure Bias in Multi-label Classification

1 code implementation 8 Sep 2019 Che-Ping Tsai, Hung-Yi Lee

In this paper, we propose a new framework for MLC which does not rely on a predefined label order and thus alleviates exposure bias.

General Classification Multi-Label Classification

Cross-Lingual Transfer Learning for Question Answering

no code implementations 13 Jul 2019 Chia-Hsuan Lee, Hung-Yi Lee

In this paper, we explore the problem of cross-lingual transfer learning for QA, where a source language task with plentiful annotations is utilized to improve the performance of a QA model on a target language task with limited available annotations.

Cross-Lingual Transfer Machine Translation +4

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

1 code implementation 28 May 2019 Andy T. Liu, Po-chun Hsu, Hung-Yi Lee

We found that the proposed encoding method offers automatic extraction of speech content from speaker style, and is sufficient to cover full linguistic content in a given language.

Voice Conversion

Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering

1 code implementation 16 Apr 2019 Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, Lin-shan Lee

Substantial effort has been reported based on approaches over spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals.

Clustering Speech Separation

End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning

no code implementations 13 Apr 2019 Tao Tu, Yuan-Jui Chen, Cheng-chieh Yeh, Hung-Yi Lee

In this paper, we aim to build TTS systems for such low-resource (target) languages where only very limited paired data are available.

Cross-Lingual Transfer Transfer Learning

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

11 code implementations 10 Apr 2019 Ju-chieh Chou, Cheng-chieh Yeh, Hung-Yi Lee

Recently, voice conversion (VC) without parallel data has been successfully adapted to multi-target scenario in which a single model is trained to convert the input voice to many different speakers.

Voice Conversion
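
The instance-normalization trick named in the title can be sketched generically: normalize each content channel over time to strip speaker statistics, then re-apply the target speaker's statistics (an AdaIN-style step). A simplified illustration, not the paper's full model:

    import torch

    def instance_norm(x, eps=1e-5):
        # x: (B, C, T); per-channel statistics over time carry speaker
        # information, so removing them leaves mostly linguistic content.
        mu = x.mean(dim=-1, keepdim=True)
        sd = x.std(dim=-1, keepdim=True)
        return (x - mu) / (sd + eps)

    def adain(content, target_speaker):
        # Re-stylize the normalized content with the target speaker's stats.
        mu = target_speaker.mean(dim=-1, keepdim=True)
        sd = target_speaker.std(dim=-1, keepdim=True)
        return instance_norm(content) * sd + mu

    converted = adain(torch.randn(1, 256, 120), torch.randn(1, 256, 80))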

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

no code implementations 10 Apr 2019 Yi-Chen Chen, Sung-Feng Huang, Hung-Yi Lee, Lin-shan Lee

However, we note that human babies start to learn language from the sounds (or phonetic structures) of a small number of exemplar words and "generalize" such knowledge to other words without hearing a large amount of data.

speech-recognition Speech Recognition +1

Completely Unsupervised Speech Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models

no code implementations 8 Apr 2019 Kuan-Yu Chen, Che-Ping Tsai, Da-Rong Liu, Hung-Yi Lee, Lin-shan Lee

Producing a large annotated speech corpus for training ASR systems remains difficult for the more than 95% of the world's languages that are low-resourced, but collecting a relatively large unlabeled dataset for such languages is more achievable.

Generative Adversarial Network speech-recognition +2

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

no code implementations 7 Nov 2018 Sung-Feng Huang, Yi-Chen Chen, Hung-Yi Lee, Lin-shan Lee

Embedding audio signal segments into vectors with fixed dimensionality is attractive because all following processing will be easier and more efficient, for example modeling, classifying or indexing.

Clustering

Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model

no code implementations 2 Nov 2018 Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee

In this paper we proposed a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

2 code implementations 30 Oct 2018 Li-Wei Chen, Hung-Yi Lee, Yu Tsao

This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed.

Speech Recognition Voice Conversion

Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks

1 code implementation EMNLP 2018 Yau-Shian Wang, Hung-Yi Lee

The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output.

Abstractive Text Summarization

TopicGAN: Unsupervised Text Generation from Explainable Latent Topics

no code implementations 27 Sep 2018 Yau-Shian Wang, Yun-Nung Chen, Hung-Yi Lee

Learning discrete representations of data and then generating data from the discovered representations have been increasingly studied because the obtained discrete representations can benefit unsupervised learning.

Image Generation Text Generation

Temporal Pattern Attention for Multivariate Time Series Forecasting

4 code implementations 12 Sep 2018 Shun-Yao Shih, Fan-Keng Sun, Hung-Yi Lee

To obtain accurate predictions, it is crucial to model long-term dependency in time series data, which can be achieved to a good extent by a recurrent neural network (RNN) with an attention mechanism.

Multivariate Time Series Forecasting Time Series +1

Proximal Policy Optimization and its Dynamic Version for Sequence Generation

no code implementations 24 Aug 2018 Yi-Lin Tuan, Jinzhi Zhang, Yujia Li, Hung-Yi Lee

In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning.

Chatbot Model Optimization +2

Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation

1 code implementation 16 Aug 2018 Yi-Lin Tuan, Hung-Yi Lee

To stabilize the training of SeqGAN, Monte Carlo tree search (MCTS) or reward at every generation step (REGS) is used to evaluate the goodness of a generated subsequence.

Dialogue Generation

Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences

1 code implementation 9 Aug 2018 Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-Yi Lee, Lin-shan Lee

In this way, the length constraint mentioned above is removed to offer rhythm-flexible voice conversion without requiring parallel data.

Sound Audio and Speech Processing

ODSQA: Open-domain Spoken Question Answering Dataset

1 code implementation 7 Aug 2018 Chia-Hsuan Lee, Shang-Ming Wang, Huan-Cheng Chang, Hung-Yi Lee

Reading comprehension by machine has been widely studied, but machine comprehension of spoken content is still a less investigated problem.

Data Augmentation Question Answering +1

Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection

no code implementations 7 Aug 2018 Yu-Hsuan Wang, Hung-Yi Lee, Lin-shan Lee

In this paper, we extend audio Word2Vec from word-level to utterance-level by proposing a new segmental audio Word2Vec, in which unsupervised spoken word boundary segmentation and audio Word2Vec are jointly learned and mutually enhanced, so an utterance can be directly represented as a sequence of vectors carrying phonetic structure information.

Segmentation
