Search Results for author: Hung-Yi Lee

Found 251 papers, 100 papers with code

Self-supervised Representation Learning for Speech Processing

1 code implementation NAACL (ACL) 2022 Hung-Yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, Katrin Kirchhoff

Due to the growing popularity of SSL, and the shared mission of these areas to bring speech and language technologies to more use cases with better quality and to scale these technologies to under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievements in speech processing.

Representation Learning

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models’ Transferability

no code implementations Findings (EMNLP) 2021 Wei-Tsung Kao, Hung-Yi Lee

This paper investigates whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.

text-classification Text Classification

Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models

1 code implementation 9 Jul 2024 Yi-Cheng Lin, Tzu-Quan Lin, Chih-Kai Yang, Ke-Han Lu, Wei-Chih Chen, Chun-Yi Kuan, Hung-Yi Lee

Speech Integrated Large Language Models (SILLMs) combine large language models with speech perception to perform diverse tasks, ranging from emotion recognition to speaker verification, demonstrating universal audio understanding capability.

coreference-resolution Emotion Recognition +4

Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course

no code implementations 7 Jul 2024 Cheng-Han Chiang, Wei-Chih Chen, Chun-Yi Kuan, Chienchou Yang, Hung-Yi Lee

Based on student responses, we find that LLM-based assignment evaluators are generally acceptable to students when students have free access to these LLM-based evaluators.

Language Modelling Large Language Model

Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model

no code implementations 2 Jul 2024 Yu-Kuan Fu, Cheng-Kuang Lee, Hsiu-Hsuan Wang, Hung-Yi Lee

Recent efforts in Spoken Dialogue Modeling aim to synthesize spoken dialogue without the need for direct transcription, thereby preserving the wealth of non-textual information inherent in speech.

Dialogue Generation Diversity +1

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

no code implementations 1 Jul 2024 Tzu-Han Lin, Chen-An Li, Hung-Yi Lee, Yun-Nung Chen

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors.

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

no code implementations 27 Jun 2024 Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-Yi Lee

Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs).

Descriptive Instruction Following

Can LLMs Understand the Implication of Emphasized Sentences in Dialogue?

no code implementations 16 Jun 2024 Guan-Ting Lin, Hung-Yi Lee

Emphasis is a crucial component in human communication, which indicates the speaker's intention and implication beyond pure text in dialogue.

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

no code implementations 16 Jun 2024 Hung-Ting Su, Chun-Tong Chao, Ya-Ching Hsu, Xudong Lin, Yulei Niu, Hung-Yi Lee, Winston H. Hsu

This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames.

On the Evaluation of Speech Foundation Models for Spoken Language Understanding

no code implementations 14 Jun 2024 Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head.

Benchmarking speech-recognition +2
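The three evaluation protocols above differ only in which parameter groups are trainable. A minimal sketch of that configuration logic (hypothetical function and protocol names, not the paper's actual code):

```python
def configure_protocol(protocol, sfm_params, light_head_params, complex_head_params):
    """Return the trainable parameter groups for each evaluation protocol.

    Protocol names here are illustrative labels for the three setups:
    (i) frozen SFM + lightweight head, (ii) frozen SFM + complex head,
    (iii) fine-tuned SFM + lightweight head.
    """
    if protocol == "frozen+light":
        # Speech foundation model stays frozen; only the small head is trained.
        return {"head": light_head_params}
    if protocol == "frozen+complex":
        # Frozen SFM with a larger, more expressive prediction head.
        return {"head": complex_head_params}
    if protocol == "finetune+light":
        # Both the SFM and the lightweight head receive gradient updates.
        return {"sfm": sfm_params, "head": light_head_params}
    raise ValueError(f"unknown protocol: {protocol}")
```

In a real framework the same decision would typically be made by toggling `requires_grad` on the SFM's parameters before building the optimizer.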

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

1 code implementation 13 Jun 2024 Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-Yi Lee

Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment.

Benchmarking Language Modelling +1

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

no code implementations 12 Jun 2024 Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-Yi Lee, Shinji Watanabe

This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

1 code implementation 12 Jun 2024 Chun-Yi Kuan, Wei-Ping Huang, Hung-Yi Lee

Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks.

Audio captioning Hallucination +2

CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

no code implementations 11 Jun 2024 Haibin Wu, Yuan Tseng, Hung-Yi Lee

Additionally, we verify that anti-spoofing models trained on commonly used datasets cannot detect synthesized speech from current codec-based speech generation systems.

Audio Synthesis Face Swapping +1

Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

no code implementations 9 Jun 2024 Chih-Kai Yang, Kuan-Po Huang, Hung-Yi Lee

This research explores how the information of prompts interacts with the high-performing speech recognition model, Whisper.

speech-recognition Speech Recognition

DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models

no code implementations 8 Jun 2024 Tzu-Quan Lin, Hung-Yi Lee, Hao Tang

We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple rounds of training and fine-tuning.
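The early-exit idea above can be sketched as a simple decision rule: stop at the first layer whose self-supervised loss is already low enough. This is an illustrative sketch only; the function name and the exact threshold criterion are assumptions, not DAISY's published algorithm:

```python
def choose_exit_layer(layerwise_ssl_losses, threshold):
    """Pick the earliest layer whose self-supervised loss falls below a threshold.

    layerwise_ssl_losses: per-layer SSL loss values for one input, ordered
    from the shallowest to the deepest layer (hypothetical interface).
    """
    for layer, loss in enumerate(layerwise_ssl_losses):
        if loss <= threshold:
            return layer  # exit early: this layer is already "good enough"
    # No layer met the criterion; fall back to the final layer.
    return len(layerwise_ssl_losses) - 1
```

Because the criterion is computed from the self-supervised loss itself, no labeled data or extra fine-tuning round is needed to decide the exit point.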

Emo-bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition

no code implementations 7 Jun 2024 Yi-Cheng Lin, Haibin Wu, Huang-Cheng Chou, Chi-Chun Lee, Hung-Yi Lee

The rapid growth of Speech Emotion Recognition (SER) has diverse global applications, from improving human-computer interactions to aiding mental health diagnostics.

Self-Supervised Learning Speech Emotion Recognition

On the social bias of speech self-supervised models

no code implementations 7 Jun 2024 Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-Yi Lee

We probe how various factors, such as model architecture, size, and training methodologies, influence the propagation of social bias within these models.

Model Compression Self-Supervised Learning

Singing Voice Graph Modeling for SingFake Detection

1 code implementation 5 Jun 2024 Xuanjun Chen, Haibin Wu, Jyh-Shing Roger Jang, Hung-Yi Lee

Detecting singing voice deepfakes, or SingFake, involves determining the authenticity and copyright of a singing voice.

DeepFake Detection Face Swapping

Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition

no code implementations 5 Jun 2024 Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-Yi Lee

Synthetic data is widely used in speech recognition due to the availability of text-to-speech models, which facilitate adapting models to previously unseen text domains.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

InstructionCP: A fast approach to transfer Large Language Models into target language

no code implementations 30 May 2024 Kuang-Ming Chen, Hung-Yi Lee

To adapt these models to other languages, continual pre-training (CP) is often employed, followed by supervised fine-tuning (SFT) to maintain conversational abilities.

Instruction Following

LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play

no code implementations 10 May 2024 Li-Chun Lu, Shou-Jen Chen, Tsung-Min Pai, Chan-Hung Yu, Hung-Yi Lee, Shao-Hua Sun

Large language models (LLMs) have shown exceptional proficiency in natural language processing but often fall short of generating creative and original responses to open-ended questions.

Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations

1 code implementation 20 Feb 2024 Guan-Ting Lin, Cheng-Han Chiang, Hung-Yi Lee

When used to model spoken dialogue, text-only LLMs cannot give different responses based on the speaking style of the current turn.

Sentence

Codec-SUPERB: An In-Depth Analysis of Sound Codec Models

1 code implementation 20 Feb 2024 Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-Yi Lee

The sound codec's dual roles in minimizing data transmission latency and serving as tokenizers underscore its critical importance.

Towards audio language modeling - an overview

no code implementations 20 Feb 2024 Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-Wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-Yi Lee

Neural audio codecs are initially introduced to compress audio data into compact codes to reduce transmission latency.

Language Modelling

Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations

1 code implementation 8 Feb 2024 Cheng-Han Chiang, Hung-Yi Lee

We show that D-FActScore can better assess the factuality of paragraphs with entity ambiguity than FActScore.

REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

1 code implementation 6 Feb 2024 Liang-Hsuan Tseng, En-Pei Hu, Cheng-Han Chiang, Yuan Tseng, Hung-Yi Lee, Lin-shan Lee, Shao-Hua Sun

REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
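The alternating training described in the abstract can be outlined as a small loop. This is a schematic sketch with hypothetical function names, not REBORN's released code; the actual training objectives (reinforcement-learned boundaries, phoneme prediction) are abstracted behind the two callbacks:

```python
def reborn_iteration(train_segmenter, train_phoneme_predictor, speech_feats, num_rounds=3):
    """Alternate between (1) training the boundary-segmentation model and
    (2) training the phoneme predictor on the resulting segments.

    train_segmenter / train_phoneme_predictor are caller-supplied training
    routines (hypothetical interface); each round uses the model produced
    by the other step in the previous round.
    """
    segmenter = phoneme_predictor = None
    for _ in range(num_rounds):
        # Step 1: learn segment boundaries, guided by the current phoneme model.
        segmenter = train_segmenter(speech_feats, phoneme_predictor)
        # Step 2: learn phoneme transcription from the newly segmented features.
        phoneme_predictor = train_phoneme_predictor(speech_feats, segmenter)
    return segmenter, phoneme_predictor
```

The key design point is that neither model is trained in isolation: each round's segmentation conditions the next phoneme model, and vice versa.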

SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering

no code implementations 24 Jan 2024 Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee

However, the real-world problem of open-domain SQA (openSQA), in which the machine must additionally first retrieve, from a spoken archive, passages that possibly contain the answer, has never been considered.

Passage Retrieval Question Answering +4

Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization

no code implementations 23 Jan 2024 Wei-Ping Huang, Sung-Feng Huang, Hung-Yi Lee

This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems, with a focus on achieving language adaptation using minimal labeled and unlabeled data.

Transfer Learning

Examining Forgetting in Continual Pre-training of Aligned Large Language Models

no code implementations 6 Jan 2024 Chen-An Li, Hung-Yi Lee

Recent advances in Large Language Models (LLMs) have exhibited remarkable proficiency across various tasks.

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

no code implementations 5 Jan 2024 Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke

In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text.

In-Context Learning intent-classification +6

PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques

no code implementations 4 Jan 2024 Tzu-Han Lin, How-Shing Wang, Hao-Yung Weng, Kuang-Chen Peng, Zih-Ching Chen, Hung-Yi Lee

Our study conducts extensive experiments to compare different PEFT methods and their layer-wise placement, adapting Differentiable Architecture Search (DARTS).

Ensemble Learning Self-Supervised Learning

Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision

1 code implementation 30 Dec 2023 Chih-Kai Yang, Kuan-Po Huang, Ke-Han Lu, Chun-Yi Kuan, Chi-Yuan Hsiao, Hung-Yi Lee

This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora.

Speech-to-Text Translation

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

no code implementations 23 Dec 2023 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-Yi Lee, Ivan Bulyko

Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning.

Attribute Language Modelling +4

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

no code implementations 17 Oct 2023 Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-Yi Lee

Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4.

In-Context Learning

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

no code implementations 9 Oct 2023 Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.

Language Identification speech-recognition +1

A Closer Look into Automatic Evaluation Using Large Language Models

1 code implementation 9 Oct 2023 Cheng-Han Chiang, Hung-Yi Lee

In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details in the evaluation process change how well the ratings given by LLMs correlate with human ratings.

Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages

no code implementations 7 Oct 2023 Shih-Cheng Huang, Pin-Zu Li, Yu-Chi Hsu, Kuang-Ming Chen, Yu Tung Lin, Shih-Kai Hsiao, Richard Tzong-Han Tsai, Hung-Yi Lee

By simply adding the chat vector to a continual pre-trained model's weights, we can endow the model with chat capabilities in new languages without the need for further training.

Instruction Following
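The chat-vector operation described above is plain weight arithmetic: subtract a base model's weights from a chat-tuned model's weights, then add that difference to a continually pre-trained model. A minimal sketch over dictionaries of scalar "weights" (real models would use tensors, and the parameter-matching details are glossed over here):

```python
def apply_chat_vector(base_weights, chat_weights, target_weights):
    """Compute chat_vector = chat - base and add it to the target model.

    All three arguments map parameter names to values; in practice these
    would be state dicts of same-architecture models. Illustrative only.
    """
    return {
        name: target_weights[name] + (chat_weights[name] - base_weights[name])
        for name in target_weights
    }
```

The appeal of the approach is that the target model gains chat behavior with no further training, only this element-wise addition.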

Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

1 code implementation 4 Oct 2023 Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, Hung-Yi Lee

We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders.

Language Modelling

Investigating Human-Identifiable Features Hidden in Adversarial Perturbations

no code implementations 28 Sep 2023 Dennis Y. Menn, Tzu-hsun Feng, Sriram Vishwanath, Hung-Yi Lee

Our study contributes to a deeper understanding of the underlying mechanisms behind adversarial attacks and offers insights for the development of more resilient defense strategies for neural networks.

Towards General-Purpose Text-Instruction-Guided Voice Conversion

no code implementations 25 Sep 2023 Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-Yiin Chang, Hung-Yi Lee

This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".

Language Modelling Specificity +1

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

1 code implementation 18 Sep 2023 Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee

To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark.

Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC

no code implementations 10 Jun 2023 Shen-sian Syu, Juncheng Xie, Hung-Yi Lee

In our experiments, our model outperforms the baseline autoregressive model (Transformer base) on multiple datasets, including WMT'14 DE↔EN, WMT'16 RO↔EN, and IWSLT'14 DE↔EN.

Language Modelling Pretrained Multilingual Language Models +1

Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS

no code implementations 8 Jun 2023 Cheng-Han Chiang, Yung-Sung Chuang, James Glass, Hung-Yi Lee

We also show that even if two SEs have similar performance on STS benchmarks, they can have very different behavior on HEROS.

Negation Sentence +1

SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts

no code implementations 3 Jun 2023 Haibin Wu, Kai-Wei Chang, Yuan-Kuei Wu, Hung-Yi Lee

In this paper, we present pioneering research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen, with around 10M trainable parameters.

Open-Ended Question Answering

Why We Should Report the Details in Subjective Evaluation of TTS More Rigorously

1 code implementation 3 Jun 2023 Cheng-Han Chiang, Wei-Ping Huang, Hung-Yi Lee

This paper emphasizes the importance of reporting experiment details in subjective evaluations and demonstrates how such details can significantly impact evaluation results in the field of speech synthesis.

Speech Synthesis

How to Estimate Model Transferability of Pre-Trained Speech Models?

1 code implementation 1 Jun 2023 Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-Yi Lee, Tara N. Sainath

In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks.

MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models

1 code implementation 30 May 2023 Yu-Hsiang Wang, Huang-Yu Chen, Kai-Wei Chang, Winston Hsu, Hung-Yi Lee

In this paper, we introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models, with results comparable to SUPERB but significantly lower computational costs.

Self-Supervised Learning

The defender's perspective on automatic speaker verification: An overview

no code implementations 22 May 2023 Haibin Wu, Jiawen Kang, Lingwei Meng, Helen Meng, Hung-Yi Lee

Automatic speaker verification (ASV) plays a critical role in security-sensitive environments.

Speaker Verification

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

no code implementations 18 May 2023 Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.

Automatic Speech Recognition Language Identification +3

Can Large Language Models Be an Alternative to Human Evaluations?

no code implementations 3 May 2023 Cheng-Han Chiang, Hung-Yi Lee

We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs.

Story Generation

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

1 code implementation 15 Mar 2023 Yuan Tseng, Cheng-I Lai, Hung-Yi Lee

The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks

no code implementations 1 Mar 2023 Kai-Wei Chang, Yu-Kai Wang, Hua Shen, Iu-thing Kang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee

For speech processing, SpeechPrompt shows its high parameter efficiency and competitive performance on a few speech classification tasks.

Ranked #17 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)

Classification Language Modelling +1

Ensemble knowledge distillation of self-supervised speech models

no code implementations 24 Feb 2023 Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen, Wei-Cheng Tseng, Kai-Wei Chang, Hung-Yi Lee

We applied two different aggregation techniques, layerwise-average and layerwise-concatenation, to the representations of the different teacher models, and found that the former was more effective.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
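The two aggregation techniques compared above can be sketched directly. This is an illustrative toy over plain lists (real implementations would operate on tensors of layer-wise hidden states); the function name and interface are assumptions:

```python
def aggregate_layerwise(teacher_reps, mode="average"):
    """Combine per-layer representations from several teacher models.

    teacher_reps: list over teachers, each a list over layers, each layer a
    feature vector (plain list of floats here for illustration).
    mode="average": element-wise mean across teachers at each layer.
    mode="concat":  concatenate the teachers' features at each layer.
    """
    num_layers = len(teacher_reps[0])
    aggregated = []
    for layer in range(num_layers):
        layer_feats = [teacher[layer] for teacher in teacher_reps]
        if mode == "average":
            aggregated.append([sum(vals) / len(vals) for vals in zip(*layer_feats)])
        elif mode == "concat":
            aggregated.append([v for feat in layer_feats for v in feat])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return aggregated
```

Note the trade-off this makes concrete: averaging keeps the feature dimension fixed, while concatenation grows it with the number of teachers, enlarging the student's prediction targets.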

Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs

no code implementations 30 Jan 2023 Guan-Ting Liu, En-Pei Hu, Pu-Jen Cheng, Hung-Yi Lee, Shao-Hua Sun

Aiming to produce reinforcement learning (RL) policies that are human-interpretable and can generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space to continuously parameterize diverse programs from a pre-generated program dataset, and then searches for a task-solving program in the learned program embedding space when given a task.

reinforcement-learning Reinforcement Learning (RL)

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

no code implementations 20 Dec 2022 Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.

Dialog Act Classification Question Answering +4

Systematic Analysis for Pretrained Language Model Priming for Parameter-Efficient Fine-tuning

no code implementations 2 Dec 2022 Shih-Cheng Huang, Shih-Heng Wang, Min-Han Shih, Saurav Sahay, Hung-Yi Lee

To tackle these issues, we propose a general PE priming framework to enhance and explore the few-shot adaptation and generalization ability of PE methods.

Domain Generalization Language Modelling

CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models

no code implementations 1 Dec 2022 Zih-Ching Chen, Yu-Shun Sung, Hung-Yi Lee

However, such efficient tuning techniques only provide adaptation at the transformer layers and fail to adapt the feature extractor.

Self-Supervised Learning

EURO: ESPnet Unsupervised ASR Open-source Toolkit

1 code implementation 30 Nov 2022 Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-Yi Lee, Shinji Watanabe, Sanjeev Khudanpur

This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Model Extraction Attack against Self-supervised Speech Models

no code implementations 29 Nov 2022 Tsu-Yuan Hsu, Chen-An Li, Tung-Yu Wu, Hung-Yi Lee

In the first stage, SSL is conducted on the large-scale unlabeled corpus to pre-train a small speech model.

Model extraction Self-Supervised Learning

MelHuBERT: A simplified HuBERT on Mel spectrograms

1 code implementation 17 Nov 2022 Tzu-Quan Lin, Hung-Yi Lee, Hao Tang

Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks.

Automatic Speech Recognition Self-Supervised Learning +3

Compressing Transformer-based self-supervised models for speech processing

1 code implementation 17 Nov 2022 Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-Yi Lee, Hao Tang

Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices.

Knowledge Distillation Model Compression +1

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

no code implementations 6 Nov 2022 Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-Yi Lee

To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Once-for-All Sequence Compression for Self-Supervised Speech Models

no code implementations 4 Nov 2022 Hsuan-Jui Chen, Yen Meng, Hung-Yi Lee

The sequence length along the time axis is often the dominant factor of the computation in speech processing.

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

no code implementations 2 Nov 2022 Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.

Image Retrieval Text Retrieval

T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

1 code implementation 1 Nov 2022 Chan-Jan Hsu, Ho-Lam Chung, Hung-Yi Lee, Yu Tsao

In spoken language understanding (SLU), a natural solution is concatenating pre-trained speech models (e.g. HuBERT) and pretrained language models (PLM, e.g. T5).

Language Modelling Question Answering +1

Multimodal Transformer Distillation for Audio-Visual Synchronization

2 code implementations 27 Oct 2022 Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-Yi Lee, Jyh-Shing Roger Jang

This paper proposed an MTDVocaLiST model, which is trained by our proposed multimodal Transformer distillation (MTD) loss.

Audio-Visual Synchronization

On Compressing Sequences for Self-Supervised Speech Models

no code implementations 13 Oct 2022 Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning

Are Synonym Substitution Attacks Really Synonym Substitution Attacks?

no code implementations 6 Oct 2022 Cheng-Han Chiang, Hung-Yi Lee

In this paper, we explore the following question: Are synonym substitution attacks really synonym substitution attacks (SSAs)?

Sentence

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

1 code implementation 3 Oct 2022 Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.

Language Modelling Text Retrieval

The Efficacy of Self-Supervised Speech Models for Audio Representations

1 code implementation 26 Sep 2022 Tung-Yu Wu, Chen-An Li, Tzu-Han Lin, Tsu-Yuan Hsu, Hung-Yi Lee

Extensive experiments on speech and non-speech audio datasets are conducted to investigate the representation abilities of our ensemble method and its single constituent model.

Pitch Classification Representation Learning +2

Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding

no code implementations 27 Jun 2022 Wei-Ping Huang, Po-Chun Chen, Sung-Feng Huang, Hung-Yi Lee

This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech (TTS) problem under the few-shot setting.

Few-Shot Learning Transfer Learning

Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion

no code implementations 18 Jun 2022 Haibin Wu, Jiawen Kang, Lingwei Meng, Yang Zhang, Xixin Wu, Zhiyong Wu, Hung-Yi Lee, Helen Meng

However, previous works show that state-of-the-art ASV models are seriously vulnerable to voice spoofing attacks, and the recently proposed high-performance spoofing countermeasure (CM) models focus solely on standalone anti-spoofing tasks while ignoring the subsequent speaker verification process.

Open-Ended Question Answering Speaker Verification

Searching for the Essence of Adversarial Perturbations

no code implementations 30 May 2022 Dennis Y. Menn, Tzu-hsun Feng, Hung-Yi Lee

Neural networks have demonstrated state-of-the-art performance in various machine learning fields.

Autonomous Driving

Structured Prompt Tuning

no code implementations 24 May 2022 Chi-Liang Liu, Hung-Yi Lee, Wen-tau Yih

We propose structured prompt tuning, a simple and effective method to improve prompt tuning.

Self-Supervised Speech Representation Learning: A Review

no code implementations 21 May 2022 Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

no code implementations ACL 2022 Chan-Jan Hsu, Hung-Yi Lee, Yu Tsao

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks.

Natural Language Understanding

Re-Examining Human Annotations for Interpretable NLP

no code implementations10 Apr 2022 Cheng-Han Chiang, Hung-Yi Lee

Our results reveal that annotation quality depends heavily on the workers' qualifications, and that workers can be guided to provide certain annotations by the instructions.

Understanding, Detecting, and Separating Out-of-Distribution Samples and Adversarial Samples in Text Classification

no code implementations9 Apr 2022 Cheng-Han Chiang, Hung-Yi Lee

Based on our observations, we propose a simple method to separate ID, OOD, and Adv samples using the hidden representations and output probabilities of the model.

text-classification Text Classification

SpeechPrompt: An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks

1 code implementation31 Mar 2022 Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee

We report in this paper the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).

Language Modelling Self-Supervised Learning
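A minimal sketch of the prompt-tuning idea the abstract refers to: a small set of trainable prompt vectors is prepended to the input embeddings of a frozen backbone, so only the prompt is updated for each downstream task. The class and parameter names here are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    """Prepend trainable prompt vectors to a frozen backbone's input embeddings."""

    def __init__(self, backbone, embed_dim, prompt_len=10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # backbone stays frozen; only the prompt trains
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, token_embeds):  # token_embeds: (batch, seq, embed_dim)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # concatenate prompt in front of the input sequence
        return self.backbone(torch.cat([prompt, token_embeds], dim=1))
```

Because the backbone is frozen, one shared model can serve many tasks with only a tiny per-task prompt stored.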

Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation

no code implementations30 Mar 2022 Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-Yi Lee

Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.

Domain Adaptation

Spoofing-Aware Speaker Verification by Multi-Level Fusion

no code implementations29 Mar 2022 Haibin Wu, Lingwei Meng, Jiawen Kang, Jinchao Li, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

In the second-level fusion, the CM score and ASV scores directly from ASV systems will be concatenated into a prediction block for the final decision.

Speaker Verification
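The second-level fusion described above can be sketched as a linear layer over the concatenated score block; the function and its weights below are illustrative stand-ins for the paper's learned prediction block.

```python
import numpy as np

def fuse_scores(asv_scores, cm_score, weights, bias):
    """Second-level fusion sketch: concatenate the ASV systems' scores with the
    countermeasure (CM) score and pass the block through a linear prediction
    layer to produce the final decision score (illustrative)."""
    block = np.concatenate([np.atleast_1d(np.asarray(asv_scores, dtype=float)),
                            [float(cm_score)]])
    return float(weights @ block + bias)
```

In practice the weights would be trained on development data rather than set by hand.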

Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition

2 code implementations27 Mar 2022 Guan-Ting Lin, Shang-Wen Li, Hung-Yi Lee

Although deep learning-based end-to-end Automatic Speech Recognition (ASR) has shown remarkable performance in recent years, it suffers severe performance regression on test samples drawn from different data distributions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
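A hedged sketch of single-utterance test-time adaptation: the model's parameters are nudged on one unlabeled test utterance by minimizing the entropy of its own frame-wise predictions. The exact objective and which parameters the cited paper updates differ; everything below is illustrative.

```python
import torch
import torch.nn.functional as F

def single_utterance_adapt(model, utterance, steps=3, lr=1e-4):
    """Test-time adaptation sketch: for a single unlabeled utterance, take a few
    gradient steps that minimize the entropy of the model's output distribution,
    encouraging confident predictions under distribution shift (illustrative)."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(utterance), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return model
```

Adaptation is source-free: no training data or labels are needed at test time, only the incoming utterance.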

Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation

1 code implementation22 Mar 2022 Chih-Chiang Chang, Hung-Yi Lee

Simultaneous speech translation (SimulST) is a challenging task aiming to translate streaming speech before the complete input is observed.

Translation
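The continuous integrate-and-fire (CIF) mechanism named in the title can be sketched as a running accumulator over per-frame weights that "fires" an output boundary whenever it crosses a threshold; the real mechanism also splits the crossing frame's weight between adjacent outputs, which this toy version omits.

```python
def cif_fire_points(alphas, threshold=1.0):
    """Continuous integrate-and-fire sketch: accumulate per-frame weights and
    emit ('fire') an output boundary each time the accumulator crosses the
    threshold (illustrative; real CIF also splits the crossing frame's weight)."""
    acc, fires = 0.0, []
    for t, a in enumerate(alphas):
        acc += a
        while acc >= threshold:
            fires.append(t)   # an output token is emitted at frame t
            acc -= threshold
    return fires
```

Because firing depends only on the weights seen so far, the policy adapts to the input and suits simultaneous (streaming) translation.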

Anticipation-Free Training for Simultaneous Machine Translation

1 code implementation IWSLT (ACL) 2022 Chih-Chiang Chang, Shun-Po Chuang, Hung-Yi Lee

Existing methods increase latency or introduce adaptive read-write policies for SimulMT models to handle local reordering and improve translation quality.

Hallucination Machine Translation +2

Membership Inference Attacks Against Self-supervised Speech Models

1 code implementation9 Nov 2021 Wei-Cheng Tseng, Wei-Tsung Kao, Hung-Yi Lee

Recently, adapting the idea of self-supervised learning (SSL) on continuous speech has started gaining attention.

Self-Supervised Learning

Characterizing the adversarial vulnerability of speech self-supervised learning

no code implementations8 Nov 2021 Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

As the paradigm of a self-supervised learning upstream model followed by downstream tasks attracts more attention in the speech community, characterizing the adversarial robustness of this paradigm is a high priority.

Adversarial Robustness Benchmarking +3

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

1 code implementation7 Nov 2021 Sung-Feng Huang, Chyi-Jiunn Lin, Da-Rong Liu, Yi-Chen Chen, Hung-Yi Lee

On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples.

Meta-Learning Speech Synthesis

Don't speak too fast: The impact of data bias on self-supervised speech models

no code implementations15 Oct 2021 Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-Yi Lee

Self-supervised Speech Models (S3Ms) have been proven successful in many speech downstream tasks, like ASR.

Toward Degradation-Robust Voice Conversion

1 code implementation14 Oct 2021 Chien-yu Huang, Kai-Wei Chang, Hung-Yi Lee

However, in real-world scenarios, it is difficult to collect clean utterances of a speaker, and they are usually degraded by noises or reverberations.

Denoising Speech Enhancement +1

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

2 code implementations12 Oct 2021 Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting.

Benchmarking Voice Conversion

CheerBots: Chatbots toward Empathy and Emotion using Reinforcement Learning

no code implementations8 Oct 2021 Jiun-Hao Jhan, Chao-Peng Liu, Shyh-Kang Jeng, Hung-Yi Lee

Apart from the coherence and fluency of responses, an empathetic chatbot places more emphasis on people's feelings.

Chatbot reinforcement-learning +2

Analyzing the Robustness of Unsupervised Speech Recognition

no code implementations7 Oct 2021 Guan-Ting Lin, Chan-Jan Hsu, Da-Rong Liu, Hung-Yi Lee, Yu Tsao

In this work, we further analyze the training robustness of unsupervised ASR on the domain mismatch scenarios in which the domains of unpaired speech and text are different.

Generative Adversarial Network speech-recognition +2

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT

1 code implementation5 Oct 2021 Heng-Jui Chang, Shu-wen Yang, Hung-Yi Lee

Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks.

Multi-Task Learning Representation Learning +1
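The layer-wise distillation in the title can be sketched as a small student whose hidden states are mapped by separate prediction heads onto several chosen teacher layers; the loss mixing L1 distance with a cosine-similarity term below follows the general recipe but the exact weighting is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def layerwise_distill_loss(student_hidden, teacher_layers, heads, lam=1.0):
    """Layer-wise distillation sketch: one prediction head per chosen teacher
    layer; each head's output is pulled toward that layer's hidden states with
    an L1 term plus a cosine-similarity term (illustrative weighting)."""
    loss = 0.0
    for head, target in zip(heads, teacher_layers):
        pred = head(student_hidden)                     # (batch, time, dim)
        l1 = F.l1_loss(pred, target)
        cos = F.cosine_similarity(pred, target, dim=-1).mean()
        # higher cosine similarity -> smaller loss contribution
        loss = loss + l1 - lam * torch.log(torch.sigmoid(cos))
    return loss
```

After distillation the heads are discarded and only the compact student is kept for downstream tasks.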

On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets

1 code implementation8 Sep 2021 Cheng-Han Chiang, Hung-Yi Lee

In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks.

Parallelized Reverse Curriculum Generation

no code implementations4 Aug 2021 Zih-Yun Chiu, Yi-Lin Tuan, Hung-Yi Lee, Li-Chen Fu

For reinforcement learning (RL), it is challenging for an agent to master a task that requires a specific series of actions due to sparse rewards.

Reinforcement Learning (RL)

Voting for the right answer: Adversarial defense for speaker verification

1 code implementation15 Jun 2021 Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-Yi Lee

Automatic speaker verification (ASV) is a well-developed technology for biometric identification and has been ubiquitously implemented in security-critical applications, such as banking and access control.

Adversarial Defense Speaker Verification

Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation

1 code implementation Findings (ACL) 2021 Shun-Po Chuang, Yung-Sung Chuang, Chih-Chiang Chang, Hung-Yi Lee

We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC), and use CTC-based automatic speech recognition as an auxiliary task to improve the performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Improving Cross-Lingual Reading Comprehension with Self-Training

no code implementations8 May 2021 Wei-Cheng Huang, Chien-yu Huang, Hung-Yi Lee

Substantial improvements have been made in machine reading comprehension, where the machine answers questions based on a given context.

Machine Reading Comprehension

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

3 code implementations7 Apr 2021 Jheng-Hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-Yi Lee

AUTOVC uses d-vectors to extract speaker information, while FragmentVC uses self-supervised learning (SSL) features such as wav2vec 2.0 to extract phonetic content information.

Self-Supervised Learning Voice Conversion

Towards Lifelong Learning of End-to-end ASR

no code implementations4 Apr 2021 Heng-Jui Chang, Hung-Yi Lee, Lin-shan Lee

We can collect new data describing the new environment and fine-tune the system, but this naturally leads to higher error rates for the earlier datasets, referred to as catastrophic forgetting.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines

1 code implementation31 Mar 2021 Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-Yi Lee, Lei Xie

Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task.

AutoML BIG-bench Machine Learning +1

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

no code implementations12 Mar 2021 Wei-Tsung Kao, Hung-Yi Lee

This paper investigates whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.

General Classification text-classification +1

Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

1 code implementation6 Mar 2021 Chung-Ming Chien, Jheng-Hao Lin, Chien-yu Huang, Po-chun Hsu, Hung-Yi Lee

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples.

Voice Cloning Voice Conversion

Pre-Training a Language Model Without Human Language

no code implementations22 Dec 2020 Cheng-Han Chiang, Hung-Yi Lee

In this paper, we study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.

Language Modelling

TaylorGAN: Neighbor-Augmented Policy Update Towards Sample-Efficient Natural Language Generation

1 code implementation NeurIPS 2020 Chun-Hsing Lin, Siang-Ruei Wu, Hung-Yi Lee, Yun-Nung Chen

Score function-based natural language generation (NLG) approaches such as REINFORCE, in general, suffer from low sample efficiency and training instability problems.

Diversity Text Generation

TaylorGAN: Neighbor-Augmented Policy Update for Sample-Efficient Natural Language Generation

1 code implementation27 Nov 2020 Chun-Hsing Lin, Siang-Ruei Wu, Hung-Yi Lee, Yun-Nung Chen

Score function-based natural language generation (NLG) approaches such as REINFORCE, in general, suffer from low sample efficiency and training instability problems.

Diversity Text Generation

How Far Are We from Robust Voice Conversion: A Survey

no code implementations24 Nov 2020 Tzu-Hsien Huang, Jheng-Hao Lin, Chien-yu Huang, Hung-Yi Lee

Voice conversion technologies have been greatly improved in recent years with the help of deep learning, but their capabilities of producing natural sounding utterances in different conditions remain unclear.

Speaker Identification Voice Conversion

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

1 code implementation12 Nov 2020 Chung-Ming Chien, Hung-Yi Lee

Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks.

Speech Synthesis

AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization

1 code implementation31 Oct 2020 Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, Hung-Yi Lee

With a proper activation as an information bottleneck on content embeddings, the trade-off between the synthesis quality and the speaker similarity of the converted speech is improved drastically.

Audio and Speech Processing Sound
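The adaptive instance normalization named in the title can be sketched as follows: content features are normalized per channel, then re-scaled with the channel-wise statistics of the target speaker's features. The tensor layout and epsilon below are assumptions, not taken from the paper's code.

```python
import torch

def adaptive_instance_norm(content, speaker, eps=1e-5):
    """AdaIN sketch for one-shot voice conversion: strip the content features'
    own channel statistics, then impose the speaker features' statistics
    (illustrative; tensors shaped (batch, channels, time))."""
    c_mean = content.mean(dim=-1, keepdim=True)
    c_std = content.std(dim=-1, keepdim=True) + eps
    s_mean = speaker.mean(dim=-1, keepdim=True)
    s_std = speaker.std(dim=-1, keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```

Normalizing away the source statistics acts as the information bottleneck on content embeddings that the abstract refers to.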

Stabilizing Label Assignment for Speech Separation by Self-supervised Pre-training

1 code implementation29 Oct 2020 Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-Yi Lee

Speech separation has been well developed with the very successful permutation invariant training (PIT) approach, although frequent label-assignment switching during PIT training remains a problem when better convergence speed and achievable performance are desired.

Ranked #6 on Speech Separation on Libri2Mix (using extra training data)

Speaker Separation Speech Enhancement +1
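Permutation invariant training, the approach this entry builds on, can be sketched as evaluating the separation loss under every assignment of estimated sources to targets and keeping the minimum; the MSE criterion below is a simplification of the scale-invariant objectives typically used.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_mse_loss(estimates, targets):
    """PIT sketch: compute the loss under every source permutation and keep the
    minimum per example, so the network may output sources in any order
    (illustrative; tensors shaped (batch, sources, time))."""
    n_src = targets.size(1)
    losses = []
    for perm in itertools.permutations(range(n_src)):
        permuted = estimates[:, list(perm)]
        per_example = F.mse_loss(permuted, targets, reduction="none").mean(dim=(1, 2))
        losses.append(per_example)
    # best permutation chosen independently for each example in the batch
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```

The label-assignment switching mentioned in the abstract arises exactly because the minimizing permutation can change from step to step during training.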