Search Results for author: Hung-Yi Lee

Found 229 papers, 91 papers with code

EURO: ESPnet Unsupervised ASR Open-source Toolkit

1 code implementation • 30 Nov 2022 • Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-Yi Lee, Shinji Watanabe, Sanjeev Khudanpur

This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR).

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders

7 code implementations • 25 Oct 2019 • Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, Hung-Yi Lee

We present Mockingjay as a new speech representation learning approach, where bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech.

General Classification • Representation Learning +3
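
As a rough illustration of the pre-training recipe described above, the sketch below masks random frames of a mel-spectrogram, encodes them with a bidirectional Transformer encoder, and reconstructs the masked frames. It is a minimal sketch in the spirit of Mockingjay, not the released implementation; the model sizes, the 15% masking rate, and the L1 loss are illustrative assumptions.

```python
# Minimal masked-reconstruction pre-training sketch (assumptions noted above).
import torch
import torch.nn as nn

class MaskedSpeechEncoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_layers=3, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_mels)   # reconstruction head

    def forward(self, mel):                          # mel: (batch, time, n_mels)
        mask = torch.rand(mel.shape[:2], device=mel.device) < 0.15
        masked = mel.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out chosen frames
        recon = self.out_proj(self.encoder(self.in_proj(masked)))
        return (recon - mel).abs()[mask].mean()      # L1 loss on masked frames only

model = MaskedSpeechEncoder()
loss = model(torch.randn(4, 200, 80))                # toy batch of mel-spectrograms
loss.backward()
```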

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

5 code implementations • 5 Jun 2020 • Haibin Wu, Andy T. Liu, Hung-Yi Lee

To explore this issue, we propose to employ Mockingjay, a self-supervised learning based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.

Self-Supervised Learning • Speaker Verification +1

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

3 code implementations • 7 Apr 2021 • Jheng-Hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-Yi Lee

AUTOVC uses d-vectors to extract speaker information, while FragmentVC uses self-supervised learning (SSL) features like wav2vec 2.0 to extract the phonetic content information.

Self-Supervised Learning • Voice Conversion

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT

1 code implementation • 5 Oct 2021 • Heng-Jui Chang, Shu-wen Yang, Hung-Yi Lee

Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks.

Multi-Task Learning • Representation Learning
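
A compact sketch of the layer-wise distillation idea: a small student predicts several hidden layers of a frozen teacher through separate prediction heads, combining an L1 term with a cosine-similarity term. The dimensions, choice of teacher layers, and loss weighting below are assumptions for illustration, not the released recipe.

```python
# Layer-wise distillation loss sketch (illustrative, see caveats above).
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_feat, teacher_layers, heads, lam=1.0):
    """student_feat: (B, T, D); teacher_layers: list of (B, T, D) targets."""
    loss = 0.0
    for head, target in zip(heads, teacher_layers):
        pred = head(student_feat)                              # per-layer head
        cos = F.cosine_similarity(pred, target, dim=-1).mean()
        loss = loss + F.l1_loss(pred, target) - lam * torch.log(torch.sigmoid(cos))
    return loss

heads = [torch.nn.Linear(256, 256) for _ in range(3)]   # e.g. 3 teacher layers
student = torch.randn(2, 100, 256)
teacher = [torch.randn(2, 100, 256) for _ in range(3)]
print(layerwise_distill_loss(student, teacher, heads))
```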

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

2 code implementations • 12 Oct 2021 • Wen-Chin Huang, Shu-wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting.

Benchmarking • Voice Conversion

Self-supervised Representation Learning for Speech Processing

1 code implementation • NAACL (ACL) 2022 • Hung-Yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, Katrin Kirchhoff

Due to the growing popularity of SSL, and the shared mission of the areas in bringing speech and language technologies to more use cases with better quality and scaling the technologies for under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievement in speech processing.

Representation Learning

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

1 code implementation • 12 Nov 2020 • Chung-Ming Chien, Hung-Yi Lee

Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks.

Speech Synthesis

Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

1 code implementation • 6 Mar 2021 • Chung-Ming Chien, Jheng-Hao Lin, Chien-yu Huang, Po-chun Hsu, Hung-Yi Lee

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples.

Voice Cloning • Voice Conversion

Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding

1 code implementation • 28 Oct 2019 • Alexander H. Liu, Tzu-Wei Sung, Shun-Po Chuang, Hung-Yi Lee, Lin-shan Lee

This allows the decoder to consider the semantic consistency during decoding by absorbing the information carried by the transformed decoder feature, which is learned to be close to the target word embedding.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Temporal Pattern Attention for Multivariate Time Series Forecasting

4 code implementations • 12 Sep 2018 • Shun-Yao Shih, Fan-Keng Sun, Hung-Yi Lee

To obtain accurate prediction, it is crucial to model long-term dependency in time series data, which can be achieved to some good extent by recurrent neural network (RNN) with attention mechanism.

Multivariate Time Series Forecasting • Time Series +1
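
The RNN-plus-attention recipe can be sketched as below: an LSTM encodes the multivariate series, and attention over its past hidden states feeds the forecast head. This is a simplified illustration only; the paper's temporal pattern attention additionally applies CNN filters across time, which is omitted here, and all sizes are made up.

```python
# Simplified RNN-with-attention forecaster (the paper's CNN filters are omitted).
import torch
import torch.nn as nn

class AttnForecaster(nn.Module):
    def __init__(self, n_series, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_series, hidden, batch_first=True)
        self.score = nn.Linear(hidden, hidden, bias=False)
        self.head = nn.Linear(2 * hidden, n_series)      # next-step forecast

    def forward(self, x):                                # x: (batch, time, n_series)
        states, _ = self.rnn(x)                          # (batch, time, hidden)
        query = states[:, -1]                            # last state as the query
        attn = torch.softmax((self.score(states) * query.unsqueeze(1)).sum(-1), dim=1)
        context = (attn.unsqueeze(-1) * states).sum(1)   # attention-weighted history
        return self.head(torch.cat([context, query], dim=-1))

model = AttnForecaster(n_series=8)
pred = model(torch.randn(16, 48, 8))                     # one-step forecast of 8 series
```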

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

11 code implementations • 10 Apr 2019 • Ju-chieh Chou, Cheng-chieh Yeh, Hung-Yi Lee

Recently, voice conversion (VC) without parallel data has been successfully adapted to multi-target scenario in which a single model is trained to convert the input voice to many different speakers.

Voice Conversion

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

4 code implementations • 9 Apr 2018 • Ju-chieh Chou, Cheng-chieh Yeh, Hung-Yi Lee, Lin-shan Lee

The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.

Voice Conversion

Tree Transformer: Integrating Tree Structures into Self-Attention

3 code implementations • IJCNLP 2019 • Yau-Shian Wang, Hung-Yi Lee, Yun-Nung Chen

This paper proposes Tree Transformer, which adds an extra constraint to attention heads of the bidirectional Transformer encoder in order to encourage the attention heads to follow tree structures.

Language Modelling

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

2 code implementations • 27 Oct 2020 • Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-Yi Lee, Lin-shan Lee

Any-to-any voice conversion aims to convert the voice from and to any speakers even unseen during training, which is much more challenging compared to one-to-one or many-to-many tasks, but much more attractive in real-world scenarios.

Disentanglement • Speaker Verification +1

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

1 code implementation • 7 Nov 2021 • Sung-Feng Huang, Chyi-Jiunn Lin, Da-Rong Liu, Yi-Chen Chen, Hung-Yi Lee

On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples.

Meta-Learning • Speech Synthesis

Codec-SUPERB: An In-Depth Analysis of Sound Codec Models

1 code implementation • 20 Feb 2024 • Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu-Hsuan Wang, Kai-Wei Chang, Alexander H. Liu, Hung-Yi Lee

A sound codec's dual role in minimizing data transmission latency and serving as a tokenizer underscores its critical importance.

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

1 code implementation • 28 May 2019 • Andy T. Liu, Po-chun Hsu, Hung-Yi Lee

We found that the proposed encoding method offers automatic extraction of speech content from speaker style, and is sufficient to cover full linguistic content in a given language.

Voice Conversion

AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization

1 code implementation • 31 Oct 2020 • Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, Hung-Yi Lee

With a proper activation as an information bottleneck on content embeddings, the trade-off between the synthesis quality and the speaker similarity of the converted speech is improved drastically.

Audio and Speech Processing • Sound
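
The instance-normalization bottleneck can be illustrated in a few lines: normalizing each channel over time strips per-utterance statistics from the content features, and the removed mean and standard deviation act as a speaker/style code. The function below is a hypothetical sketch of that idea, not the authors' code.

```python
# Instance normalization as an information bottleneck (illustrative sketch).
import torch

def instance_norm_bottleneck(content, eps=1e-5):
    """content: (batch, channels, time) encoder features."""
    mu = content.mean(dim=2, keepdim=True)           # per-channel mean over time
    sigma = content.std(dim=2, keepdim=True) + eps   # per-channel std over time
    normalized = (content - mu) / sigma              # speaker-normalized content
    return normalized, (mu, sigma)                   # stats carry speaker/style info

feats = torch.randn(4, 256, 120)
content, (mu, sigma) = instance_norm_bottleneck(feats)
# A decoder could re-apply a target speaker's (mu, sigma) to convert the voice.
```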

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

1 code implementation • 3 Oct 2022 • Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.

Language Modelling • Retrieval +1

SpeechPrompt: An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks

1 code implementation • 31 Mar 2022 • Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee

We report in this paper the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).

Language Modelling • Self-Supervised Learning

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

1 code implementation • 18 Sep 2023 • Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-Yi Lee

To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark.

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

1 code implementation • 7 Jun 2020 • Da-Yi Wu, Yen-Hao Chen, Hung-Yi Lee

Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another one's while preserving the linguistic content.

Disentanglement • Quantization +1

WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

1 code implementation • 15 May 2020 • Po-chun Hsu, Hung-Yi Lee

Because we design a heavily compressed flow-based model, the proposed model requires far fewer computational resources than other waveform generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveform.

Speech Synthesis • Text-To-Speech Synthesis • Audio and Speech Processing • Sound

Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks

1 code implementation • EMNLP 2018 • Yau-Shian Wang, Hung-Yi Lee

The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output.

Abstractive Text Summarization

Stabilizing Label Assignment for Speech Separation by Self-supervised Pre-training

1 code implementation • 29 Oct 2020 • Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-Yi Lee

Speech separation has been well developed with the very successful permutation invariant training (PIT) approach, although the frequent label assignment switching that happens during PIT training remains a problem when better convergence speed and achievable performance are desired.

Ranked #6 on Speech Separation on Libri2Mix (using extra training data)

Speaker Separation • Speech Enhancement +1
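
For reference, the permutation invariant training objective mentioned above reduces, in the two-speaker case, to taking the cheaper of the two possible speaker-to-output assignments. Below is a minimal sketch with an MSE criterion (real systems typically use a scale-invariant SNR instead):

```python
# Minimal two-speaker PIT loss: evaluate both assignments, keep the cheaper one.
import torch

def pit_mse_loss(est, ref):
    """est, ref: (batch, 2, time) estimated and reference sources."""
    perm_a = ((est - ref) ** 2).mean(dim=(1, 2))                 # identity assignment
    perm_b = ((est - ref.flip(dims=[1])) ** 2).mean(dim=(1, 2))  # swapped assignment
    return torch.minimum(perm_a, perm_b).mean()

est, ref = torch.randn(8, 2, 16000), torch.randn(8, 2, 16000)
print(pit_mse_loss(est, ref))
```

Because the winning permutation can flip between training steps, the label assignment keeps switching; stabilizing that assignment is what the self-supervised pre-training in this paper targets.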

ODSQA: Open-domain Spoken Question Answering Dataset

1 code implementation • 7 Aug 2018 • Chia-Hsuan Lee, Shang-Ming Wang, Huan-Cheng Chang, Hung-Yi Lee

Reading comprehension by machine has been widely studied, but machine comprehension of spoken content is still a less investigated problem.

Data Augmentation • Question Answering +1

Multimodal Transformer Distillation for Audio-Visual Synchronization

2 code implementations • 27 Oct 2022 • Xuanjun Chen, Haibin Wu, Chung-Che Wang, Hung-Yi Lee, Jyh-Shing Roger Jang

This paper proposes the MTDVocaLiST model, which is trained with our proposed multimodal Transformer distillation (MTD) loss.

Audio-Visual Synchronization

MelHuBERT: A simplified HuBERT on Mel spectrograms

1 code implementation • 17 Nov 2022 • Tzu-Quan Lin, Hung-Yi Lee, Hao Tang

Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks.

Automatic Speech Recognition • Self-Supervised Learning +3

DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs

1 code implementation • IJCNLP 2019 • Yi-Lin Tuan, Yun-Nung Chen, Hung-Yi Lee

This paper proposes a new task about how to apply dynamic knowledge graphs in neural conversation models and presents a novel TV series conversation corpus (DyKgChat) for the task.

Benchmarking • Dialogue Generation +1

Adversarial Attacks on Spoofing Countermeasures of automatic speaker verification

1 code implementation • 19 Oct 2019 • Songxiang Liu, Haibin Wu, Hung-Yi Lee, Helen Meng

High-performance spoofing countermeasure systems for automatic speaker verification (ASV) have been proposed in the ASVspoof 2019 challenge.

Speaker Verification

Defending Your Voice: Adversarial Attack on Voice Conversion

1 code implementation • 18 May 2020 • Chien-yu Huang, Yist Y. Lin, Hung-Yi Lee, Lin-shan Lee

We introduce human imperceptible noise into the utterances of a speaker whose voice is to be defended.

Adversarial Attack • Voice Conversion

Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation

1 code implementation • 16 Aug 2018 • Yi-Lin Tuan, Hung-Yi Lee

To stabilize the training of SeqGAN, Monte Carlo tree search (MCTS) or reward at every generation step (REGS) is used to evaluate the goodness of a generated subsequence.

Dialogue Generation

Query-based Attention CNN for Text Similarity Map

2 code implementations • 15 Sep 2017 • Tzu-Chien Liu, Yu-Hsueh Wu, Hung-Yi Lee

This network is composed of a compare mechanism, a two-staged CNN architecture with an attention mechanism, and a prediction layer.

Question Answering • Sentence +1

Learning Chinese Word Representations From Glyphs Of Characters

1 code implementation • EMNLP 2017 • Tzu-Ray Su, Hung-Yi Lee

The character glyph features are directly learned from the bitmaps of characters by convolutional auto-encoder(convAE), and the glyph features improve Chinese word representations which are already enhanced by character embeddings.

TaylorGAN: Neighbor-Augmented Policy Update Towards Sample-Efficient Natural Language Generation

1 code implementation • NeurIPS 2020 • Chun-Hsing Lin, Siang-Ruei Wu, Hung-Yi Lee, Yun-Nung Chen

Score function-based natural language generation (NLG) approaches such as REINFORCE, in general, suffer from low sample efficiency and training instability problems.

Text Generation

TaylorGAN: Neighbor-Augmented Policy Update for Sample-Efficient Natural Language Generation

1 code implementation • 27 Nov 2020 • Chun-Hsing Lin, Siang-Ruei Wu, Hung-Yi Lee, Yun-Nung Chen

Score function-based natural language generation (NLG) approaches such as REINFORCE, in general, suffer from low sample efficiency and training instability problems.

Text Generation

Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition

2 code implementations • 27 Mar 2022 • Guan-Ting Lin, Shang-Wen Li, Hung-Yi Lee

Although deep learning-based end-to-end Automatic Speech Recognition (ASR) has shown remarkable performance in recent years, it suffers severe performance regression on test samples drawn from different data distributions.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +2

Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering

1 code implementation • 16 Apr 2019 • Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, Lin-shan Lee

Substantial effort has been reported based on approaches over spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals.

Clustering • Speech Separation

Interrupted and cascaded permutation invariant training for speech separation

1 code implementation • 28 Oct 2019 • Gene-Ping Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-Yi Lee, Lin-shan Lee

Permutation Invariant Training (PIT) has long been a stepping stone method for training speech separation model in handling the label ambiguity problem.

Speech Separation

Noise Adaptive Speech Enhancement using Domain Adversarial Training

1 code implementation • 19 Jul 2018 • Chien-Feng Liao, Yu Tsao, Hung-Yi Lee, Hsin-Min Wang

The proposed noise adaptive SE system contains an encoder-decoder-based enhancement model and a domain discriminator model.

Sound • Audio and Speech Processing

Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech Translation

1 code implementation • 22 Mar 2022 • Chih-Chiang Chang, Hung-Yi Lee

Simultaneous speech translation (SimulST) is a challenging task aiming to translate streaming speech before the complete input is observed.

Translation

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

2 code implementations • 30 Oct 2018 • Li-Wei Chen, Hung-Yi Lee, Yu Tsao

This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed.

Speech Recognition • Voice Conversion

Order-free Learning Alleviating Exposure Bias in Multi-label Classification

1 code implementation • 8 Sep 2019 • Che-Ping Tsai, Hung-Yi Lee

In this paper, we propose a new framework for MLC which does not rely on a predefined label order and thus alleviates exposure bias.

General Classification • Multi-Label Classification

T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

1 code implementation • 1 Nov 2022 • Chan-Jan Hsu, Ho-Lam Chung, Hung-Yi Lee, Yu Tsao

In spoken language understanding (SLU), a natural solution is concatenating pre-trained speech models (e.g., HuBERT) and pre-trained language models (PLM, e.g., T5).

Language Modelling • Question Answering +1

A Closer Look into Automatic Evaluation Using Large Language Models

1 code implementation • 9 Oct 2023 • Cheng-Han Chiang, Hung-Yi Lee

In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how details of the evaluation process change how well the ratings given by LLMs correlate with human ratings.
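
Purely as an illustration of the correlation analysis involved, one can compare LLM and human ratings of the same texts with standard statistics; the rating values below are made up for the example.

```python
# Correlating LLM ratings with human ratings (toy, made-up numbers).
from scipy.stats import pearsonr, spearmanr

human = [4, 2, 5, 3, 1, 4, 5, 2]   # hypothetical human Likert ratings of eight texts
llm = [4, 3, 5, 3, 2, 4, 4, 2]     # hypothetical LLM ratings of the same texts

print("Pearson r:", pearsonr(human, llm)[0])
print("Spearman rho:", spearmanr(human, llm)[0])
```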

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines

1 code implementation • 31 Mar 2021 • Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-Yi Lee, Lei Xie

Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task.

AutoML • BIG-bench Machine Learning +1

Pretrained Language Model Embryology: The Birth of ALBERT

1 code implementation • EMNLP 2020 • Cheng-Han Chiang, Sung-Feng Huang, Hung-Yi Lee

These findings suggest that knowledge of a pretrained model varies during pretraining, and having more pretrain steps does not necessarily provide a model with more comprehensive knowledge.

Language Modelling • POS +1

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

1 code implementation • 26 Oct 2020 • Cheng-I Lai, Yung-Sung Chuang, Hung-Yi Lee, Shang-Wen Li, James Glass

Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data.

Language Modelling • Spoken Language Understanding

The Efficacy of Self-Supervised Speech Models for Audio Representations

1 code implementation • 26 Sep 2022 • Tung-Yu Wu, Chen-An Li, Tzu-Han Lin, Tsu-Yuan Hsu, Hung-Yi Lee

Extensive experiments on speech and non-speech audio datasets are conducted to investigate the representation abilities of our ensemble method and its single constituent model.

Pitch Classification • Representation Learning +1

J-Net: Randomly weighted U-Net for audio source separation

1 code implementation • 29 Nov 2019 • Bo-Wen Chen, Yen-Min Hsu, Hung-Yi Lee

According to these discoveries, we pose two questions: what is the value of randomly weighted networks in difficult generative audio tasks such as audio source separation and does such positive correlation still exist when it comes to large random networks and their trained counterparts?

Audio Source Separation

Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation

1 code implementation • Findings (ACL) 2021 • Shun-Po Chuang, Yung-Sung Chuang, Chih-Chiang Chang, Hung-Yi Lee

We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC), and use CTC-based automatic speech recognition as an auxiliary task to improve the performance.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +3
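
The connectionist temporal classification objective used for both the translation model and the auxiliary ASR task is available directly in PyTorch; a toy call with assumed shapes looks like this.

```python
# CTC loss over a toy batch; torch.nn.CTCLoss expects (time, batch, classes)
# log-probabilities, with index 0 reserved here for the blank symbol.
import torch

T, B, C = 50, 4, 100                               # frames, batch, vocabulary size
log_probs = torch.randn(T, B, C).log_softmax(-1)
targets = torch.randint(1, C, (B, 12))             # token ids (0 = blank)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

loss = torch.nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```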

Membership Inference Attacks Against Self-supervised Speech Models

1 code implementation • 9 Nov 2021 • Wei-Cheng Tseng, Wei-Tsung Kao, Hung-Yi Lee

Recently, adapting the idea of self-supervised learning (SSL) on continuous speech has started gaining attention.

Self-Supervised Learning

Anticipation-Free Training for Simultaneous Machine Translation

1 code implementation • IWSLT (ACL) 2022 • Chih-Chiang Chang, Shun-Po Chuang, Hung-Yi Lee

Existing methods increase latency or introduce adaptive read-write policies for SimulMT models to handle local reordering and improve translation quality.

Hallucination • Machine Translation +2

Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder

1 code implementation • 3 Mar 2016 • Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, Lin-shan Lee

The vector representations of fixed dimensionality for words (in text) offered by Word2Vec have been shown to be very useful in many application scenarios, in particular due to the semantic information they carry.

Denoising • Dynamic Time Warping
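
A bare-bones sketch of the sequence-to-sequence autoencoder idea: a GRU encoder compresses a variable-length acoustic segment into one fixed-dimensional vector, and a GRU decoder reconstructs the segment from it. The feature dimension (39-dim MFCC frames) and model sizes are assumptions for illustration.

```python
# Sequence-to-sequence autoencoder sketch for fixed-dimensional audio embeddings.
import torch
import torch.nn as nn

class SegmentAutoencoder(nn.Module):
    def __init__(self, n_feats=39, dim=128):          # e.g. 39-dim MFCC frames
        super().__init__()
        self.enc = nn.GRU(n_feats, dim, batch_first=True)
        self.dec = nn.GRU(n_feats, dim, batch_first=True)
        self.out = nn.Linear(dim, n_feats)

    def forward(self, x):                             # x: (batch, time, n_feats)
        _, h = self.enc(x)                            # h: (1, batch, dim) embedding
        y, _ = self.dec(torch.zeros_like(x), h)       # decode from the embedding
        return self.out(y), h.squeeze(0)              # reconstruction, segment vector

model = SegmentAutoencoder()
x = torch.randn(4, 60, 39)
recon, emb = model(x)
loss = ((recon - x) ** 2).mean()                      # reconstruction objective
```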

Voting for the right answer: Adversarial defense for speaker verification

1 code implementation • 15 Jun 2021 • Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-Yi Lee

Automatic speaker verification (ASV) is a well-developed technology for biometric identification and has been ubiquitously implemented in security-critical applications, such as banking and access control.

Adversarial Defense • Speaker Verification

Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries

1 code implementation • 22 Mar 2017 • Yu-Hsuan Wang, Cheng-Tao Chung, Hung-Yi Lee

In this paper we analyze the gate activation signals inside the gated recurrent neural networks, and find the temporal structure of such signals is highly correlated with the phoneme boundaries.
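
To make the analysis concrete, here is one hypothetical way to expose the update-gate signal z_t of a single-layer PyTorch GRU by re-running its cell equations (PyTorch stores gates in reset/update/new order); an analysis like the paper's would then correlate this signal with phoneme boundaries.

```python
# Recompute a single-layer GRU step by step to expose update-gate activations.
import torch

def update_gate_signal(gru, x):
    """gru: nn.GRU(input_size, hidden_size, batch_first=True); x: (B, T, input)."""
    W, U = gru.weight_ih_l0, gru.weight_hh_l0        # (3H, input), (3H, H)
    b_i, b_h = gru.bias_ih_l0, gru.bias_hh_l0
    H = gru.hidden_size
    h = torch.zeros(x.size(0), H)
    zs = []
    for t in range(x.size(1)):
        gi = x[:, t] @ W.t() + b_i                   # input contributions
        gh = h @ U.t() + b_h                         # hidden contributions
        r = torch.sigmoid(gi[:, :H] + gh[:, :H])             # reset gate
        z = torch.sigmoid(gi[:, H:2*H] + gh[:, H:2*H])       # update gate
        n = torch.tanh(gi[:, 2*H:] + r * gh[:, 2*H:])        # candidate state
        h = (1 - z) * n + z * h
        zs.append(z)
    return torch.stack(zs, dim=1)                    # (B, T, H) gate signal

gru = torch.nn.GRU(40, 64, batch_first=True)
signal = update_gate_signal(gru, torch.randn(2, 100, 40))
```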

Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences

1 code implementation • 9 Aug 2018 • Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-Yi Lee, Lin-shan Lee

In this way, the length constraint mentioned above is removed to offer rhythm-flexible voice conversion without requiring parallel data.

Sound • Audio and Speech Processing

How to Estimate Model Transferability of Pre-Trained Speech Models?

1 code implementation • 1 Jun 2023 • Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-Yi Lee, Tara N. Sainath

In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks.

Understanding Self-Attention of Self-Supervised Audio Transformers

2 code implementations • 5 Jun 2020 • Shu-wen Yang, Andy T. Liu, Hung-Yi Lee

Self-supervised Audio Transformers (SAT) enable great success in many downstream speech applications like ASR, but how they work has not been widely explored yet.

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

1 code implementation • 15 Mar 2023 • Yuan Tseng, Cheng-I Lai, Hung-Yi Lee

The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +3

MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models

1 code implementation • 30 May 2023 • Yu-Hsiang Wang, Huang-Yu Chen, Kai-Wei Chang, Winston Hsu, Hung-Yi Lee

In this paper, we introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models with results comparable to SUPERB at significantly lower computational cost.

Self-Supervised Learning

Why We Should Report the Details in Subjective Evaluation of TTS More Rigorously

1 code implementation • 3 Jun 2023 • Cheng-Han Chiang, Wei-Ping Huang, Hung-Yi Lee

This paper emphasizes the importance of reporting experiment details in subjective evaluations and demonstrates how such details can significantly impact evaluation results in the field of speech synthesis.

Speech Synthesis

On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets

1 code implementation • 8 Sep 2021 • Cheng-Han Chiang, Hung-Yi Lee

In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks.

Compressing Transformer-based self-supervised models for speech processing

1 code implementation • 17 Nov 2022 • Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-Yi Lee, Hao Tang

Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices.

Knowledge Distillation • Model Compression +1

Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

1 code implementation • 4 Oct 2023 • Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, Hung-Yi Lee

We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders.

Language Modelling

Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations

1 code implementation • 8 Feb 2024 • Cheng-Han Chiang, Hung-Yi Lee

We show that LLMs can generate paragraphs that contain verifiable facts, but the facts are combined to form a non-factual paragraph due to entity ambiguity.

Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks

no code implementations • 1 Sep 2017 • Chia-Wei Ao, Hung-Yi Lee

Retrieving spoken content with spoken queries, or query-by-example spoken term detection (STD), is attractive because it makes possible the matching of signals directly on the acoustic level without transcribing them into text.

Supervised and Unsupervised Transfer Learning for Question Answering

no code implementations • NAACL 2018 • Yu-An Chung, Hung-Yi Lee, James Glass

Although transfer learning has been shown to be successful for tasks like object and speech recognition, its applicability to question answering (QA) has yet to be well-studied.

Question Answering • speech-recognition +2

Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis

no code implementations • 7 Apr 2018 • Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee

Conventional seq2seq chatbot models only try to find the sentences with the highest probabilities conditioned on the input sequences, without considering the sentiment of the output sentences.

Chatbot • reinforcement-learning +1

Joint Learning of Interactive Spoken Content Retrieval and Trainable User Simulator

no code implementations • 1 Apr 2018 • Pei-Hung Chung, Kuan Tung, Ching-Lun Tai, Hung-Yi Lee

User-machine interaction is crucial for information retrieval, especially for spoken content retrieval, because spoken content is difficult to browse, and speech recognition has a high degree of uncertainty.

Information Retrieval • Q-Learning +3

Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings

no code implementations • 1 Apr 2018 • Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee

Unsupervised discovery of acoustic tokens from audio corpora without annotation and learning vector representations for these tokens have been widely studied.

Generative Adversarial Network

Mitigating the Impact of Speech Recognition Errors on Chatbot using Sequence-to-Sequence Model

no code implementations • 22 Sep 2017 • Pin-Jung Chen, I-Hung Hsu, Yi-Yao Huang, Hung-Yi Lee

We apply sequence-to-sequence model to mitigate the impact of speech recognition errors on open domain end-to-end dialog generation.

Chatbot • Domain Adaptation +2

Order-Preserving Abstractive Summarization for Spoken Content Based on Connectionist Temporal Classification

no code implementations • 16 Sep 2017 • Bo-Ru Lu, Frank Shyu, Yun-Nung Chen, Hung-Yi Lee, Lin-shan Lee

Connectionist temporal classification (CTC) is a powerful approach for sequence-to-sequence learning, and has been popularly used in speech recognition.

Abstractive Text Summarization • General Classification +2

Personalized word representations Carrying Personalized Semantics Learned from Social Network Posts

no code implementations • 29 Oct 2017 • Zih-Wei Lin, Tzu-Wei Sung, Hung-Yi Lee, Lin-shan Lee

In this framework, universal background word vectors are first learned from the background corpora, and then adapted by the personalized corpus for each individual user to learn the personalized word vectors.

Sentence • Sentence Completion

Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data

no code implementations • 19 Jul 2017 • Chia-Hao Shen, Janet Y. Sung, Hung-Yi Lee

We train SA from one language (source language) and use it to extract the vector representation of the audio segments of another language (target language).

Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content

no code implementations • 28 Aug 2016 • Wei Fang, Jui-Yang Hsu, Hung-Yi Lee, Lin-shan Lee

Multimedia or spoken content presents more attractive information than plain text content, but the former is more difficult to display on a screen and be selected by a user.

Reading Comprehension

Abstractive Headline Generation for Spoken Content by Attentive Recurrent Neural Networks with ASR Error Modeling

no code implementations • 26 Dec 2016 • Lang-Chi Yu, Hung-Yi Lee, Lin-shan Lee

In this way, the model for abstractive headline generation for spoken content can be learned from abundant text data and the ASR data for some recognizers.

Abstractive Text Summarization • Document Summarization +1

Interactive Spoken Content Retrieval by Deep Reinforcement Learning

no code implementations • 16 Sep 2016 • Yen-chen Wu, Tzu-Hsiang Lin, Yang-De Chen, Hung-Yi Lee, Lin-shan Lee

In our previous work, some hand-crafted states estimated from the present retrieval results are used to determine the proper actions.

Q-Learning • reinforcement-learning +4

Towards Machine Comprehension of Spoken Content: Initial TOEFL Listening Comprehension Test by Machine

no code implementations • 23 Aug 2016 • Bo-Hsiang Tseng, Sheng-syun Shen, Hung-Yi Lee, Lin-shan Lee

Multimedia or spoken content presents more attractive information than plain text content, but it's more difficult to display on a screen and be selected by a user.

Reading Comprehension • Sentence

An Iterative Deep Learning Framework for Unsupervised Discovery of Speech Features and Linguistic Units with Applications on Spoken Term Detection

no code implementations • 1 Feb 2016 • Cheng-Tao Chung, Cheng-Yu Tsai, Hsiang-Hung Lu, Chia-Hsiang Liu, Hung-Yi Lee, Lin-shan Lee

The multiple sets of token labels are then used as the targets of a Multi-target Deep Neural Network (MDNN) trained on low-level acoustic features.

Towards Structured Deep Neural Network for Automatic Speech Recognition

no code implementations • 8 Nov 2015 • Yi-Hsiu Liao, Hung-Yi Lee, Lin-shan Lee

In this paper we propose the Structured Deep Neural Network (structured DNN) as a structured and deep learning framework.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Towards Structured Deep Neural Network for Automatic Speech Recognition

no code implementations • 3 Jun 2015 • Yi-Hsiu Liao, Hung-Yi Lee, Lin-shan Lee

In this paper we propose the Structured Deep Neural Network (Structured DNN) as a structured and deep learning algorithm, learning to find the best structured object (such as a label sequence) given a structured input (such as a vector sequence) by globally considering the mapping relationships between the structure rather than item by item.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection

no code implementations • 7 Aug 2018 • Yu-Hsuan Wang, Hung-Yi Lee, Lin-shan Lee

In this paper, we extend audio Word2Vec from word-level to utterance-level by proposing a new segmental audio Word2Vec, in which unsupervised spoken word boundary segmentation and audio Word2Vec are jointly learned and mutually enhanced, so an utterance can be directly represented as a sequence of vectors carrying phonetic structure information.

Segmentation

Proximal Policy Optimization and its Dynamic Version for Sequence Generation

no code implementations • 24 Aug 2018 • Yi-Lin Tuan, Jinzhi Zhang, Yujia Li, Hung-Yi Lee

In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning.

Chatbot • Model Optimization +2

Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model

no code implementations • 2 Nov 2018 • Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee

In this paper we propose a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM).

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +2

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

no code implementations • 7 Nov 2018 • Sung-Feng Huang, Yi-Chen Chen, Hung-Yi Lee, Lin-shan Lee

Embedding audio signal segments into vectors with fixed dimensionality is attractive because all following processing will be easier and more efficient, for example modeling, classifying or indexing.

Clustering

Completely Unsupervised Speech Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models

no code implementations • 8 Apr 2019 • Kuan-Yu Chen, Che-Ping Tsai, Da-Rong Liu, Hung-Yi Lee, Lin-shan Lee

Producing a large annotated speech corpus for training ASR systems remains difficult for the more than 95% of the world's languages that are low-resourced, but collecting a relatively large unlabeled data set for such languages is more achievable.

Generative Adversarial Network • speech-recognition +2

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

no code implementations • 10 Apr 2019 • Yi-Chen Chen, Sung-Feng Huang, Hung-Yi Lee, Lin-shan Lee

However, we note human babies start to learn the language by the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" such knowledge to other words without hearing a large amount of data.

speech-recognition • Speech Recognition +1

End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning

no code implementations • 13 Apr 2019 • Tao Tu, Yuan-Jui Chen, Cheng-chieh Yeh, Hung-Yi Lee

In this paper, we aim to build TTS systems for such low-resource (target) languages where only very limited paired data are available.

Cross-Lingual Transfer • Transfer Learning

Cross-Lingual Transfer Learning for Question Answering

no code implementations • 13 Jul 2019 • Chia-Hsuan Lee, Hung-Yi Lee

In this paper, we explore the problem of cross-lingual transfer learning for QA, where a source language task with plentiful annotations is utilized to improve the performance of a QA model on a target language task with limited available annotations.

Cross-Lingual Transfer • Machine Translation +4

SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering

no code implementations • 25 Oct 2019 • Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee, Lin-shan Lee

In addition to the potential of end-to-end SQA, the SpeechBERT can also be considered for many other spoken language understanding tasks just as BERT for many text processing tasks.

Language Modelling • Question Answering +2

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

no code implementations • 28 Oct 2019 • Alexander H. Liu, Tao Tu, Hung-Yi Lee, Lin-shan Lee

In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to phoneme sequences of speech utterances.

Clustering • Quantization +4

Training a code-switching language model with monolingual data

no code implementations • 14 Nov 2019 • Shun-Po Chuang, Tzu-Wei Sung, Hung-Yi Lee

A lack of code-switching data complicates the training of code-switching (CS) language models.

Language Modelling • Translation +1

Towards Robust Neural Vocoding for Speech Generation: A Survey

no code implementations • 5 Dec 2019 • Po-chun Hsu, Chun-hsuan Wang, Andy T. Liu, Hung-Yi Lee

We found out that the speaker variety is much more important for achieving a universal vocoder than the language.

Speech Synthesis • Voice Conversion

MITAS: A Compressed Time-Domain Audio Separation Network with Parameter Sharing

no code implementations • 9 Dec 2019 • Chao-I Tuan, Yuan-Kuei Wu, Hung-Yi Lee, Yu Tsao

Our experimental results first confirmed the robustness of our MiTAS on two types of perturbations in mixed audio.

Speech Separation

BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT

no code implementations • 25 Jan 2020 • Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, Hung-Yi Lee

Although Bidirectional Encoder Representations from Transformers (BERT) has achieved tremendous success in many natural language processing (NLP) tasks, it remains a black box.

Sentence

Defense against adversarial attacks on spoofing countermeasures of ASV

no code implementations • 6 Mar 2020 • Haibin Wu, Songxiang Liu, Helen Meng, Hung-Yi Lee

Various forefront countermeasure methods for automatic speaker verification (ASV) with considerable performance in anti-spoofing are proposed in the ASVspoof 2019 challenge.

Speaker Verification

A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT

no code implementations • 20 Apr 2020 • Chi-Liang Liu, Tsung-Yuan Hsu, Yung-Sung Chuang, Hung-Yi Lee

Recently, multilingual BERT works remarkably well on cross-lingual transfer tasks, superior to static non-contextualized word embeddings.

Cross-Lingual Transfer • Translation +1

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

no code implementations • 5 May 2020 • Heng-Jui Chang, Alexander H. Liu, Hung-Yi Lee, Lin-shan Lee

Whispering is an important mode of human speech, but no end-to-end recognition results for it were reported yet, probably due to the scarcity of available whispered speech data.

speech-recognition • Speech Recognition +1

DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation

no code implementations • 13 May 2020 • Yi-Chen Chen, Jui-Yang Hsu, Cheng-Kuang Lee, Hung-Yi Lee

In order to examine the generalizability of DARTS-ASR, we apply our approach not only on many languages to perform monolingual ASR, but also on a multilingual ASR setting.

speech-recognition • Speech Recognition

Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

no code implementations • 16 May 2020 • Tao Tu, Yuan-Jui Chen, Alexander H. Liu, Hung-Yi Lee

The experiment results demonstrate that with only an hour of paired speech data, whether the paired data is from multiple speakers or a single speaker, the proposed model can generate intelligible speech in different voices.

Speech Synthesis • Text-To-Speech Synthesis

Investigation of Sentiment Controllable Chatbot

no code implementations • 11 Jul 2020 • Hung-Yi Lee, Cheng-Hao Ho, Chien-Fu Lin, Chiung-Chih Chang, Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen

Conventional seq2seq chatbot models attempt only to find sentences with the highest probabilities conditioned on the input sequences, without considering the sentiment of the output sentences.

Chatbot • reinforcement-learning +1

What makes multilingual BERT multilingual?

no code implementations • 20 Oct 2020 • Chi-Liang Liu, Tsung-Yuan Hsu, Yung-Sung Chuang, Hung-Yi Lee

Recently, multilingual BERT works remarkably well on cross-lingual transfer tasks, superior to static non-contextualized word embeddings.

Cross-Lingual Transfer • Word Embeddings

Pre-Training a Language Model Without Human Language

no code implementations • 22 Dec 2020 • Cheng-Han Chiang, Hung-Yi Lee

In this paper, we study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.

Language Modelling

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

no code implementations • 12 Mar 2021 • Wei-Tsung Kao, Hung-Yi Lee

This paper investigates whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.

General Classification • text-classification +1

Towards Lifelong Learning of End-to-end ASR

no code implementations • 4 Apr 2021 • Heng-Jui Chang, Hung-Yi Lee, Lin-shan Lee

We can collect new data describing the new environment and fine-tune the system, but this naturally leads to higher error rates for the earlier datasets, referred to as catastrophic forgetting.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1

How Far Are We from Robust Voice Conversion: A Survey

no code implementations • 24 Nov 2020 • Tzu-Hsien Huang, Jheng-Hao Lin, Chien-yu Huang, Hung-Yi Lee

Voice conversion technologies have been greatly improved in recent years with the help of deep learning, but their capabilities of producing natural sounding utterances in different conditions remain unclear.

Speaker Identification • Voice Conversion

Improving Cross-Lingual Reading Comprehension with Self-Training

no code implementations • 8 May 2021 • Wei-Cheng Huang, Chien-yu Huang, Hung-Yi Lee

Substantial improvements have been made in machine reading comprehension, where the machine answers questions based on a given context.

Machine Reading Comprehension

Parallelized Reverse Curriculum Generation

no code implementations • 4 Aug 2021 • Zih-Yun Chiu, Yi-Lin Tuan, Hung-Yi Lee, Li-Chen Fu

For reinforcement learning (RL), it is challenging for an agent to master a task that requires a specific series of actions due to sparse rewards.

Reinforcement Learning (RL)

Analyzing the Robustness of Unsupervised Speech Recognition

no code implementations • 7 Oct 2021 • Guan-Ting Lin, Chan-Jan Hsu, Da-Rong Liu, Hung-Yi Lee, Yu Tsao

In this work, we further analyze the training robustness of unsupervised ASR on the domain mismatch scenarios in which the domains of unpaired speech and text are different.

Generative Adversarial Network • speech-recognition +2

CheerBots: Chatbots toward Empathy and Emotion using Reinforcement Learning

no code implementations • 8 Oct 2021 • Jiun-Hao Jhan, Chao-Peng Liu, Shyh-Kang Jeng, Hung-Yi Lee

Apart from the coherence and fluency of responses, an empathetic chatbot places more emphasis on people's feelings.

Chatbot • reinforcement-learning +2

Toward Degradation-Robust Voice Conversion

no code implementations • 14 Oct 2021 • Chien-yu Huang, Kai-Wei Chang, Hung-Yi Lee

However, in real-world scenarios, it is difficult to collect clean utterances of a speaker, and they are usually degraded by noises or reverberations.

Denoising • Speech Enhancement +1

Don't speak too fast: The impact of data bias on self-supervised speech models

no code implementations • 15 Oct 2021 • Yen Meng, Yi-Hui Chou, Andy T. Liu, Hung-Yi Lee

Self-supervised Speech Models (S3Ms) have been proven successful in many speech downstream tasks, like ASR.

Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models’ Transferability

no code implementations • Findings (EMNLP) 2021 • Wei-Tsung Kao, Hung-Yi Lee

This paper investigates whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.

text-classification • Text Classification

Characterizing the adversarial vulnerability of speech self-supervised learning

no code implementations • 8 Nov 2021 • Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority.

Adversarial Robustness • Benchmarking +2

TopicGAN: Unsupervised Text Generation from Explainable Latent Topics

no code implementations • 27 Sep 2018 • Yau-Shian Wang, Yun-Nung Chen, Hung-Yi Lee

Learning discrete representations of data and then generating data from the discovered representations have been increasingly studied because the obtained discrete representations can benefit unsupervised learning.

Image Generation • Text Generation

Spoofing-Aware Speaker Verification by Multi-Level Fusion

no code implementations • 29 Mar 2022 • Haibin Wu, Lingwei Meng, Jiawen Kang, Jinchao Li, Xu Li, Xixin Wu, Hung-Yi Lee, Helen Meng

In the second-level fusion, the CM score and ASV scores directly from ASV systems will be concatenated into a prediction block for the final decision.

Speaker Verification

Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation

no code implementations • 30 Mar 2022 • Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-Yi Lee

Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.

Domain Adaptation

Re-Examining Human Annotations for Interpretable NLP

no code implementations • 10 Apr 2022 • Cheng-Han Chiang, Hung-Yi Lee

Our results reveal that the annotation quality is highly subject to the workers' qualification, and workers can be guided to provide certain annotations by the instructions.

Understanding, Detecting, and Separating Out-of-Distribution Samples and Adversarial Samples in Text Classification

no code implementations • 9 Apr 2022 • Cheng-Han Chiang, Hung-Yi Lee

Based on our observations, we propose a simple method to separate ID, OOD, and Adv samples using the hidden representations and output probabilities of the model.

text-classification • Text Classification

XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

no code implementations • ACL 2022 • Chan-Jan Hsu, Hung-Yi Lee, Yu Tsao

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks.

Natural Language Understanding

Self-Supervised Speech Representation Learning: A Review

no code implementations • 21 May 2022 • Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +3

Structured Prompt Tuning

no code implementations • 24 May 2022 • Chi-Liang Liu, Hung-Yi Lee, Wen-tau Yih

We propose structured prompt tuning, a simple and effective method to improve prompt tuning.

Searching for the Essence of Adversarial Perturbations

no code implementations • 30 May 2022 • Dennis Y. Menn, Tzu-hsun Feng, Hung-Yi Lee

Neural networks have demonstrated state-of-the-art performance in various machine learning fields.

Autonomous Driving

Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion

no code implementations • 18 Jun 2022 • Haibin Wu, Jiawen Kang, Lingwei Meng, Yang Zhang, Xixin Wu, Zhiyong Wu, Hung-Yi Lee, Helen Meng

However, previous works show that state-of-the-art ASV models are seriously vulnerable to voice spoofing attacks, and the recently proposed high-performance spoofing countermeasure (CM) models focus solely on the standalone anti-spoofing task, ignoring the subsequent speaker verification process.

Open-Ended Question Answering • Speaker Verification

Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding

no code implementations • 27 Jun 2022 • Wei-Ping Huang, Po-Chun Chen, Sung-Feng Huang, Hung-Yi Lee

This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech (TTS) problem under the few-shot setting.

Few-Shot Learning • Transfer Learning

Are Synonym Substitution Attacks Really Synonym Substitution Attacks?

no code implementations • 6 Oct 2022 • Cheng-Han Chiang, Hung-Yi Lee

In this paper, we explore the following question: Are synonym substitution attacks really synonym substitution attacks (SSAs)?

Sentence

On Compressing Sequences for Self-Supervised Speech Models

no code implementations • 13 Oct 2022 • Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning
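
One of the simplest forms of the subsampling discussed above is fixed-rate average pooling along the time axis; the toy snippet below halves the frame rate of a feature sequence (the shapes are illustrative assumptions).

```python
# Halve the frame rate of (batch, time, dim) features by average pooling in time.
import torch
import torch.nn.functional as F

feats = torch.randn(4, 100, 256)                         # 100 frames, 256-dim each
pooled = F.avg_pool1d(feats.transpose(1, 2), kernel_size=2, stride=2)
pooled = pooled.transpose(1, 2)                          # -> (4, 50, 256)
```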

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

no code implementations • 2 Nov 2022 • Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.

Image Retrieval • Retrieval +1

Once-for-All Sequence Compression for Self-Supervised Speech Models

no code implementations • 4 Nov 2022 • Hsuan-Jui Chen, Yen Meng, Hung-Yi Lee

The sequence length along the time axis is often the dominant factor of the computation in speech processing.

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

no code implementations • 6 Nov 2022 • Jiatong Shi, Chan-Jan Hsu, Holam Chung, Dongji Gao, Paola Garcia, Shinji Watanabe, Ann Lee, Hung-Yi Lee

To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +3

Model Extraction Attack against Self-supervised Speech Models

no code implementations • 29 Nov 2022 • Tsu-Yuan Hsu, Chen-An Li, Tung-Yu Wu, Hung-Yi Lee

In the first stage, SSL is conducted on the large-scale unlabeled corpus to pre-train a small speech model.

Model extraction • Self-Supervised Learning

General Framework for Self-Supervised Model Priming for Parameter-Efficient Fine-tuning

no code implementations • 2 Dec 2022 • Shih-Cheng Huang, Shih-Heng Wang, Min-Han Shih, Saurav Sahay, Hung-Yi Lee

To tackle these issues, we propose a general framework to enhance the few-shot adaptation and cross-domain generalization ability of parameter-efficient methods.

Domain Generalization

CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models

no code implementations • 1 Dec 2022 • Zih-Ching Chen, Yu-Shun Sung, Hung-Yi Lee

However, such efficient tuning techniques only provide adaptation at the Transformer layers and fail to perform adaptation at the feature extractor.

Self-Supervised Learning

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

no code implementations • 20 Dec 2022 • Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape.

Dialog Act Classification • Question Answering +4

Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs

no code implementations • 30 Jan 2023 • Guan-Ting Liu, En-Pei Hu, Pu-Jen Cheng, Hung-Yi Lee, Shao-Hua Sun

Aiming to produce reinforcement learning (RL) policies that are human-interpretable and can generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space to continuously parameterize diverse programs from a pre-generated program dataset, and then searches for a task-solving program in the learned program embedding space when given a task.

reinforcement-learning • Reinforcement Learning (RL)

Ensemble knowledge distillation of self-supervised speech models

no code implementations • 24 Feb 2023 • Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen, Wei-Cheng Tseng, Kai-Wei Chang, Hung-Yi Lee

We tried two different aggregation techniques, layerwise-average and layerwise-concatenation, to the representations of different teacher models and found that the former was more effective.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +4

SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks

no code implementations • 1 Mar 2023 • Kai-Wei Chang, Yu-Kai Wang, Hua Shen, Iu-thing Kang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee

For speech processing, SpeechPrompt shows its high parameter efficiency and competitive performance on a few speech classification tasks.

Ranked #17 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)

Classification • Language Modelling +1

Can Large Language Models Be an Alternative to Human Evaluations?

no code implementations • 3 May 2023 • Cheng-Han Chiang, Hung-Yi Lee

We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs.

Story Generation

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

no code implementations • 18 May 2023 • Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.

Automatic Speech Recognition • Language Identification +3
