1 code implementation • NAACL (ACL) 2022 • Hung-Yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, Katrin Kirchhoff
Due to the growing popularity of SSL, and the shared mission of the areas in bringing speech and language technologies to more use cases with better quality and scaling the technologies for under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievement in speech processing.
no code implementations • 23 Aug 2024 • Zhenyu Wang, Li Wan, Biqiao Zhang, Yiteng Huang, Shang-Wen Li, Ming Sun, Xin Lei, Zhaojun Yang
A keyword spotting (KWS) engine that is continuously running on device is exposed to various speech signals that are usually unseen before.
no code implementations • 23 Aug 2024 • Kai-Wei Chang, Haibin Wu, Yu-Kai Wang, Yuan-Kuei Wu, Hua Shen, Wei-Cheng Tseng, Iu-thing Kang, Shang-Wen Li, Hung-Yi Lee
Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing.
no code implementations • 26 Apr 2024 • Vasu Sharma, Karthik Padthe, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer
In recent times, training Language Models (LMs) has relied on computationally heavy training over massive datasets, which makes this training process extremely laborious.
1 code implementation • CVPR 2024 • Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-tau Yih, Hu Xu
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data.
1 code implementation • 15 Apr 2024 • Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-Yi Lee
In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech.
1 code implementation • 25 Mar 2024 • Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.
no code implementations • 24 Jan 2024 • Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee
However, the real-world problem of Open-domain SQA (openSQA), in which the machine must additionally first retrieve passages that possibly contain the answer from a spoken archive, has never been considered.
no code implementations • 15 Dec 2023 • Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, Hung-Yi Lee
Furthermore, the GSQA model has only been fine-tuned on the spoken extractive QA dataset.
no code implementations • 2 Nov 2023 • Ching-Feng Yeh, Po-Yao Huang, Vasu Sharma, Shang-Wen Li, Gargi Ghosh
We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently and effectively learns aligned audio and language representations through masking, contrastive learning and reconstruction.
no code implementations • 19 Oct 2023 • Ming-Hao Hsu, Kai-Wei Chang, Shang-Wen Li, Hung-Yi Lee
Despite the success of ICL in NLP, little work has explored the possibility of ICL in speech processing.
1 code implementation • 16 Oct 2023 • Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li, Alan W Black, Gopala K. Anumanchipalli
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing.
no code implementations • 9 Oct 2023 • Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chung, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe
The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.
no code implementations • 4 Oct 2023 • Kai-Wei Chang, Ming-Hsin Chen, Yun-Ping Lin, Jing Neng Hsu, Paul Kuo-Ming Huang, Chien-yu Huang, Shang-Wen Li, Hung-Yi Lee
Notably, in the low-resource scenario, prompting consistently outperforms adapter tuning.
2 code implementations • 28 Sep 2023 • Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective.
1 code implementation • 19 Sep 2023 • Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee
Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information.
1 code implementation • 5 Sep 2023 • Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs.
Ranked #2 on Text-to-Image Generation on MS COCO
no code implementations • 29 May 2023 • Guan-Wei Wu, Guan-Ting Lin, Shang-Wen Li, Hung-Yi Lee
However, the absence of intermediate targets and training guidance for textless SLU often results in suboptimal performance.
Automatic Speech Recognition (ASR) +3
1 code implementation • 26 May 2023 • Yung-Sung Chuang, Wei Fang, Shang-Wen Li, Wen-tau Yih, James Glass
We propose EAR, a query Expansion And Reranking approach for improving passage retrieval, with the application to open-domain question answering.
2 code implementations • 19 May 2023 • Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
no code implementations • 18 May 2023 • Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.
20 code implementations • 14 Apr 2023 • Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision.
Ranked #1 on Image Retrieval on AmsterTime (using extra training data)
no code implementations • 1 Mar 2023 • Kai-Wei Chang, Yu-Kai Wang, Hua Shen, Iu-thing Kang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee
For speech processing, SpeechPrompt shows its high parameter efficiency and competitive performance on a few speech classification tasks.
Ranked #17 on Spoken Language Understanding on Fluent Speech Commands (using extra training data)
1 code implementation • NeurIPS 2023 • Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer
We present Masked Audio-Video Learners (MAViL) to train audio-visual representations.
no code implementations • 15 Nov 2022 • Derek Xu, Shuyan Dong, Changhan Wang, Suyoun Kim, Zhaojiang Lin, Akshat Shrivastava, Shang-Wen Li, Liang-Hsuan Tseng, Alexei Baevski, Guan-Ting Lin, Hung-Yi Lee, Yizhou Sun, Wei Wang
Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information.
Automatic Speech Recognition (ASR) +10
no code implementations • 16 Oct 2022 • Tzu-hsun Feng, Annie Dong, Ching-Feng Yeh, Shu-wen Yang, Tzu-Quan Lin, Jiatong Shi, Kai-Wei Chang, Zili Huang, Haibin Wu, Xuankai Chang, Shinji Watanabe, Abdelrahman Mohamed, Shang-Wen Li, Hung-Yi Lee
We present the SUPERB challenge at SLT 2022, which aims at learning self-supervised speech representation for better performance, generalization, and efficiency.
no code implementations • 10 Oct 2022 • Zih-Ching Chen, Chin-Lun Fu, Chih-Ying Liu, Shang-Wen Li, Hung-Yi Lee
In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained.
no code implementations • 21 May 2022 • Abdelrahman Mohamed, Hung-Yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe
Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years.
Automatic Speech Recognition (ASR) +4
no code implementations • NAACL 2022 • Hung-Yi Lee, Shang-Wen Li, Ngoc Thang Vu
Deep learning has been the mainstream technique in the natural language processing (NLP) area.
1 code implementation • NAACL 2022 • Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Wen-tau Yih, Yoon Kim, James Glass
We propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings.
Ranked #13 on Semantic Textual Similarity on STS16
1 code implementation • 31 Mar 2022 • Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, Hung-Yi Lee
We report in this paper the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).
2 code implementations • 27 Mar 2022 • Guan-Ting Lin, Shang-Wen Li, Hung-Yi Lee
Although deep learning-based end-to-end Automatic Speech Recognition (ASR) has shown remarkable performance in recent years, it suffers severe performance regression on test samples drawn from different data distributions.
Automatic Speech Recognition (ASR) +2
1 code implementation • ACL 2022 • Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-Yi Lee
In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB.
1 code implementation • 9 Mar 2022 • Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Lin-shan Lee
We empirically showed that DUAL yields results comparable to those obtained by cascading ASR and text QA models, and that it is robust to real-world data.
Automatic Speech Recognition (ASR) +2
1 code implementation • 3 Mar 2022 • Andy T. Liu, Wei Xiao, Henghui Zhu, Dejiao Zhang, Shang-Wen Li, Andrew Arnold
Recently, prompt-based learning for pre-trained language models has succeeded in few-shot Named Entity Recognition (NER) by exploiting prompts as task guidance to increase label efficiency.
no code implementations • NAACL 2022 • Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai Wei, Andrew Arnold, Xiang Ren
We evaluate PTLM's ability to adapt to new corpora while retaining learned knowledge in earlier corpora.
1 code implementation • EMNLP 2021 • Dejiao Zhang, Shang-Wen Li, Wei Xiao, Henghui Zhu, Ramesh Nallapati, Andrew O. Arnold, Bing Xiang
Many recent successes in sentence representation learning have been achieved by simply fine-tuning on the Natural Language Inference (NLI) datasets with triplet loss or siamese loss.
no code implementations • ACL 2021 • Hung-Yi Lee, Ngoc Thang Vu, Shang-Wen Li
Meta-learning is one of the most important new techniques in machine learning in recent years.
1 code implementation • ACL (WOAH) 2021 • Yung-Sung Chuang, Mingye Gao, Hongyin Luo, James Glass, Hung-Yi Lee, Yun-Nung Chen, Shang-Wen Li
Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse.
no code implementations • 6 Jun 2021 • Hongyin Luo, Shuyan Dong, Yung-Sung Chuang, Shang-Wen Li
Neural network pretraining is gaining attention due to its outstanding performance in natural language processing applications.
6 code implementations • 3 May 2021 • Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-Yi Lee
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.
1 code implementation • NAACL 2022 • Hongyin Luo, Shang-Wen Li, Mingye Gao, Seunghak Yu, James Glass
Pretrained language models have significantly improved the performance of downstream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings.
Ranked #1 on Question Answering on MRQA out-of-domain
Extractive Question-Answering Machine Reading Comprehension +6
2 code implementations • EMNLP (ClinicalNLP) 2020 • Hongyin Luo, Shang-Wen Li, James Glass
Given a set of explicit symptoms provided by the patient to initiate a dialog for diagnosing, the system is trained to collect implicit symptoms by asking questions, in order to collect more information for making an accurate diagnosis.
no code implementations • EACL 2021 • Shuyang Li, Jin Cao, Mukund Sridhar, Henghui Zhu, Shang-Wen Li, Wael Hamza, Julian McAuley
Dialog State Tracking (DST), an integral part of modern dialog systems, aims to track user preferences and constraints (slots) in task-oriented dialogs.
no code implementations • 31 Dec 2020 • Shang-Wen Li
By linking and organizing pieces of learning content scattered in various course materials into an easily accessible structure, we hypothesize that this framework can provide learners guidance and improve content navigation.
no code implementations • 30 Nov 2020 • Shang-Wen Li, Jason Krone, Shuyan Dong, Yi Zhang, Yaser Al-Onaizan
Recently deep learning has dominated many machine learning areas, including spoken language understanding (SLU).
no code implementations • 11 Nov 2020 • Cheng-I Lai, Jin Cao, Sravan Bodapati, Shang-Wen Li
Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected the Automatic Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data.
1 code implementation • 26 Oct 2020 • Cheng-I Lai, Yung-Sung Chuang, Hung-Yi Lee, Shang-Wen Li, James Glass
Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data.
no code implementations • 9 Oct 2020 • Jin Cao, Jun Wang, Wael Hamza, Kelly Vanee, Shang-Wen Li
The light encoder architecture separates the shared pre-trained networks from the mappings of generally encoded knowledge to specific domains of SLU, allowing for the domain adaptation to be performed solely at the light encoder and thus increasing efficiency.
7 code implementations • 12 Jul 2020 • Andy T. Liu, Shang-Wen Li, Hung-Yi Lee
We present a large-scale comparison of various self-supervised models.
no code implementations • 19 May 2020 • Hongyin Luo, Shang-Wen Li, James Glass
Experiments showed that the ProtoQN significantly outperformed the baseline DQN model in both supervised and few-shot learning scenarios, and achieved state-of-the-art few-shot learning performance.
4 code implementations • 18 May 2020 • Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Yen-Hao Chen, Shang-Wen Li, Hung-Yi Lee
We use the representations with two downstream tasks: speaker identification and phoneme classification.
no code implementations • 11 Dec 2017 • Maryam Fazel-Zarandi, Shang-Wen Li, Jin Cao, Jared Casale, Peter Henderson, David Whitney, Alborz Geramifard
In this paper, we focus on learning robust dialog policies to recover from these errors.
Automatic Speech Recognition (ASR) +2
1 code implementation • 3 Jul 2016 • Yuzhuo Ren, Chen Chen, Shang-Wen Li, C. -C. Jay Kuo
The task of estimating the spatial layout of cluttered indoor scenes from a single RGB image is addressed in this work.
no code implementations • 3 Apr 2016 • Yuzhuo Ren, Chen Chen, Shang-Wen Li, C. -C. Jay Kuo
The proposed Global-attributes Assisted Labeling (GAL) system exploits both local features and global attributes.
no code implementations • 28 Feb 2016 • Shang-Wen Li, Sanjay Purushotham, Chen Chen, Yuzhuo Ren, C. -C. Jay Kuo
Textual data such as tags and sentence descriptions are combined with visual cues to reduce the semantic gap for image retrieval applications in today's Multimodal Image Retrieval (MIR) systems.