Search Results for author: Xuankai Chang

Found 51 papers, 15 papers with code

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

1 code implementation • 7 Dec 2024 • Pengcheng Guo, Xuankai Chang, Hang Lv, Shinji Watanabe, Lei Xie

With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix Test set and 4.4% on the WSJ0-2Mix Test set.

Automatic Speech Recognition · Data Augmentation +3

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts

no code implementations • 19 Sep 2024 • Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, Shinji Watanabe

Moreover, to incorporate visual information effectively, we inject visual information into the ASR model through a mixture-of-experts module.

Mixture-of-Experts · Robust Speech Recognition +1
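
The entry above only names the fusion mechanism; as a rough illustration, a mixture-of-experts layer can softly route concatenated audio-visual features through several small experts. The snippet below is a minimal sketch under assumed shapes and module choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router softly weights expert outputs.

    Purely illustrative -- the dimensions, expert count, and the way visual
    features are concatenated are assumptions, not the paper's design.
    """
    def __init__(self, audio_dim=256, visual_dim=128, num_experts=4):
        super().__init__()
        fused_dim = audio_dim + visual_dim
        self.experts = nn.ModuleList(
            [nn.Linear(fused_dim, audio_dim) for _ in range(num_experts)]
        )
        self.router = nn.Linear(fused_dim, num_experts)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim), visual_feats: (B, T, visual_dim)
        fused = torch.cat([audio_feats, visual_feats], dim=-1)
        gates = torch.softmax(self.router(fused), dim=-1)                   # (B, T, E)
        expert_outs = torch.stack([e(fused) for e in self.experts], dim=-2)  # (B, T, E, D)
        return (gates.unsqueeze(-1) * expert_outs).sum(dim=-2)              # (B, T, D)

# Example: fuse 100 frames of audio and visual features
audio = torch.randn(2, 100, 256)
visual = torch.randn(2, 100, 128)
out = SimpleMoELayer()(audio, visual)
print(out.shape)  # torch.Size([2, 100, 256])
```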

SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data

1 code implementation • 1 Aug 2024 • Yichen Lu, Jiaqi Song, Xuankai Chang, Hengwei Bian, Soumi Maiti, Shinji Watanabe

In this work, we present SynesLM, a unified model which can perform three multimodal language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and visual-aided speech/machine translation (VST/VMT).

Audio-Visual Speech Recognition · Automatic Speech Recognition +5

Towards Robust Speech Representation Learning for Thousands of Languages

no code implementations • 30 Jun 2024 • William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold.

Representation Learning · Self-Supervised Learning +1

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

no code implementations • 12 Jun 2024 • Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-Yi Lee, Shinji Watanabe

This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +4

LV-CTC: Non-autoregressive ASR with CTC and latent variable models

no code implementations • 28 Mar 2024 • Yuya Fujita, Shinji Watanabe, Xuankai Chang, Takashi Maekaku

In this paper, we propose a new model combining CTC with a latent variable model, a class of models that is among the state of the art in neural machine translation research.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +2
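
For readers unfamiliar with the CTC side of this combination, the snippet below shows a bare-bones use of a CTC criterion; the latent-variable component is not sketched here, and all shapes and the blank index are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal use of the CTC criterion that LV-CTC builds on.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(50, 2, 30).log_softmax(dim=-1)  # (time, batch, vocab)
targets = torch.randint(1, 30, (2, 12))                  # token IDs; 0 reserved for blank
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 12, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```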

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

no code implementations • 9 Oct 2023 • Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework, emphasizing self-supervised models in multilingual speech recognition and language identification.

Language Identification · speech-recognition +1

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

no code implementations • 27 Sep 2023 • Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation.

Decoder · Machine Translation +4
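
In contrast to the soft sharing with secondary text encoders described above, hard parameter sharing routes both modalities through one set of weights and differs only in the input front-end. The sketch below uses assumed dimensions and stand-in modules to show the basic idea, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class HardSharedSTModel(nn.Module):
    """Toy illustration of hard parameter sharing for ST + MT multi-tasking.

    One encoder and one output layer serve both tasks; only the front-ends
    differ. Sizes and module choices are illustrative assumptions.
    """
    def __init__(self, vocab=1000, d_model=256):
        super().__init__()
        self.speech_frontend = nn.Linear(80, d_model)      # log-Mel frames -> d_model
        self.text_frontend = nn.Embedding(vocab, d_model)  # source-text tokens -> d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.output = nn.Linear(d_model, vocab)            # target-text vocabulary

    def forward(self, speech=None, text=None):
        x = self.speech_frontend(speech) if speech is not None else self.text_frontend(text)
        return self.output(self.shared_encoder(x))

model = HardSharedSTModel()
st_logits = model(speech=torch.randn(2, 120, 80))        # ST path: speech in
mt_logits = model(text=torch.randint(0, 1000, (2, 30)))  # MT path: text in, same encoder
```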

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

no code implementations • 26 Sep 2023 • William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials.

Denoising · Self-Supervised Learning

VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

no code implementations • 14 Sep 2023 • Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.

Decoder · Language Modeling +5
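
The unification rests on expressing every task as next-token prediction over one mixed vocabulary of text tokens, discretized speech tokens, and task tags. The toy layout below is an assumed illustration of that idea, not VoxtLM's actual token inventory or formatting.

```python
# Hypothetical token layouts for a single decoder-only speech/text LM.
speech_tokens = [f"<s{i}>" for i in (12, 7, 431, 88)]   # discretized speech units (assumed)
text_tokens = list("hello")                              # character-level text, for brevity

asr_example = ["<task:asr>"] + speech_tokens + ["<sep>"] + text_tokens + ["<eos>"]
tts_example = ["<task:tts>"] + text_tokens + ["<sep>"] + speech_tokens + ["<eos>"]
lm_example  = ["<task:textlm>"] + text_tokens + ["<eos>"]

# All three sequences can be trained with the same next-token prediction loss,
# so one set of decoder parameters covers recognition, synthesis, and continuation.
print(asr_example)
```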

A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning

2 code implementations • 19 May 2023 • Jiyang Tang, William Chen, Xuankai Chang, Shinji Watanabe, Brian MacWhinney

Our system achieves state-of-the-art speaker-level detection accuracy (97.3%), and a relative WER reduction of 11% for moderate Aphasia patients.

Multi-Task Learning · speech-recognition +1

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

no code implementations • 18 May 2023 • Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, Shinji Watanabe

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks.

Automatic Speech Recognition · Language Identification +3

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

1 code implementation • 25 Apr 2023 • Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue.

Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms

no code implementations • 16 Mar 2023 • Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Hojeong Lee, Ankit Shah, Shuo Han, Yunyang Zeng, Amanda Shu, Haohui Liu, Xuankai Chang, Hamza Khalid, Minseon Gwak, Kawon Lee, Minjeong Kim, Bhiksha Raj

In this paper, we present a method for fine-tuning models trained on the Deep Noise Suppression (DNS) 2020 Challenge to improve their performance on Voice over Internet Protocol (VoIP) applications.

Multi-Task Learning · Speech Enhancement +2

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

1 code implementation • 19 Jul 2022 • Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +5

Two-Pass Low Latency End-to-End Spoken Language Understanding

no code implementations • 14 Jul 2022 • Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan Black, Shinji Watanabe

End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches.

speech-recognition · Speech Recognition +2

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

no code implementations • 1 Apr 2022 • Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targeting robust speech recognition, called Integrated speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS).

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +4
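
IRIS chains a speech-enhancement front-end, an SSL representation extractor, and an ASR back-end into one differentiable model. The sketch below uses stand-in placeholder modules with assumed sizes just to show how such a pipeline composes; it is not the actual IRIS network.

```python
import torch
import torch.nn as nn

class IrisStylePipeline(nn.Module):
    """Sketch of chaining an enhancement front-end, an SSL feature extractor,
    and an ASR head into one differentiable pipeline. All modules here are
    simplified placeholders, not the actual IRIS components."""
    def __init__(self, feat_dim=80, ssl_dim=256, vocab=500):
        super().__init__()
        self.enhancer = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, feat_dim))       # stands in for an SE model
        self.ssl_extractor = nn.GRU(feat_dim, ssl_dim, batch_first=True)   # stands in for a pretrained SSL model
        self.asr_head = nn.Linear(ssl_dim, vocab)                          # CTC-style output layer

    def forward(self, noisy_feats):
        enhanced = self.enhancer(noisy_feats)        # denoise first
        ssl_feats, _ = self.ssl_extractor(enhanced)  # then extract representations
        return self.asr_head(ssl_feats)              # then predict token logits

logits = IrisStylePipeline()(torch.randn(2, 200, 80))
print(logits.shape)  # torch.Size([2, 200, 500])
```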

Joint Speech Recognition and Audio Captioning

no code implementations • 3 Feb 2022 • Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions.

AudioCaps · Audio captioning +4

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

no code implementations • 17 Dec 2021 • Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification.

regression · Speech Separation
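
The core move is to replace a continuous regression target with a discrete symbol sequence that a classifier predicts and a later re-synthesis stage converts back to audio. Below is a minimal sketch of the discretize-then-classify step, with an assumed random codebook and shapes; the paper's tokenizer and separation model are not reproduced.

```python
import torch
import torch.nn.functional as F

# Toy illustration of recasting enhancement/separation from regression to classification.
codebook = torch.randn(512, 80)            # 512 discrete symbols over 80-dim features (assumed)
clean_feats = torch.randn(4, 100, 80)      # (batch, time, feat) clean targets

# Discretize: each clean frame becomes the index of its nearest codebook entry.
dists = torch.cdist(clean_feats, codebook.unsqueeze(0).expand(4, -1, -1))
targets = dists.argmin(dim=-1)             # (batch, time) integer symbols

# A separation/enhancement model would now output logits over the 512 symbols
# and be trained with cross-entropy instead of an MSE regression loss.
logits = torch.randn(4, 100, 512, requires_grad=True)
loss = F.cross_entropy(logits.reshape(-1, 512), targets.reshape(-1))
loss.backward()
```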

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

1 code implementation • 16 Jun 2021 • Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie

Moreover, by including data with variable numbers of speakers, our model can even outperform the PIT-Conformer AR model with only 1/7 the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +1

Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings

1 code implementation • 11 Aug 2020 • Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

However, the model required prior knowledge of speaker profiles to perform speaker identification, which significantly limited the application of the model.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +3

End-to-End Multi-speaker Speech Recognition with Transformer

no code implementations • 10 Feb 2020 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios.

Decoder · speech-recognition +1

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

no code implementations • 15 Oct 2019 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.

speech-recognition · Speech Recognition +1

End-to-End Monaural Multi-speaker ASR System without Pretraining

no code implementations • 5 Nov 2018 • Xuankai Chang, Yanmin Qian, Kai Yu, Shinji Watanabe

The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating the overlapping speech and recognizing the separated streams.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +2

Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

no code implementations • 19 Jul 2017 • Yanmin Qian, Xuankai Chang, Dong Yu

Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +2
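
Permutation invariant training sidesteps the speaker-label ambiguity by evaluating the loss under every assignment of model outputs to references and training on the best one. The function below is a generic utterance-level PIT sketch with an MSE criterion and assumed tensor shapes, not the exact recipe from the paper.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_mse_loss(estimates, references):
    """Utterance-level permutation invariant training (PIT) loss.

    estimates, references: (batch, num_speakers, time) tensors.
    For every permutation of speaker labels, compute the MSE and keep the
    minimum -- a generic sketch of the PIT idea with assumed shapes.
    """
    batch, n_spk, _ = estimates.shape
    losses = []
    for perm in itertools.permutations(range(n_spk)):
        permuted = references[:, list(perm), :]
        losses.append(F.mse_loss(estimates, permuted, reduction="none").mean(dim=(1, 2)))
    return torch.stack(losses, dim=1).min(dim=1).values.mean()

est = torch.randn(2, 2, 1000, requires_grad=True)  # two estimated speaker streams
ref = torch.randn(2, 2, 1000)                       # two reference streams
loss = pit_mse_loss(est, ref)
loss.backward()
```

For larger speaker counts the factorial number of permutations becomes the bottleneck, which is one motivation for alternatives such as the conditional speaker chain used in the multi-speaker ASR entry above.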
