Search Results for author: Wangyou Zhang

Found 18 papers, 5 papers with code

SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition

no code implementations • 31 Jan 2024 • Yihan Wu, Soumi Maiti, Yifan Peng, Wangyou Zhang, Chenda Li, Yuyue Wang, Xihua Wang, Shinji Watanabe, Ruihua Song

Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.

Language Modelling Speech Enhancement +4

Paper
Add Code

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

2 code implementations • 30 Jan 2024 • Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models.

Ranked #1 on Speaker Verification on VoxCeleb (using extra training data)

Self-Supervised Learning Speaker Recognition +1

7,851

Paper
Code

Improving Design of Input Condition Invariant Speech Enhancement

1 code implementation • 25 Jan 2024 • Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated.

Speech Enhancement

7,851

Paper
Code

A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction

no code implementations • 12 Oct 2023 • Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa

We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting.

Denoising Speech Enhancement +2

Paper
Add Code

Toward Universal Speech Enhancement for Diverse Input Conditions

no code implementations • 29 Sep 2023 • Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model.

Denoising Speech Enhancement

Paper
Add Code

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

no code implementations • 26 Sep 2023 • William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe

We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials.

Denoising Self-Supervised Learning

Paper
Add Code

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

1 code implementation • 25 Sep 2023 • Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

Pre-training speech models on large volumes of data has achieved remarkable success.

Speech Recognition Translation

7,851

Paper
Code

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

no code implementations • 23 Jul 2023 • Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe

In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

no code implementations • 25 May 2023 • Wangyou Zhang, Yanmin Qian

Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data.

Denoising Self-Supervised Learning +2

Paper
Add Code

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

1 code implementation • 19 Jul 2022 • Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

To showcase such integration, we performed experiments on carefully designed synthetic datasets for noisy-reverberant multi-channel ST and SLU tasks, which can be used as benchmark corpora for future research.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

7,851

Paper
Code

End-to-End Multi-speaker ASR with Independent Vector Analysis

no code implementations • 1 Apr 2022 • Robin Scheibler, Wangyou Zhang, Xuankai Chang, Shinji Watanabe, Yanmin Qian

We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge

no code implementations • 24 Feb 2022 • Yen-Ju Lu, Samuele Cornell, Xuankai Chang, Wangyou Zhang, Chenda Li, Zhaoheng Ni, Zhong-Qiu Wang, Shinji Watanabe

This paper describes our submission to the L3DAS22 Challenge Task 1, which consists of speech enhancement with 3D Ambisonic microphones.

Speech Enhancement

Paper
Add Code

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

no code implementations • 27 Oct 2021 • Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

The deep learning based time-domain models, e. g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement.

Speech Enhancement speech-recognition +1

Paper
Add Code

Separating Long-Form Speech with Group-Wise Permutation Invariant Training

no code implementations • 27 Oct 2021 • Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription.

Speech Separation

Paper
Add Code

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

no code implementations • 23 Feb 2021 • Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions.

Action Detection Activity Detection +4

Paper
Add Code

End-to-End Multi-speaker Speech Recognition with Transformer

no code implementations • 10 Feb 2020 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios.

speech-recognition Speech Recognition

Paper
Add Code

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

no code implementations • 15 Oct 2019 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.

speech-recognition Speech Recognition +1

Paper
Add Code

A Comparative Study on Transformer vs RNN in Speech Applications

1 code implementation • 13 Sep 2019 • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).

Ranked #12 on Speech Recognition on AISHELL-1

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

7,851

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.