no code implementations • 30 May 2023 • Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng
State-of-the-art large-scale universal speech models (USMs) show decent automatic speech recognition (ASR) performance across multiple domains and languages.
Tasks: Automatic Speech Recognition (ASR) (+1 more)
no code implementations • 23 May 2023 • Yuwei Fang, Mahmoud Khademi, Chenguang Zhu, ZiYi Yang, Reid Pryzant, Yichong Xu, Yao Qian, Takuya Yoshioka, Lu Yuan, Michael Zeng, Xuedong Huang
Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities.
no code implementations • 21 May 2023 • ZiYi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang
The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models, which lack generative abilities.
1 code implementation • 15 Mar 2023 • Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng
Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources.
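The TSE formulation above conditions a separation model on a clue identifying the target sound. As a rough illustration only, here is a minimal PyTorch sketch of such a class-conditioned extraction network; it is not the paper's architecture, and every module, dimension, and name below is a hypothetical stand-in.

```python
# Minimal sketch of a generic target sound extraction (TSE) pipeline: a
# separation network conditioned on an embedding of the target sound class.
# NOT the paper's model; all modules and dimensions are illustrative.
import torch
import torch.nn as nn

class TinyTSE(nn.Module):
    def __init__(self, n_classes=50, feat_dim=256, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, kernel, stride=stride)
        self.class_emb = nn.Embedding(n_classes, feat_dim)
        self.mask_net = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, 1), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel, stride=stride)

    def forward(self, mixture, target_class):
        # mixture: (batch, samples); target_class: (batch,) class indices
        feats = self.encoder(mixture.unsqueeze(1))          # (B, F, T)
        cond = self.class_emb(target_class).unsqueeze(-1)   # (B, F, 1)
        mask = self.mask_net(feats * cond)                  # condition by scaling
        return self.decoder(feats * mask).squeeze(1)        # (B, samples)

mix = torch.randn(2, 16000)                 # 1 s of 16 kHz audio
est = TinyTSE()(mix, torch.tensor([3, 7]))  # extract class-3 / class-7 sounds
```

The conditioning here is a simple multiplicative interaction between the class embedding and the encoder features; real systems typically use far deeper mask estimators.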
no code implementations • 24 Feb 2023 • Naoyuki Kanda, Takuya Yoshioka, Yang Liu
This paper presents a novel optimization framework for automatic speech recognition (ASR) with the aim of reducing hallucinations produced by an ASR model.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 18 Nov 2022 • Hyungchan Song, Sanyuan Chen, Zhuo Chen, Yu Wu, Takuya Yoshioka, Min Tang, Jong Won Shin, Shujie Liu
Recent years have seen a surge of interest in self-supervised learning approaches for end-to-end speech encoding, as they have achieved great success.
no code implementations • 11 Nov 2022 • Xiaofei Wang, Zhuo Chen, Yu Shi, Jian Wu, Naoyuki Kanda, Takuya Yoshioka
Employing a monaural speech separation (SS) model as a front-end for automatic speech recognition (ASR) involves balancing two kinds of trade-offs.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 10 Nov 2022 • Zili Huang, Zhuo Chen, Naoyuki Kanda, Jian Wu, Yiming Wang, Jinyu Li, Takuya Yoshioka, Xiaofei Wang, Peidong Wang
In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion.
no code implementations • 9 Nov 2022 • Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez
Compared with a supervised baseline and the WavLM-based SS model using feature embeddings obtained with the previously released WavLM trained on 94K hours of data, our proposed model achieves relative word error rate (WER) reductions of 15.9% and 11.2%, respectively, on a simulated far-field speech mixture test set.
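For readers unfamiliar with the metric quoted here, relative WER reduction compares the error drop against the baseline's own error rate. A quick sketch; the absolute WERs below are made-up placeholders, and only the formula and the 15.9% figure come from the text.

```python
# Relative word error rate (WER) reduction, as quoted in the snippet above.
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    return (baseline_wer - new_wer) / baseline_wer

# A hypothetical baseline WER of 20.0% improved to 16.82% is a 15.9% relative
# reduction -- matching the first number reported above.
print(f"{relative_wer_reduction(20.0, 16.82):.3f}")  # 0.159
```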
no code implementations • 5 Nov 2022 • Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka
This prevents the PSE model from being too aggressive while still allowing the model to learn to suppress the input speech when it is likely to be spoken by interfering speakers.
no code implementations • 4 Nov 2022 • Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Parnamaa, Huaming Wang
Personalized speech enhancement (PSE) is a real-time SE approach utilizing a speaker embedding of a target person to remove background noise, reverberation, and interfering voices.
1 code implementation • 4 Nov 2022 • Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota
We present the first neural network model to achieve real-time and streaming target sound extraction.
Ranked #1 on Streaming Target Sound Extraction on FSDSoundScapes
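"Real-time and streaming" here means the model must emit output chunk by chunk using only past context. The sketch below shows that operational pattern with a trivial stand-in model; the actual network, state handling, and chunk size of the paper are not reproduced.

```python
# Streaming inference skeleton: audio arrives in small chunks and the model
# produces output per chunk while carrying state forward. The "model" below
# is a toy exponential moving average, not the paper's network.
import numpy as np

CHUNK = 160  # 10 ms at 16 kHz -- an assumed hop, not the paper's value

def run_streaming(model_step, audio: np.ndarray) -> np.ndarray:
    state = None          # recurrent / buffer state carried across chunks
    out = []
    for start in range(0, len(audio) - CHUNK + 1, CHUNK):
        chunk = audio[start:start + CHUNK]
        y, state = model_step(chunk, state)   # causal: sees past chunks only
        out.append(y)
    return np.concatenate(out)

def ema_step(chunk, state, alpha=0.9):
    prev = 0.0 if state is None else state
    y = np.empty_like(chunk)
    for i, x in enumerate(chunk):
        prev = alpha * prev + (1 - alpha) * x
        y[i] = prev
    return y, prev

enhanced = run_streaming(ema_step, np.random.randn(16000).astype(np.float32))
```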
no code implementations • 27 Oct 2022 • Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka
Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 12 Sep 2022 • Naoyuki Kanda, Jian Wu, Xiaofei Wang, Zhuo Chen, Jinyu Li, Takuya Yoshioka
To combine the best of both technologies, we newly design a t-SOT-based ASR model that generates a serialized multi-talker transcription based on two separated speech signals from VarArray.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 27 Aug 2022 • Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu
This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers.
no code implementations • 3 May 2022 • ZiYi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang
Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview.
no code implementations • 27 Apr 2022 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu
In this paper, an ultra-fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher-student learning (T-S learning).
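T-S learning in this setting generally means training a small, fast student to reproduce the separated outputs of a large teacher. A generic PyTorch sketch, not the paper's exact recipe; the models and loss weight are placeholders.

```python
# Generic teacher-student (T-S) distillation: the student matches the
# teacher's outputs on (possibly unlabeled) mixtures, optionally mixed with
# a supervised term when ground truth is available.
import torch
import torch.nn.functional as F

def ts_loss(student_out, teacher_out, reference=None, alpha=0.5):
    loss = F.l1_loss(student_out, teacher_out)     # match the teacher
    if reference is not None:                      # ground truth, when known
        loss = alpha * loss + (1 - alpha) * F.l1_loss(student_out, reference)
    return loss

teacher = torch.nn.GRU(64, 64, num_layers=4, batch_first=True)  # "big"
student = torch.nn.GRU(64, 64, num_layers=1, batch_first=True)  # "fast"
mix = torch.randn(8, 100, 64)
with torch.no_grad():
    t_out, _ = teacher(mix)                        # pseudo-targets
s_out, _ = student(mix)
ts_loss(s_out, t_out).backward()
```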
no code implementations • 7 Apr 2022 • Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka
In this paper, we propose a three-stage training scheme for the CSS model that can leverage both supervised data and extra large-scale unsupervised real-world conversational data.
no code implementations • 2 Apr 2022 • Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang
Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model.
Tasks: Automatic Speech Recognition (ASR) (+4 more)
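Target speaker over-suppression (TSOS) measures how often the enhancer removes the very speech it is supposed to keep. The sketch below shows one plausible frame-based way to quantify it; the frame length and the -10 dB threshold are illustrative assumptions, not necessarily the paper's definition.

```python
# One plausible over-suppression rate: the fraction of target-active frames
# in which the enhanced signal retains far less energy than the clean target.
import numpy as np

def tsos_rate(target: np.ndarray, enhanced: np.ndarray,
              frame: int = 160, thresh_db: float = -10.0) -> float:
    suppressed, active = 0, 0
    for i in range(len(target) // frame):
        t = target[i * frame:(i + 1) * frame]
        e = enhanced[i * frame:(i + 1) * frame]
        t_pow = np.mean(t ** 2)
        if t_pow < 1e-8:          # skip frames where the target is silent
            continue
        active += 1
        ratio_db = 10 * np.log10(np.mean(e ** 2) / t_pow + 1e-12)
        if ratio_db < thresh_db:  # target energy mostly removed
            suppressed += 1
    return suppressed / max(active, 1)

x = np.random.randn(16000)
print(tsos_rate(x, 0.05 * x))     # heavy attenuation -> rate near 1.0
```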
no code implementations • 30 Mar 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka
The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency.
Tasks: Automatic Speech Recognition (ASR) (+4 more)
1 code implementation • 27 Feb 2022 • Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, Robert Aichner
We open-source datasets and test sets for researchers to train their deep noise suppression models, as well as a subjective evaluation framework based on ITU-T P.835 to rate and rank-order the challenge entries.
no code implementations • 2 Feb 2022 • Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR).
Tasks: Automatic Speech Recognition (ASR) (+1 more)
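The core of t-SOT is a serialization scheme: word tokens from overlapping speakers are merged in chronological order, with a special channel-change token marking switches between (at most two) virtual output channels. A small sketch of that idea, with made-up token timings:

```python
# t-SOT-style serialization: merge per-speaker token streams by emission time
# and insert a <cc> ("channel change") token at every virtual-channel switch.
CC = "<cc>"

def serialize_tsot(utterances):
    """utterances: list of (channel, [(time, token), ...])."""
    timed = sorted(
        (time, ch, tok) for ch, toks in utterances for time, tok in toks)
    out, prev_ch = [], None
    for _, ch, tok in timed:
        if prev_ch is not None and ch != prev_ch:
            out.append(CC)              # speaker/channel switch marker
        out.append(tok)
        prev_ch = ch
    return out

spk0 = (0, [(0.0, "hello"), (0.4, "how"), (0.8, "are"), (1.0, "you")])
spk1 = (1, [(0.5, "good"), (0.9, "morning")])
print(serialize_tsot([spk0, spk1]))
# ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']
```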
no code implementations • 24 Jan 2022 • Takuya Yoshioka, Xiaofei Wang, Dongmei Wang
Since PickNet utilizes only limited acoustic context at each time frame, the system using the proposed model works in real time and is robust to changes in acoustic conditions.
no code implementations • 28 Oct 2021 • Heming Wang, Yao Qian, Xiaofei Wang, Yiming Wang, Chengyi Wang, Shujie Liu, Takuya Yoshioka, Jinyu Li, DeLiang Wang
The reconstruction module is used for auxiliary learning to improve the noise robustness of the learned representation and thus is not required during inference.
Tasks: Automatic Speech Recognition (ASR) (+8 more)
no code implementations • 28 Oct 2021 • Yixuan Zhang, Zhuo Chen, Jian Wu, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li
In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting.
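The appeal of RSAN for CSS is that the number of outputs is not fixed: the network repeatedly peels one source off a residual representation and stops when little residual remains. A toy sketch of that control flow follows; the mask network and stopping rule are placeholders, not the RSAN architecture.

```python
# Recurrent source extraction with a shrinking residual: each iteration
# claims one source's time-frequency mask, and the loop stops once the
# residual is (nearly) empty -- implicit speaker counting.
import torch
import torch.nn as nn

mask_net = nn.Sequential(nn.Linear(257, 257), nn.Sigmoid())  # toy mask model

def recurrent_extract(spec_mag, max_spk=4, stop_thresh=0.05):
    """spec_mag: (frames, bins) magnitude spectrogram. Returns list of masks."""
    residual = torch.ones_like(spec_mag)   # how much energy is still unclaimed
    outputs = []
    for _ in range(max_spk):
        mask = mask_net(spec_mag * residual)        # attend within the residual
        mask = torch.minimum(mask, residual)        # cannot claim more than left
        outputs.append(mask)
        residual = residual - mask
        if residual.mean() < stop_thresh:           # little left -> stop
            break
    return outputs

masks = recurrent_extract(torch.rand(100, 257))
print(len(masks), masks[0].shape)
```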
no code implementations • 27 Oct 2021 • Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei
Multi-talker conversational speech processing has drawn much interest for various applications such as meeting transcription.
5 code implementations • 26 Oct 2021 • Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
Self-supervised learning (SSL) has achieved great success in speech recognition, while only limited exploration has been attempted for other speech processing tasks.
no code implementations • 20 Oct 2021 • Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Zhuo Chen, Xuedong Huang
Experimental results show that the proposed geometry agnostic model outperforms the model trained on a specific microphone array geometry in both speech quality and automatic speech recognition accuracy.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 18 Oct 2021 • Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, Xuedong Huang
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models, and the multi-task training can alleviate the TSOS issue in addition to improving the speech recognition accuracy.
no code implementations • 13 Oct 2021 • Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez
Recently, the all deep learning MVDR (ADL-MVDR) model was proposed for neural beamforming and demonstrated superior performance in a target speech extraction task using pre-segmented input.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
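As background for ADL-MVDR, the classical MVDR beamformer that the "all deep learning" variant replaces with recurrent networks computes, per frequency bin, w = Phi_n^{-1} d / (d^H Phi_n^{-1} d) from a noise spatial covariance Phi_n and a target steering vector d. A toy NumPy sketch with made-up dimensions:

```python
# Classical MVDR weights for one frequency bin: minimize noise power subject
# to a distortionless (unit-gain) response toward the target direction.
import numpy as np

def mvdr_weights(phi_noise: np.ndarray, steering: np.ndarray) -> np.ndarray:
    """phi_noise: (M, M) Hermitian noise covariance; steering: (M,)."""
    num = np.linalg.solve(phi_noise, steering)          # Phi_n^{-1} d
    return num / (steering.conj() @ num)                # normalize target gain

M = 4
d = np.exp(1j * np.pi * np.arange(M) * 0.3)             # toy steering vector
noise = (np.random.randn(M, 200) + 1j * np.random.randn(M, 200)) / np.sqrt(2)
phi_n = noise @ noise.conj().T / 200 + 1e-3 * np.eye(M)
w = mvdr_weights(phi_n, d)
print(np.abs(w.conj() @ d))   # ~1.0: distortionless response to the target
```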
no code implementations • 12 Oct 2021 • Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo Chen, Naoyuki Kanda
Continuous speech separation using a microphone array has been shown to be promising in dealing with the speech overlap problem in natural conversation transcription.
no code implementations • 7 Oct 2021 • Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Similar to the target-speaker voice activity detection (TS-VAD)-based diarization method, the E2E SA-ASR model is applied to estimate the speech activity of each speaker, while it has the advantages of (i) handling an unlimited number of speakers, (ii) leveraging linguistic information for speaker diarization, and (iii) simultaneously generating speaker-attributed transcriptions.
no code implementations • 6 Jul 2021 • Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Our evaluation on the AMI meeting corpus reveals that, after fine-tuning with a small amount of real data, the joint system performs 8.9-29.9% better in accuracy than the best modular system, while the modular system performs better before such fine-tuning.
Tasks: Automatic Speech Recognition (ASR) (+5 more)
no code implementations • 5 Jul 2021 • Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li
Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR).
Tasks: Automatic Speech Recognition (ASR) (+3 more)
no code implementations • 5 Jun 2021 • Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka
Performance analysis is also carried out by changing the ASR model, the data used for the ASR-step, and the schedule of the two update steps.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 5 Apr 2021 • Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 31 Mar 2021 • Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
Tasks: Automatic Speech Recognition (ASR) (+1 more)
no code implementations • 3 Mar 2021 • Dongmei Wang, Takuya Yoshioka, Zhuo Chen, Xiaofei Wang, Tianyan Zhou, Zhong Meng
Prior studies show that, with a spatial-temporal interleaving structure, neural networks can efficiently utilize the multi-channel signals of the ad hoc array.
no code implementations • 6 Jan 2021 • Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
Then, we propose a novel method using a sequence-to-sequence model, called hypothesis stitcher.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 3 Nov 2020 • Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation.
Tasks: Automatic Speech Recognition (ASR) (+4 more)
1 code implementation • 3 Nov 2020 • Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka
Recently, an end-to-end speaker-attributed automatic speech recognition (E2E SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
1 code implementation • 23 Oct 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li
With its strong modeling capacity that comes from a multi-head and multi-layer structure, the Transformer is a very powerful model for learning sequential representations and has recently been successfully applied to speech separation.
no code implementations • 7 Sep 2020 • Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie
Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription.
1 code implementation • 13 Aug 2020 • Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou
Continuous speech separation plays a vital role in complicated speech related tasks such as conversation transcription.
Ranked #1 on Speech Separation on LibriCSS (using extra training data)
1 code implementation • 11 Aug 2020 • Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka
However, the model required prior knowledge of speaker profiles to perform speaker identification, which significantly limited the application of the model.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 19 Jun 2020 • Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka
We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
no code implementations • 28 Apr 2020 • Dongmei Wang, Zhuo Chen, Takuya Yoshioka
The inter-channel processing layers apply a self-attention mechanism along the channel dimension to exploit the information obtained with a varying number of microphones.
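Because self-attention is permutation-invariant and agnostic to sequence length, running it along the channel axis lets one set of weights serve arrays with any number of microphones. A minimal PyTorch sketch of that idea; the dimensions and the use of nn.MultiheadAttention are illustrative choices, not the paper's exact layer.

```python
# Self-attention across the channel (microphone) dimension: at each time
# frame, per-channel features attend to each other, so the layer accepts a
# varying number of channels with the same weights.
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    def __init__(self, feat_dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, time, feat) -- channels may vary per call
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * t, c, f)  # attend over channels
        y, _ = self.attn(x, x, x)
        return y.reshape(b, t, c, f).permute(0, 2, 1, 3)

layer = ChannelSelfAttention()
for n_mics in (2, 5, 8):                 # same weights, varying array size
    print(n_mics, layer(torch.randn(1, n_mics, 50, 64)).shape)
```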
no code implementations • 20 Apr 2020 • Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant
Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
no code implementations • 28 Mar 2020 • Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
We also show that the SOT models can accurately count the number of speakers in the input audio.
1 code implementation • 30 Jan 2020 • Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li
In this paper, we define continuous speech separation (CSS) as a task of generating a set of non-overlapped speech signals from a continuous audio stream that contains multiple utterances that are partially overlapped by a varying degree.
Tasks: Automatic Speech Recognition (ASR) (+2 more)
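A common way to realize this definition at inference time is sliding-window processing: separate each window into a fixed number of channels, align consecutive windows by the channel permutation that agrees best on their overlap, and cross-fade. A toy sketch under those assumptions; the separator below is a trivial placeholder.

```python
# Window-by-window CSS inference with permutation alignment and overlap-add.
import itertools
import numpy as np

def css_stitch(stream, separate, win=4000, hop=2000, n_ch=2):
    out = np.zeros((n_ch, len(stream)))
    prev = None
    for start in range(0, len(stream) - win + 1, hop):
        chans = separate(stream[start:start + win])    # (n_ch, win)
        if prev is not None:
            # pick the permutation minimizing disagreement on the overlap
            overlap = win - hop
            best = min(itertools.permutations(range(n_ch)),
                       key=lambda p: sum(np.sum((prev[q][hop:] -
                                                 chans[p[q]][:overlap]) ** 2)
                                         for q in range(n_ch)))
            chans = chans[list(best)]
        out[:, start:start + win] += chans * np.hanning(win)  # cross-fade
        prev = chans
    return out

dummy_sep = lambda x: np.stack([x * 0.5, x * 0.5])   # stand-in separator
print(css_stitch(np.random.randn(20000), dummy_sep).shape)
```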
no code implementations • 10 Dec 2019 • Takuya Yoshioka, Igor Abramovski, Cem Aksoylar, Zhuo Chen, Moshe David, Dimitrios Dimitriadis, Yifan Gong, Ilya Gurvich, Xuedong Huang, Yan Huang, Aviv Hurvitz, Li Jiang, Sharon Koubi, Eyal Krupka, Ido Leichter, Changliang Liu, Partha Parthasarathy, Alon Vinnikov, Lingfeng Wu, Xiong Xiao, Wayne Xiong, Huaming Wang, Zhenghao Wang, Jun Zhang, Yong Zhao, Tianyan Zhou
This increases marginally to 1.6% when 50% of the attendees are unknown to the system.
2 code implementations • 30 Oct 2019 • Yi Luo, Zhuo Chen, Nima Mesgarani, Takuya Yoshioka
An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones.
8 code implementations • 14 Oct 2019 • Yi Luo, Zhuo Chen, Takuya Yoshioka
Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods.
Ranked #15 on Speech Separation on WSJ0-2mix
2 code implementations • 17 Sep 2019 • Andreas Stolcke, Takuya Yoshioka
Speech recognition and other natural language tasks have long benefited from voting-based algorithms as a method to aggregate outputs from several systems to achieve a higher accuracy than any of the individual systems.
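The classic instance of such voting is ROVER-style hypothesis combination. The toy sketch below shows the voting step only, assuming the hypotheses are already aligned position by position; real systems first build that alignment as a confusion network.

```python
# Word-level majority voting over pre-aligned system outputs.
from collections import Counter

def vote(hypotheses):
    """hypotheses: list of equal-length token lists, already aligned."""
    combined = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]  # majority wins each slot
        combined.append(word)
    return combined

h1 = "the cat sat on the mat".split()
h2 = "the cat sad on the mat".split()
h3 = "a cat sat on the mat".split()
print(" ".join(vote([h1, h2, h3])))   # -> "the cat sat on the mat"
```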
no code implementations • 3 May 2019 • Takuya Yoshioka, Zhuo Chen, Dimitrios Dimitriadis, William Hinthorn, Xuedong Huang, Andreas Stolcke, Michael Zeng
The speaker-attributed WER (SAWER) is 26.7%.
no code implementations • 13 Apr 2019 • Takuya Yoshioka, Zhuo Chen, Changliang Liu, Xiong Xiao, Hakan Erdogan, Dimitrios Dimitriadis
Speaker independent continuous speech separation (SI-CSS) is a task of converting a continuous audio stream, which may contain overlapping voices of unknown speakers, into a fixed number of continuous signals each of which contains no overlapping speech segment.
no code implementations • 8 Oct 2018 • Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao, Fil Alleva
The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers are overlapped.