no code implementations • 11 Nov 2024 • Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue
End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years.
1 code implementation • 28 May 2024 • Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng
There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation.
1 code implementation • 10 Apr 2024 • Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng
In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation.
no code implementations • 30 May 2023 • Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng
State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
1 code implementation • 15 Mar 2023 • Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng
Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources.
no code implementations • 27 Aug 2022 • Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu
This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers.
no code implementations • 7 Apr 2022 • Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka
In this paper, we propose a three-stage training scheme for the CSS model that can leverage both supervised data and extra large-scale unsupervised real-world conversational data.
no code implementations • 24 Jan 2022 • Takuya Yoshioka, Xiaofei Wang, Dongmei Wang
Since PickNet utilizes only limited acoustic context at each time frame, the system using the proposed model works in real time and is robust to changes in acoustic conditions.
no code implementations • 13 Oct 2021 • Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez
Recently, the all deep learning MVDR (ADL-MVDR) model was proposed for neural beamforming and demonstrated superior performance in a target speech extraction task using pre-segmented input.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • 12 Oct 2021 • Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo Chen, Naoyuki Kanda
Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription.
no code implementations • 3 Mar 2021 • Dongmei Wang, Takuya Yoshioka, Zhuo Chen, Xiaofei Wang, Tianyan Zhou, Zhong Meng
Prior studies show, with a spatial-temporalinterleaving structure, neural networks can efficiently utilize the multi-channel signals of the ad hoc array.
no code implementations • 28 Apr 2020 • Dongmei Wang, Zhuo Chen, Takuya Yoshioka
The inter-channel processing layers apply a self-attention mechanism along the channel dimension to exploit the information obtained with a varying number of microphones.