This paper develops a framework that can accurately perform denoising, dereverberation, and source separation using a relatively small number of microphones.
This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS).
Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g., a microphone array).
Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory.
Herein, attention mechanisms capture temporal dependencies in the audio signal by focusing on the frames that are most relevant for estimating the activity and direction of arrival of sound events at the current time-step.
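A minimal sketch of this idea, assuming a PyTorch setting and hypothetical layer names (not the paper's exact architecture): self-attention lets every frame embedding attend to its context before per-frame activity and direction-of-arrival (DOA) predictions are made.

```python
import torch
import torch.nn as nn

class AttentiveSELDHead(nn.Module):
    """Toy attention head for sound event activity + DOA estimation."""
    def __init__(self, dim=128, n_events=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4,
                                          batch_first=True)
        self.activity = nn.Linear(dim, n_events)   # per-event activity logits
        self.doa = nn.Linear(dim, n_events * 3)    # per-event (x, y, z) vector

    def forward(self, frames):                      # frames: (batch, T, dim)
        # Each time-step attends to all frames, so temporally distant but
        # relevant frames can inform the current prediction.
        ctx, _ = self.attn(frames, frames, frames)
        return self.activity(ctx), self.doa(ctx)
```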
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multi-channel conditions.
We address a blind source separation (BSS) problem in a noisy reverberant environment in which the number of microphones $M$ is greater than the number of sources of interest, and the other noise components can be approximated as stationary and Gaussian distributed.
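One common way to write this setup down, with notation assumed here rather than taken from the paper: the $M$-channel STFT observation is modeled as a sum of $N < M$ point sources plus a Gaussian noise term whose covariance does not depend on time, which is where the stationarity assumption enters.

```latex
\mathbf{x}_{f,t} = \sum_{n=1}^{N} \mathbf{a}_{f,n}\, s_{f,t,n} + \mathbf{n}_{f,t},
\qquad
\mathbf{n}_{f,t} \sim \mathcal{N}_{\mathbb{C}}\!\left(\mathbf{0}, \mathbf{R}_{f}\right),
```

where $\mathbf{a}_{f,n}$ is the steering vector of source $n$ at frequency $f$ and the time-invariant noise covariance $\mathbf{R}_f$ encodes the stationary Gaussian approximation.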
Recently, an audio-visual target speaker extraction method has been proposed that extracts the target speech using complementary audio and visual clues.
Based on this approach, we present FastFCA, a computationally efficient extension of full-rank spatial covariance analysis (FCA).
Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest.
Developing microphone array technologies that work with a small number of microphones is important, given the hardware constraints of many devices.
In this paper, we propose a method that combines a variational autoencoder (VAE) model of speech with a spatial clustering approach for multi-channel speech separation.
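As a heavily simplified sketch of how such a combination could work (assumed shapes and helper names, not the paper's algorithm): per-source spectral variances decoded by a pretrained VAE are plugged into a rank-1 spatial Gaussian model, and posterior time-frequency masks follow from Bayes' rule, as in an EM-style E-step.

```python
import numpy as np

def posterior_masks(X, steering, vae_spectra, noise_var=1e-3):
    """E-step-style mask estimation (illustrative, not the paper's method).

    X:           (F, T, M) multi-channel STFT of the mixture
    steering:    (F, K, M) steering vector per source (assumed given)
    vae_spectra: (F, T, K) spectral variances from a pretrained VAE decoder
    """
    F, T, M = X.shape
    K = steering.shape[1]
    log_lik = np.empty((F, T, K))
    for k in range(K):
        a = steering[:, k, :]                              # (F, M)
        # Beamformed power of the mixture under source k's spatial model
        proj = np.abs(np.einsum('fm,ftm->ft', a.conj(), X)) ** 2
        var = vae_spectra[..., k] + noise_var              # (F, T)
        log_lik[..., k] = -proj / var - np.log(var)        # Gaussian log-lik.
    log_lik -= log_lik.max(axis=-1, keepdims=True)         # numerical stability
    post = np.exp(log_lik)
    return post / post.sum(axis=-1, keepdims=True)         # (F, T, K) masks
```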
We also develop a new block coordinate descent (BCD) algorithm for semiblind independent vector extraction (IVE), in which the transfer functions of several super-Gaussian sources are given a priori.
Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is known, but in realistic situations it is typically unknown.
Automatic meeting analysis is a fundamental technology required to let, e.g., smart devices follow and respond to our conversations.
With the advent of deep learning, research on noise-robust automatic speech recognition (ASR) has progressed rapidly.
First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for a time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation.
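The speaker-conditioning idea can be sketched as follows, assuming a PyTorch setting with hypothetical module names (the actual network is a TasNet-style time-domain separator): intermediate separator features are modulated element-wise by an embedding of the target speaker computed from an enrollment utterance.

```python
import torch
import torch.nn as nn

class MultiplicativeAdaptation(nn.Module):
    """Condition separator features on a target-speaker embedding."""
    def __init__(self, feat_dim=256, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(emb_dim, feat_dim)

    def forward(self, feats, spk_emb):
        # feats:   (batch, feat_dim, time) -- intermediate separator features
        # spk_emb: (batch, emb_dim)        -- from an enrollment utterance
        scale = self.proj(spk_emb).unsqueeze(-1)  # (batch, feat_dim, 1)
        return feats * scale                      # target-aware features
```

Element-wise multiplication is one of several adaptation schemes explored for SpeakerBeam; concatenation and factorized adaptation layers are alternatives with the same goal of steering the separator toward the enrolled speaker.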
The rising interest in single-channel multi-speaker speech separation has sparked the development of end-to-end (E2E) approaches to multi-speaker speech recognition.
We previously proposed an optimal (in the maximum-likelihood sense) convolutional beamformer that can perform simultaneous denoising and dereverberation, and showed its superiority over the widely used cascade of a weighted prediction error (WPE) dereverberation filter and a conventional minimum power distortionless response (MPDR) beamformer.
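In this formulation (notation paraphrased, not copied from the paper), a single filter $\bar{\mathbf{w}}_f$ is applied to the current frame stacked with delayed frames, $\bar{\mathbf{x}}_{t,f}$, and the maximum-likelihood solution has an MPDR-like closed form with a source-power-weighted covariance:

```latex
\bar{\mathbf{w}}_f
  = \frac{\mathbf{R}_f^{-1}\,\bar{\mathbf{v}}_f}
         {\bar{\mathbf{v}}_f^{\mathsf{H}}\,\mathbf{R}_f^{-1}\,\bar{\mathbf{v}}_f},
\qquad
\mathbf{R}_f
  = \sum_{t} \frac{\bar{\mathbf{x}}_{t,f}\,\bar{\mathbf{x}}_{t,f}^{\mathsf{H}}}{\sigma_{t,f}^{2}},
```

where $\bar{\mathbf{v}}_f$ is the steering vector zero-padded over the delayed taps and $\sigma_{t,f}^{2}$ is the time-varying power of the target source; the weighting by $1/\sigma_{t,f}^{2}$ is what folds dereverberation and denoising into one filter.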
While significant progress has been made on the individual tasks, this paper presents, for the first time, an all-neural approach to simultaneous speaker counting, diarization, and source separation.