We thus introduce a learning-based framework that computes optimal attention weights for beamforming.
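As a sketch of how learned attention weights could be used in this way (the pooling scheme and function names below are assumptions for illustration, not the paper's exact design), per-frame weights can pool observations into spatial covariance matrices that drive a standard MVDR beamformer:

```python
# Sketch: attention weights pool time-frequency frames into spatial covariance
# matrices, which then drive a standard MVDR beamformer. The pooling scheme and
# function names are assumptions for illustration.
import numpy as np

def attention_pooled_covariance(X, att):
    """X: (F, T, M) complex STFT observations; att: (F, T) attention weights."""
    R = np.einsum('ft,ftm,ftn->fmn', att, X, X.conj())
    return R / np.maximum(att.sum(axis=1), 1e-8)[:, None, None]

def mvdr_weights(R_speech, R_noise):
    """MVDR filter per frequency; steering vector = principal eigenvector of R_speech."""
    F, M, _ = R_speech.shape
    W = np.zeros((F, M), dtype=complex)
    for f in range(F):
        d = np.linalg.eigh(R_speech[f])[1][:, -1]   # principal eigenvector
        num = np.linalg.solve(R_noise[f], d)        # R_n^{-1} d
        W[f] = num / (d.conj() @ num)               # w = R_n^{-1} d / (d^H R_n^{-1} d)
    return W
```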
We can achieve this with a neural network that extracts the target SEs, conditioned on clues representing the target SE classes.
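A minimal sketch of such class-conditioned extraction, assuming a FiLM-style multiplicative conditioning (the architecture and names below are illustrative assumptions, not the paper's exact network):

```python
# Sketch of class-conditioned extraction: a learned embedding of the target
# sound-event class multiplies an intermediate feature map (FiLM-style).
# Architecture and names are assumptions for illustration.
import torch
import torch.nn as nn

class ClassConditionedExtractor(nn.Module):
    def __init__(self, n_freq=257, n_classes=10, hidden=256):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.clue = nn.Embedding(n_classes, hidden)   # one clue vector per SE class
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, spec, target_class):
        """spec: (B, T, F) magnitude spectrogram; target_class: (B,) class ids."""
        h = self.pre(spec)                            # (B, T, hidden)
        h = h * self.clue(target_class).unsqueeze(1)  # inject the target-class clue
        return self.post(h) * spec                    # masked spectrogram of target SE
```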
The artifact component is defined as the SE error signal that cannot be represented as a linear combination of speech and noise sources.
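A worked formulation in the spirit of the BSS Eval decomposition (the notation here is assumed): projecting the estimate \(\hat{s}\) onto the subspace spanned by the (possibly delayed) speech and noise source signals leaves the artifact as the residual,

\[
e_{\mathrm{artif}} \;=\; \hat{s} - P_{\{s,n\}}\,\hat{s},
\]

where \(P_{\{s,n\}}\) denotes the orthogonal projection onto that subspace, so that, by construction, no linear combination of the speech and noise sources can explain \(e_{\mathrm{artif}}\).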
This paper develops a framework that can accurately perform denoising, dereverberation, and source separation using a relatively small number of microphones.
This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS).
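One common way to write a CBF (a sketch; the paper's exact parameterization may differ) applies a single multichannel filter to the current frame and to past frames beyond a prediction delay \(\Delta\), so that one filter can simultaneously suppress noise, late reverberation, and interfering sources:

\[
\hat{s}_{t,f} \;=\; \bar{\mathbf{w}}_f^{\mathsf H}\,\bar{\mathbf{x}}_{t,f},
\qquad
\bar{\mathbf{x}}_{t,f} \;=\; \big[\mathbf{x}_{t,f}^{\mathsf T},\ \mathbf{x}_{t-\Delta,f}^{\mathsf T},\ \ldots,\ \mathbf{x}_{t-\Delta-L+1,f}^{\mathsf T}\big]^{\mathsf T},
\]

where \(\mathbf{x}_{t,f}\) stacks the M microphone STFT coefficients at frame t and frequency f.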
Target sound extraction is the task of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds.
Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g., a microphone array).
Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory.
Here, attention allows the model to capture temporal dependencies in the audio signal by focusing on the frames that are relevant for estimating the activity and direction of arrival of sound events at the current time step.
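A minimal sketch of such temporal attention (the design and module names are assumptions, not the paper's exact architecture):

```python
# Sketch of temporal self-attention over frames: each time step attends to the
# frames most informative for activity and DOA estimation. Assumed design.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, h):
        """h: (B, T, dim) frame features from an audio encoder."""
        scores = self.q(h) @ self.k(h).transpose(1, 2) * self.scale  # (B, T, T)
        return torch.softmax(scores, dim=-1) @ self.v(h)             # (B, T, dim)
```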
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
Recently, an audio-visual target speaker extraction method was proposed that extracts the target speech using complementary audio and visual clues.
Developing microphone array technologies that use a small number of microphones is important because many devices can accommodate only a few of them.
We also develop a new BCD algorithm for semiblind IVE, in which the transfer functions of several super-Gaussian sources are given a priori.
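For context, BCD algorithms of this kind typically optimize a maximum-likelihood objective of the IVA/IVE family (sketched below in my own notation, up to additive constants; the semiblind variant fixes the transfer functions of some sources):

\[
\max_{\{\mathbf{W}_f\}} \;\; \sum_{t=1}^{T}\sum_{k} \log p\big(\{\mathbf{w}_{k,f}^{\mathsf H}\mathbf{x}_{t,f}\}_f\big) \;+\; 2T\sum_{f}\log\big|\det \mathbf{W}_f\big|,
\]

where \(\mathbf{w}_{k,f}\) is the k-th row of the demixing matrix \(\mathbf{W}_f\) and p is a super-Gaussian source prior; BCD updates one block of demixing filters at a time while keeping the others fixed.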
In this paper, we instead propose a universal sound selection neural network that can directly select AE sounds from a mixture given user-specified target AE classes.
Automatic meeting analysis is a fundamental technology required to let smart devices, for example, follow and respond to our conversations.
First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for a time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation.
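A minimal sketch of such a time-domain extraction network (the layout, layer sizes, and names are illustrative assumptions; the actual SpeakerBeam and TasNet architectures are more elaborate):

```python
# Sketch of a TasNet-style time-domain extraction network: learned conv encoder,
# separator modulated by a target-speaker embedding, transposed-conv decoder.
# Layer sizes and names are assumptions for illustration.
import torch
import torch.nn as nn

class TimeDomainSpeakerBeam(nn.Module):
    def __init__(self, n_filters=256, kernel=16, stride=8, spk_dim=256):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
        self.adapt = nn.Linear(spk_dim, n_filters)     # project speaker embedding
        self.separator = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 1), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)

    def forward(self, mix, spk_emb):
        """mix: (B, 1, samples); spk_emb: (B, spk_dim) from an auxiliary network."""
        enc = torch.relu(self.encoder(mix))            # (B, n_filters, frames)
        h = enc * self.adapt(spk_emb).unsqueeze(-1)    # multiplicative adaptation
        return self.decoder(self.separator(h) * enc)   # estimated target waveform
```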
While significant progress has been made on the individual tasks, this paper presents, for the first time, an all-neural approach to simultaneous speaker counting, diarization, and source separation.