Search Results for author: Keisuke Kinoshita

Found 35 papers, 10 papers with code

Neural Target Speech Extraction: An Overview

1 code implementation • 31 Jan 2023 • Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu

Humans can listen to a target speaker even in challenging acoustic conditions with noise, reverberation, and interfering speakers.

Speech Extraction

On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

1 code implementation • 29 Nov 2022 • Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

We propose a general framework to compute the word error rate (WER) of ASR systems that take recordings containing multiple speakers as input and produce multiple output word sequences (MIMO).
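
A minimal sketch of the assignment idea behind such a metric (our illustration, not the paper's implementation; the name `mimo_wer` is ours, it assumes equally many reference and hypothesis streams, and it searches permutations exhaustively, which is only feasible for a few speakers):

```python
import itertools

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance with a single-row DP table."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def mimo_wer(references, hypotheses):
    """WER minimized over assignments of hypothesis streams to reference
    speakers; references and hypotheses are lists of word lists."""
    best = min(
        sum(edit_distance(r, h) for r, h in zip(references, perm))
        for perm in itertools.permutations(hypotheses)
    )
    return best / sum(len(r) for r in references)

refs = [["hello", "world"], ["good", "morning", "everyone"]]
hyps = [["good", "morning", "everyone"], ["hello", "word"]]
print(mimo_wer(refs, hyps))  # 0.2: one substitution over five reference words
```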

Speech Recognition

Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

1 code implementation • 28 Jul 2022 • Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

In this paper, we argue that such a segmentation-based approach has several issues; for example, it inevitably faces a dilemma: larger segment sizes increase both the context available for improving performance and the number of speakers the local EEND module has to handle.

Clustering Segmentation +2

Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

no code implementations • 16 Jun 2022 • Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

Experimental validation shows that both worst-enrollment target training and SI-loss training improve robustness against enrollment variations by increasing speaker discriminability.
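
A rough sketch of how we read "worst-enrollment target training" (illustrative only; `model` and `loss_fn` are hypothetical stand-ins for a target speech extraction network and its training loss):

```python
import torch

def worst_enrollment_loss(model, mixture, target, enrollments, loss_fn):
    """Evaluate the extraction loss with several enrollment utterances of
    the same target speaker and train on the worst (largest) one, so the
    model cannot rely on a single favorable enrollment."""
    losses = torch.stack([
        loss_fn(model(mixture, enroll), target) for enroll in enrollments
    ])
    return losses.max()
```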

Speaker Identification Speech Extraction

Listen only to me! How well can target speech extraction handle false alarms?

no code implementations • 11 Apr 2022 • Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani

Target speech extraction (TSE) extracts the speech of a target speaker from a mixture, given auxiliary clues characterizing the speaker, such as an enrollment utterance.

Speaker Identification Speaker Verification +2

SA-SDR: A novel loss function for separation of meeting style data

no code implementations • 29 Oct 2021 • Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function.
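Our reading of the contrast the paper draws, as a minimal numpy sketch (not the authors' code): the conventional objective averages per-source SDRs, whereas the source-aggregated variant pools signal and error energies over all sources before taking the ratio, which stays well behaved when a source is silent in a segment.

```python
import numpy as np

def averaged_sdr(refs, ests, eps=1e-8):
    """Mean of per-source SDRs (conventional objective).
    refs, ests: arrays of shape (sources, samples)."""
    num = np.sum(refs ** 2, axis=-1)
    den = np.sum((refs - ests) ** 2, axis=-1)
    return np.mean(10 * np.log10((num + eps) / (den + eps)))

def sa_sdr(refs, ests, eps=1e-8):
    """Source-aggregated SDR: energies are pooled over all sources
    before the log-ratio, so a silent source cannot blow up the loss."""
    num = np.sum(refs ** 2)
    den = np.sum((refs - ests) ** 2)
    return 10 * np.log10((num + eps) / (den + eps))
```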

Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

1 code implementation • 30 Jul 2021 • Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

When processing meeting-like data segment-wise, i.e., by separating overlapping segments independently and stitching adjacent segments into continuous output streams, this constraint has to be fulfilled for every segment.

Speech Separation

Speeding Up Permutation Invariant Training for Source Separation

1 code implementation • 30 Jul 2021 • Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

The Hungarian algorithm can be used for uPIT, and we introduce several algorithms for the Graph-PIT assignment problem that reduce the complexity to polynomial in the number of utterances.
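
For the classical uPIT case, the assignment step is a linear sum assignment over a matrix of pairwise losses, which the Hungarian algorithm solves in polynomial time; a minimal sketch (illustrative, not the paper's Graph-PIT code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pit_mse(refs, ests):
    """Permutation-invariant MSE. refs, ests: (speakers, samples).
    Builds the (speakers x speakers) matrix of pairwise losses, then the
    Hungarian algorithm finds the cheapest assignment in O(speakers^3)
    instead of checking all speakers! permutations."""
    cost = ((refs[:, None, :] - ests[None, :, :]) ** 2).mean(-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), cols  # loss and best permutation
```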

Few-shot learning of new sound classes for target sound extraction

no code implementations • 14 Jun 2021 • Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki

Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds.

Few-Shot Learning Target Sound Extraction

PILOT: Introducing Transformers for Probabilistic Sound Event Localization

1 code implementation • 7 Jun 2021 • Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g., a microphone array).

Event Detection

Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

no code implementations • 2 Jun 2021 • Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo

To answer the question 'Should we always separate?', we analyze ASR performance on observed and enhanced speech under various noise and interference conditions, and show that speech enhancement degrades ASR in some conditions, even for overlapping speech.

Automatic Speech Recognition (ASR) +3

Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech

1 code implementation • 19 May 2021 • Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

This paper (1) reports recent advances we made to this framework, including newly introduced robust constrained clustering algorithms, and (2) experimentally shows that the method can now significantly outperform competitive diarization methods such as Encoder-Decoder Attractor (EDA)-EEND on the CALLHOME data, which comprises real conversational speech with overlapped speech and an arbitrary number of speakers.
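
The paper's robust constrained clustering algorithms are more elaborate, but the basic mechanism of cannot-link constraints (e.g., embeddings of different local speakers from the same chunk must not merge) can be sketched with a COP-k-means-style assignment step (our generic illustration):

```python
import numpy as np

def constrained_assign(embeddings, centroids, cannot_link):
    """Assign each embedding to the nearest centroid that does not place it
    in the same cluster as an already-assigned cannot-link partner.
    cannot_link: iterable of (i, j) index pairs that must stay apart;
    a label stays -1 if no feasible cluster exists."""
    partners = {}
    for a, b in cannot_link:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    labels = -np.ones(len(embeddings), dtype=int)
    for i, x in enumerate(embeddings):
        dists = np.linalg.norm(centroids - x, axis=1)
        for c in np.argsort(dists):  # try centroids nearest-first
            if all(labels[j] != c for j in partners.get(i, ())):
                labels[i] = c
                break
    return labels
```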

Constrained Clustering Speaker Diarization +1

Comparison of remote experiments using crowdsourcing and laboratory experiments on speech intelligibility

no code implementations • 17 Apr 2021 • Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory.

Speech Enhancement

Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization

1 code implementation • 28 Feb 2021 • Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

Here, attention allows the model to capture temporal dependencies in the audio signal by focusing on specific frames that are relevant for estimating the activity and direction of arrival of sound events at the current time step.
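
The underlying building block is standard scaled dot-product attention over the sequence of audio frames; a minimal numpy sketch of that mechanism (generic, not the paper's exact architecture):

```python
import numpy as np

def frame_attention(q, k, v):
    """q: (T_out, d) queries; k, v: (T_in, d) audio-frame features.
    Each output frame is a weighted sum of input frames, with the softmax
    weights concentrating on frames relevant to the current time step."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (T_out, T_in)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # softmax over frames
    return weights @ v, weights
```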

Automatic Speech Recognition (ASR) +1

Multimodal Attention Fusion for Target Speaker Extraction

no code implementations • 2 Feb 2021 • Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

Recently, an audio-visual target speaker extraction method has been proposed that extracts target speech using complementary audio and visual clues.

Target Speaker Extraction

Speaker activity driven neural speech extraction

no code implementations • 14 Jan 2021 • Marc Delcroix, Katerina Zmolikova, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani

Target speech extraction, which extracts the speech of a target speaker from a mixture given auxiliary speaker clues, has recently received increased interest.

Speech Extraction

Neural Network-based Virtual Microphone Estimator

no code implementations • 12 Jan 2021 • Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki

Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices.

Speech Enhancement

Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds

no code implementations • 26 Oct 2020 • Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

In this paper, we propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.

Clustering

Listen to What You Want: Neural Network-based Universal Sound Selector

no code implementations • 10 Jun 2020 • Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki

In this paper, we instead propose a universal sound selection neural network that can directly select AE sounds from a mixture given user-specified target AE classes.

Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

no code implementations • 4 Jun 2020 • Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown.

Automatic Speech Recognition (ASR) +2

Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system

no code implementations • 9 Mar 2020 • Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani

Automatic meeting analysis is a fundamental technology required to let smart devices, for example, follow and respond to our conversations.

Speaker Diarization +1

Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

1 code implementation • 23 Jan 2020 • Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for a time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation.
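
Schematically, SpeakerBeam-style extraction conditions a separation network on an embedding pooled from the enrollment utterance; a toy PyTorch sketch of the multiplicative adaptation idea (layer sizes and shapes are our choices, not the paper's):

```python
import torch
import torch.nn as nn

class TinySpeakerBeam(nn.Module):
    """Toy SpeakerBeam-style extractor, for illustration only: a 1-D conv
    encoder turns waveforms into feature frames, the enrollment is pooled
    into a speaker embedding that multiplicatively scales the mixture
    features, and a mask selects the target speaker."""
    def __init__(self, feat=64, win=16):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat, win, stride=win // 2)
        self.spk_proj = nn.Linear(feat, feat)
        self.mask_net = nn.Conv1d(feat, feat, 1)
        self.decoder = nn.ConvTranspose1d(feat, 1, win, stride=win // 2)

    def forward(self, mixture, enrollment):  # each: (batch, 1, samples)
        mix = self.encoder(mixture)                     # (B, F, T)
        spk = self.encoder(enrollment).mean(-1)         # (B, F) pooled embedding
        spk = self.spk_proj(spk).unsqueeze(-1)          # (B, F, 1)
        mask = torch.sigmoid(self.mask_net(mix * spk))  # speaker-adapted mask
        return self.decoder(mix * mask)                 # estimated target speech
```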

Speaker Identification Speech Extraction

End-to-end training of time domain audio separation and recognition

no code implementations • 18 Dec 2019 • Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

The rising interest in single-channel multi-speaker speech separation has sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition.

Speaker Recognition Speech Recognition +2

Jointly optimal dereverberation and beamforming

no code implementations • 30 Oct 2019 • Christoph Boeddeker, Tomohiro Nakatani, Keisuke Kinoshita, Reinhold Haeb-Umbach

We previously proposed an optimal (in the maximum likelihood sense) convolutional beamformer that can perform simultaneous denoising and dereverberation, and showed its superiority over the widely used cascade of a WPE dereverberation filter and a conventional MPDR beamformer.
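
For reference, the conventional MPDR beamformer mentioned here has a closed form per frequency bin, w = R⁻¹d / (dᴴR⁻¹d), with R the spatial covariance of the observed signal and d the steering vector; a minimal numpy sketch of that cascade component (the paper's point is that jointly optimizing dereverberation and beamforming beats this cascade):

```python
import numpy as np

def mpdr_weights(X, d, eps=1e-6):
    """MPDR weights for one frequency bin: w = R^{-1} d / (d^H R^{-1} d).
    X: (mics, frames) observed STFT coefficients; d: (mics,) steering
    vector. Minimizes output power under a distortionless constraint
    toward the target direction."""
    R = X @ X.conj().T / X.shape[1]                        # spatial covariance
    R += eps * np.trace(R).real / len(d) * np.eye(len(d))  # diagonal loading
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)  # apply as w.conj() @ X
```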

Denoising

All-neural online source separation, counting, and diarization for meeting analysis

no code implementations • 21 Feb 2019 • Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb-Umbach

While significant progress has been made on individual tasks, this paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation.

Automatic Speech Recognition (ASR) +3