no code implementations • 25 May 2023 • Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami
Neural transducer (RNNT)-based target-speaker speech recognition (TS-RNNT) directly transcribes a target speaker's voice from a multi-talker mixture.
no code implementations • 24 May 2023 • Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo
In this work, we propose a new SE training criterion that minimizes the distance between clean and enhanced signals in the feature representation of the SSL model to alleviate the mismatch.
no code implementations • 23 May 2023 • Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki
Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods.
no code implementations • 2 Mar 2023 • Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura
The first technique is to utilize a text-to-speech (TTS) system to generate synthesized speech, which is used for E2E SSum training with the text summary.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • 31 Jan 2023 • Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu
Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers.
1 code implementation • 29 Nov 2022 • Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach
We present a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO).
no code implementations • 9 Sep 2022 • Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takahiro Shinozaki
We confirm in experiments that our TS-ASR achieves comparable recognition performance with conventional cascade systems in the offline setting, while reducing computation costs and realizing streaming TS-ASR.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • 15 Aug 2022 • Ján Švec, Kateřina Žmolíková, Martin Kocour, Marc Delcroix, Tsubasa Ochiai, Ladislav Mošner, Jan Černocký
One of the factors causing such degradation may be intrinsic speaker variability, such as emotions, occurring commonly in realistic speech.
1 code implementation • 28 Jul 2022 • Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach
In this paper, we argue that such an approach involving the segmentation has several issues; for example, it inevitably faces a dilemma that larger segment sizes increase both the context available for enhancing the performance and the number of speakers for the local EEND module to handle.
no code implementations • 25 Jul 2022 • Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino
We use it to bridge modality-dependent information, i. e., the speech segments in the mixture, and the specified, modality-independent concept.
no code implementations • 16 Jun 2022 • Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura
Experimental validation reveals the effectiveness of both worst-enrollment target training and SI-loss training to improve robustness against enrollment variations, by increasing speaker discriminability.
no code implementations • 7 May 2022 • Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki
We thus introduce a learning-based framework that computes optimal attention weights for beamforming.
no code implementations • 11 Apr 2022 • Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani
Target speech extraction (TSE) extracts the speech of a target speaker in a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance.
no code implementations • 8 Apr 2022 • Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki
We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing the target SE classes.
no code implementations • 14 Feb 2022 • Keisuke Kinoshita, Marc Delcroix, Tomoharu Iwata
Speaker diarization has been investigated extensively as an important central task for meeting analysis.
no code implementations • 18 Jan 2022 • Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri
The artifact component is defined as the SE error signal that cannot be represented as a linear combination of speech and noise sources.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 11 Jan 2022 • Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya
To mitigate the degradation, we introduced a rule-based method to switch the ASR input between the enhanced and observed signals, which showed promising results.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
2 code implementations • 16 Nov 2021 • Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • 31 Oct 2021 • Martin Kocour, Kateřina Žmolíková, Lucas Ondel, Ján Švec, Marc Delcroix, Tsubasa Ochiai, Lukáš Burget, Jan Černocký
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
no code implementations • 29 Oct 2021 • Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function.
1 code implementation • 30 Jul 2021 • Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach
The Hungarian algorithm can be used for uPIT and we introduce various algorithms for the Graph-PIT assignment problem to reduce the complexity to be polynomial in the number of utterances.
1 code implementation • 30 Jul 2021 • Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
When processing meeting-like data in a segment-wise manner, i. e., by separating overlapping segments independently and stitching adjacent segments to continuous output streams, this constraint has to be fulfilled for any segment.
no code implementations • 14 Jun 2021 • Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki
Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds.
1 code implementation • 7 Jun 2021 • Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa
Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e. g. a microphone array).
no code implementations • 2 Jun 2021 • Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo
', we analyze ASR performance on observed and enhanced speech at various noise and interference conditions, and show that speech enhancement degrades ASR under some conditions even for overlapping speech.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
1 code implementation • 19 May 2021 • Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara
This paper is to (1) report recent advances we made to this framework, including newly introduced robust constrained clustering algorithms, and (2) experimentally show that the method can now significantly outperform competitive diarization methods such as Encoder-Decoder Attractor (EDA)-EEND, on CALLHOME data which comprises real conversational speech data including overlapped speech and an arbitrary number of speakers.
1 code implementation • 28 Feb 2021 • Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa
Herein, attentions allow for capturing temporal dependencies in the audio signal by focusing on specific frames that are relevant for estimating the activity and direction-of-arrival of sound events at the current time-step.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+1
no code implementations • 23 Feb 2021 • Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian
A transformer-based dual-path system is proposed, which integrates transform layers for global modeling.
no code implementations • 23 Feb 2021 • Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian
Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions.
1 code implementation • 23 Feb 2021 • Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
no code implementations • 2 Feb 2021 • Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki
Recently an audio-visual target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues.
no code implementations • 14 Jan 2021 • Marc Delcroix, Katerina Zmolikova, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani
Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest.
no code implementations • 12 Jan 2021 • Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki
Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices.
no code implementations • 17 Dec 2020 • Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen
Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years.
1 code implementation • 24 Nov 2020 • Katerina Zmolikova, Marc Delcroix, Lukáš Burget, Tomohiro Nakatani, Jan "Honza" Černocký
In this paper, we propose a method combining variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation.
Audio and Speech Processing
no code implementations • 26 Oct 2020 • Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara
In this paper, we propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
no code implementations • 10 Jun 2020 • Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki
In this paper, we propose instead a universal sound selection neural network that enables to directly select AE sounds from a mixture given user-specified target AE classes.
no code implementations • 4 Jun 2020 • Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 9 Mar 2020 • Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani
Automatic meeting analysis is an essential fundamental technology required to let, e. g. smart devices follow and respond to our conversations.
no code implementations • 9 Mar 2020 • Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani
With the advent of deep learning, research on noise-robust automatic speech recognition (ASR) has progressed rapidly.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
no code implementations • 14 Feb 2020 • Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, Daiki Takeuchi
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance.
1 code implementation • 23 Jan 2020 • Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki
First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for a time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation.
no code implementations • 18 Dec 2019 • Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition.
no code implementations • 21 Feb 2019 • Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb-Umbach
While significant progress has been made on individual tasks, this paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3