no code implementations • 16 Oct 2024 • Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix
This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues.
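A minimal numpy sketch of how speech activities could serve as pooling clues, purely for illustration: the function and weighting scheme below are hypothetical stand-ins, not the paper's actual neural extractor.

import numpy as np

def guided_embedding(frames, target_vad, interference_vad):
    # frames: (T, D) frame-level features;
    # target_vad / interference_vad: (T,) binary speech activities.
    # Emphasize frames where only the target speaks; down-weight overlaps.
    weights = target_vad * (1.0 - 0.5 * interference_vad)
    weights = weights / (weights.sum() + 1e-8)
    return weights @ frames  # (D,) speaker embedding for the target

rng = np.random.default_rng(0)
embedding = guided_embedding(rng.standard_normal((100, 32)),
                             (rng.random(100) > 0.5).astype(float),
                             (rng.random(100) > 0.7).astype(float))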
no code implementations • 15 Oct 2024 • Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, Hiroshi Sato
To this end, for the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders (i.e., self-supervised or speaker recognition models) that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
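As a rough sketch of the two conditioning schemes under comparison (the mean-pooling encoder below is a hypothetical stand-in for the pre-trained speaker encoders named above):

import numpy as np

def enrollment_embedding(enrollment_frames):
    # Stand-in for a pre-trained speaker encoder applied to enrollment
    # speech; a real encoder would be a self-supervised or speaker
    # recognition model, not simple mean pooling.
    return enrollment_frames.mean(axis=0)

def ideal_one_hot_embedding(speaker_id, num_speakers):
    # "Ideal" embedding derived directly from the target speaker's identity.
    embedding = np.zeros(num_speakers)
    embedding[speaker_id] = 1.0
    return embedding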
no code implementations • 30 Sep 2024 • Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura
Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers' speech with just one round of encoder processing.
Automatic Speech Recognition (ASR)
no code implementations • 9 Sep 2024 • Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki
We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track.
no code implementations • 30 Aug 2024 • Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix
This paper proposes a method for extracting speaker embedding for each speaker from a variable-length recording containing multiple speakers.
no code implementations • 1 Jul 2024 • Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix
Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner.
no code implementations • 27 Jun 2024 • Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura
We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity.
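A hedged sketch of the GtS idea, with the decoder stubbed out: style factors are chosen greedily (argmax), then caption tokens are sampled from a factor-conditioned, temperature-scaled distribution.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gts_decode(factor_logits, token_logits_fn, num_steps, rng, temperature=0.8):
    # Greedy stage: deterministic factor prediction for semantic accuracy.
    factors = [int(np.argmax(logits)) for logits in factor_logits]
    # Sampling stage: factor-conditioned sampling for caption diversity.
    tokens = []
    for _ in range(num_steps):
        probs = softmax(token_logits_fn(factors, tokens) / temperature)
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return factors, tokens

rng = np.random.default_rng(0)
dummy_decoder = lambda factors, tokens: rng.standard_normal(50)  # placeholder model
style_factors, caption_tokens = gts_decode([np.array([0.1, 2.0, -1.0])],
                                           dummy_decoder, num_steps=5, rng=rng)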
no code implementations • 13 Feb 2024 • Hiroyuki Namba, Shota Horiguchi, Masaki Hamamoto, Masashi Egi
Data cleansing aims to improve model performance by removing a set of harmful instances from the training dataset.
no code implementations • 2 Sep 2023 • Shota Horiguchi, Kota Dohi, Yohei Kawaguchi
One of the challenges in deploying a machine learning model is that the model's performance degrades as the operating environment changes.
no code implementations • 24 May 2023 • Aoi Ito, Shota Horiguchi
Large-scale pretrained models using self-supervised learning have reportedly improved the performance of speech anti-spoofing.
no code implementations • 7 Oct 2022 • Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia
This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately.
no code implementations • 1 Jul 2022 • Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola García, Yohei Kawaguchi
In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model.
Automatic Speech Recognition (ASR)
no code implementations • 6 Jun 2022 • Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, Yohei Kawaguchi
Finally, to improve online diarization, we refine the buffer update method and revisit the variable chunk-size training of EEND.
1 code implementation • 25 May 2022 • Terufumi Morishita, Gaku Morio, Shota Horiguchi, Hiroaki Ozaki, Nobuo Nukaga
We propose a fundamental theory on ensemble learning that answers the central question: what factors make an ensemble system good or bad?
1 code implementation • 24 Apr 2022 • Natsuo Yamashita, Shota Horiguchi, Takeshi Homma
Due to the lack of annotated real conversational datasets, EEND is usually pretrained on a large-scale simulated conversational dataset first and then adapted to the target real dataset.
no code implementations • 1 Dec 2021 • Yuki Okamoto, Shota Horiguchi, Masaaki Yamamoto, Keisuke Imoto, Yohei Kawaguchi
An onomatopoeic word, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre.
no code implementations • 10 Oct 2021 • Shota Horiguchi, Yuki Takashima, Paola Garcia, Shinji Watanabe, Yohei Kawaguchi
With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given, while maintaining comparable performance with a single-channel input.
no code implementations • 4 Jul 2021 • Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yawen Xue, Yuki Takashima, Yohei Kawaguchi
This makes it possible to produce diarization results for a large number of speakers over the whole recording, even if the number of output speakers for each subsequence is limited.
1 code implementation • 20 Jun 2021 • Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Paola Garcia
Diarization results are then estimated as dot products of the attractors and embeddings.
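The dot-product step is simple enough to sketch in isolation (encoder and attractor computation are omitted; shapes are assumptions for illustration):

import numpy as np

def diarization_posteriors(embeddings, attractors):
    # embeddings: (T, D) frame embeddings; attractors: (S, D), one per speaker.
    # Returns (T, S) per-frame, per-speaker speech activity probabilities.
    logits = embeddings @ attractors.T     # dot products of embeddings and attractors
    return 1.0 / (1.0 + np.exp(-logits))   # independent sigmoids allow overlap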
no code implementations • 9 Jun 2021 • Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola García, Kenji Nagamatsu
To evaluate our proposed method, we conduct model adaptation experiments using labeled and unlabeled data.
no code implementations • 8 Jun 2021 • Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola García, Kenji Nagamatsu
In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND).
no code implementations • 2 Feb 2021 • Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur
This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge.
1 code implementation • 21 Jan 2021 • Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, Kenji Nagamatsu
We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech.
Speaker Diarization
Sound
Audio and Speech Processing
no code implementations • 18 Dec 2020 • Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu
Clustering-based diarization methods partition frames into as many clusters as there are speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to exactly one speaker.
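A toy contrast of the two output styles, with made-up scores: hard assignment forces one speaker per frame, while independent per-speaker decisions can mark overlap.

import numpy as np

frame_scores = np.array([0.9, 0.8])             # both speakers active in this frame

hard_assignment = int(np.argmax(frame_scores))  # clustering-style: speaker 0 only
multi_label = (frame_scores > 0.5).astype(int)  # overlap-aware: array([1, 1])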
no code implementations • SEMEVAL 2020 • Terufumi Morishita, Gaku Morio, Shota Horiguchi, Hiroaki Ozaki, Toshinori Miyoshi
Users of social networking services often share their emotions via multi-modal content, usually images paired with text embedded in them.
no code implementations • 16 Nov 2020 • Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu
Another problem is that offline GSS is an utterance-wise algorithm, so its latency grows with the length of the utterance.
no code implementations • 31 Jul 2020 • Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu
We also showed that our framework achieved a CER of 21.8%, which is only 2.1 percentage points higher than the CER of headset-microphone-based transcription.
Automatic Speech Recognition (ASR)
no code implementations • 4 Jun 2020 • Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu
This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND).
1 code implementation • 2 Jun 2020 • Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu
Speaker diarization is an essential step for processing multi-speaker audio.
3 code implementations • 20 May 2020 • Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper.
no code implementations • 20 Apr 2020 • Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant
Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
1 code implementation • 24 Feb 2020 • Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu
However, the clustering-based approach has a number of problems: (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting its speaker embedding models to real audio recordings with speaker overlaps.
no code implementations • 17 Sep 2019 • Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe
Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from that of TS-ASR given oracle speaker embeddings.
Automatic Speech Recognition (ASR)
2 code implementations • 13 Sep 2019 • Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe
Our method even outperformed the state-of-the-art x-vector clustering-based method.
Ranked #2 on Speaker Diarization on CALLHOME
1 code implementation • 12 Sep 2019 • Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe
To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors without suffering from the speaker-label permutation problem (sketched below).
Ranked #6 on Speaker Diarization on CALLHOME
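A minimal sketch of such a permutation-free objective, assuming the standard formulation of binary cross-entropy minimized over all speaker-column permutations; array shapes are illustrative.

import itertools
import numpy as np

def permutation_free_bce(posteriors, labels, eps=1e-8):
    # posteriors, labels: (T, S) arrays of per-frame speech activities.
    # Returns the minimum BCE over all S! permutations of the label columns,
    # making training invariant to how speakers are ordered in the reference.
    num_speakers = labels.shape[1]
    p = np.clip(posteriors, eps, 1.0 - eps)
    best = np.inf
    for perm in itertools.permutations(range(num_speakers)):
        ref = labels[:, list(perm)]
        bce = -(ref * np.log(p) + (1.0 - ref) * np.log(1.0 - p)).mean()
        best = min(best, bce)
    return best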
no code implementations • 26 Jun 2019 • Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, Shinji Watanabe
In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR).
Automatic Speech Recognition (ASR)
1 code implementation • 29 May 2019 • Naoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, Reinhold Haeb-Umbach
In this paper, we present Hitachi and Paderborn University's joint effort for automatic speech recognition (ASR) in a dinner party scenario.
Automatic Speech Recognition (ASR)
no code implementations • 8 Apr 2018 • Shota Horiguchi, Sosuke Amano, Makoto Ogawa, Kiyoharu Aizawa
In this paper, we address the personalization problem, which involves adapting to the user's domain incrementally using a very limited number of samples.
no code implementations • 29 Dec 2017 • Shota Horiguchi, Daiki Ikami, Kiyoharu Aizawa
However, in these DML studies, there were no fair comparisons between features extracted from a DML-based network and those from a softmax-based network.