no code implementations • 30 May 2023 • Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna
The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved.
1 code implementation • 13 May 2023 • Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Yohei Kawaguchi
We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 2: "First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring".
no code implementations • 3 Mar 2023 • Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani
Experiments show that Miipher (i) is robust against various types of audio degradation and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web.
no code implementations • 3 Oct 2022 • Yuma Koizumi, Kohei Yatabe, Heiga Zen, Michiel Bacchiani
DDPMs and GANs can be characterized by their iterative denoising framework and adversarial training, respectively.
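A minimal sketch of the iterative denoising (ancestral sampling) loop that characterizes DDPMs, with a placeholder network and an illustrative noise schedule; none of the values below come from the paper:

```python
import torch

def eps_model(x_t, t):
    """Placeholder for a trained noise-prediction DNN."""
    return torch.zeros_like(x_t)

T = 50
betas = torch.linspace(1e-4, 0.05, T)        # illustrative noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

x = torch.randn(1, 16000)                    # start from pure Gaussian noise
for t in reversed(range(T)):                 # iterative denoising
    eps = eps_model(x, t)
    mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + torch.sqrt(betas[t]) * noise  # one reverse-diffusion step
```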
2 code implementations • 13 Jun 2022 • Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Masaaki Yamamoto, Yohei Kawaguchi
We present the task description and discussion on the results of the DCASE 2022 Challenge Task 2: "Unsupervised anomalous sound detection (ASD) for machine condition monitoring applying domain generalization techniques".
no code implementations • 26 Apr 2022 • Arun Narayanan, James Walker, Sankaran Panchapagesan, Nathan Howard, Yuma Koizumi
Using neural-network-based acoustic frontends to improve the robustness of streaming automatic speech recognition (ASR) systems is challenging because of causality constraints and the distortion that the frontend processing introduces into speech.
no code implementations • 31 Mar 2022 • Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani
Neural vocoders using denoising diffusion probabilistic models (DDPMs) have been improved by adapting the diffusion noise distribution to given acoustic features.
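As a hedged illustration of adapting the diffusion noise distribution to acoustic features (in the spirit of PriorGrad-style adaptive priors; the paper's exact adaptation may differ), one can scale the Gaussian noise per sample by the conditioning frame energy:

```python
import torch

def feature_adapted_noise(log_mel, hop_length=256, sigma_floor=0.1):
    """Draw diffusion noise whose per-sample standard deviation follows
    the normalized frame energy of the conditioning log-mel spectrogram.
    log_mel: (mel_bins, frames). All names and values are illustrative."""
    energy = log_mel.exp().mean(dim=0)                  # per-frame energy
    sigma = (energy / energy.max()).sqrt().clamp(min=sigma_floor)
    sigma = sigma.repeat_interleave(hop_length)         # frames -> samples
    return sigma * torch.randn_like(sigma)
```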
no code implementations • 1 Nov 2021 • Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani
Furthermore, by analyzing the predicted target SNRi, we observed that the jointly trained network automatically controls the target SNRi according to noise characteristics.
no code implementations • 30 Jun 2021 • Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani
To make the model computationally feasible, we extend the Conformer using linear-complexity attention and stacked 1-D dilated depthwise convolution layers.
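A sketch of the stacked 1-D dilated depthwise convolution layers mentioned above (the linear-complexity attention is omitted); the layer count, kernel size, and residual wiring are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DilatedDepthwiseStack(nn.Module):
    """Stacked 1-D dilated depthwise convolutions: the receptive field
    grows exponentially with depth while cost stays linear in length."""
    def __init__(self, channels, kernel_size=3, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=2 ** i, groups=channels,        # depthwise
                      padding=(kernel_size - 1) * 2 ** i // 2)  # "same" length
            for i in range(num_layers)
        ])

    def forward(self, x):                 # x: (batch, channels, time)
        for conv in self.layers:
            x = x + torch.relu(conv(x))   # residual connection (illustrative)
        return x
```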
4 code implementations • 8 Jun 2021 • Yohei Kawaguchi, Keisuke Imoto, Yuma Koizumi, Noboru Harada, Daisuke Niizumi, Kota Dohi, Ryo Tanabe, Harsh Purohit, Takashi Endo
In 2020, we organized an unsupervised anomalous sound detection (ASD) task, identifying whether a given sound was normal or anomalous without anomalous training data.
1 code implementation • 10 May 2021 • Koichi Saito, Tomohiko Nakamura, Kohei Yatabe, Yuma Koizumi, Hiroshi Saruwatari
Audio source separation is often used as preprocessing of various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the varieties of audio signals.
no code implementations • 21 Jan 2021 • Takuya Fujimura, Yuma Koizumi, Kohei Yatabe, Ryoichi Miyazaki
This requirement currently restricts the amount of training data for speech enhancement to less than 1/1000 of that available for speech recognition, which does not require clean signals.
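A conceptual sketch of how one might train an enhancement model without clean targets, consistent with the setup above: add a second noise sample to an already-noisy utterance and regress back to it. The function names, SNR handling, and loss are assumptions:

```python
import torch

def noisy_target_step(model, noisy_speech, noise_bank, optimizer):
    """One clean-signal-free training step: corrupt the noisy utterance
    further, then predict the original noisy utterance as the target."""
    idx = torch.randint(len(noise_bank), (1,)).item()
    extra = noise_bank[idx][: noisy_speech.shape[-1]]
    loss = torch.mean((model(noisy_speech + extra) - noisy_speech) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```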
no code implementations • 14 Dec 2020 • Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda
Then, the caption of the audio input is generated by using a pre-trained language model while referring to the guidance captions.
no code implementations • 24 Sep 2020 • Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio Kashino
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely data augmentation, multi-task learning, and post-processing, for audio captioning.
no code implementations • 1 Jul 2020 • Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito
TRACKE estimates keywords, which comprise a word set corresponding to audio events/scenes in the input audio, and generates the caption while referring to the estimated keywords to reduce word-selection indeterminacy.
no code implementations • 1 Jul 2020 • Yuma Koizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino
This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning.
Ranked #3 on Audio Captioning on Clotho
no code implementations • 10 Jun 2020 • Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki
In this paper, we instead propose a universal sound selection neural network that directly selects AE sounds from a mixture given user-specified target AE classes.
3 code implementations • 10 Jun 2020 • Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, Toshiki Nakamura, Yuki Nikaido, Ryo Tanabe, Harsh Purohit, Kaori Suefusa, Takashi Endo, Masahiro Yasuda, Noboru Harada
The main challenge of this task is to detect unknown anomalous sounds under the condition that only normal sound samples have been provided as training data.
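A minimal sketch of the standard approach under this constraint, close to common ASD baselines though not necessarily this challenge's exact recipe: train an autoencoder on normal sounds only and use reconstruction error as the anomaly score.

```python
import numpy as np

def anomaly_score(autoencoder, log_mel):
    """Mean squared reconstruction error; `autoencoder` is assumed to
    map a feature array to its reconstruction and to have been trained
    on normal sounds only, so anomalies reconstruct poorly."""
    recon = autoencoder(log_mel)
    return float(np.mean((log_mel - recon) ** 2))

# Declare a clip anomalous when its score exceeds a threshold chosen
# on normal-only validation data (e.g., a high percentile of scores).
```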
no code implementations • 14 Feb 2020 • Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, Daiki Takeuchi
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance.
no code implementations • 14 Feb 2020 • Masaki Kawanaka, Yuma Koizumi, Ryoichi Miyazaki, Kohei Yatabe
For evaluating subjective quality, several methods related to perceptually motivated objective sound quality assessment (OSQA), such as PESQ (perceptual evaluation of speech quality), have been proposed.
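Because metrics such as PESQ are not differentiable, one workaround consistent with this line of work (sketched here with a toy stand-in predictor) is to train a DNN to approximate the OSQA score and then use it, frozen, as a differentiable loss:

```python
import torch
import torch.nn as nn

class StubOSQAPredictor(nn.Module):
    """Hypothetical stand-in for a DNN trained to regress an OSQA
    score (e.g., PESQ) from enhanced/reference pairs."""
    def forward(self, enhanced, reference):
        return -torch.mean((enhanced - reference) ** 2, dim=-1)  # toy proxy

osqa_predictor = StubOSQAPredictor()
for p in osqa_predictor.parameters():
    p.requires_grad_(False)               # frozen surrogate metric

def osqa_loss(enhanced, reference):
    """Train the enhancement DNN to *maximize* the predicted score."""
    return -osqa_predictor(enhanced, reference).mean()
```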
no code implementations • 14 Feb 2020 • Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada
In the proposed method, DNNs estimate phase derivatives instead of phase itself, which allows us to avoid the sensitivity problem.
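One way to turn estimated derivatives back into phase, shown here for the time direction only (the paper combines time- and frequency-direction derivatives): integrate the per-frame phase increments and wrap the result. A simplified sketch:

```python
import numpy as np

def phase_from_time_derivative(inst_freq, init_phase):
    """Integrate the estimated time derivative of the phase
    (per-frame phase increments, in radians) along frames.
    inst_freq: (freq_bins, frames); init_phase: (freq_bins,)."""
    phase = init_phase[:, None] + np.cumsum(inst_freq, axis=1)
    return np.angle(np.exp(1j * phase))  # wrap to (-pi, pi]
```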
1 code implementation • 25 Nov 2019 • Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada
Therefore, some end-to-end methods used a DNN to learn a linear T-F transform, which is much easier to understand.
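For instance, such a learned linear T-F transform can be a plain bias-free convolutional filterbank standing in for the STFT; the sizes below are illustrative, not the paper's:

```python
import torch.nn as nn

# Learnable linear analysis/synthesis filterbanks (no nonlinearity),
# playing the role of the STFT/iSTFT in an end-to-end model.
analysis = nn.Conv1d(1, 256, kernel_size=512, stride=128, bias=False)
synthesis = nn.ConvTranspose1d(256, 1, kernel_size=512, stride=128, bias=False)
# waveform (batch, 1, samples) -> features (batch, 256, frames) -> waveform
```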
no code implementations • 10 Oct 2019 • Luca Mazzon, Yuma Koizumi, Masahiro Yasuda, Noboru Harada
The same transformation is also applied to the labels in order to maintain consistency between the input data and target labels.
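For first-order Ambisonics input, this consistency can be kept by rotating the directional channels and the DOA label with the same matrix. A simplified sketch assuming ACN channel order (W, Y, Z, X) and rotation about the vertical axis:

```python
import numpy as np

def rotate_foa(foa, doa_xyz, yaw):
    """Rotate a first-order Ambisonics clip (4, samples) and its DOA
    label (unit xyz vector) by the same yaw angle."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])      # rotation about the z-axis
    w, y, z, x = foa                     # ACN order: W, Y, Z, X
    xyz = R @ np.stack([x, y, z])
    return np.stack([w, xyz[1], xyz[2], xyz[0]]), R @ doa_xyz
```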
no code implementations • 10 Oct 2019 • Masahiro Yasuda, Yuma Koizumi, Luca Mazzon, Shoichiro Saito, Hisashi Uematsu
We propose a direction of arrival (DOA) estimation method that combines sound-intensity vector (IV)-based DOA estimation and DNN-based denoising and dereverberation.
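The IV-based part can be summarized as follows: under a plane-wave assumption, the time-frequency-averaged intensity vector computed from the FOA STFTs points toward the source. A sketch (the DNN-based denoising and dereverberation that precede it are omitted):

```python
import numpy as np

def iv_doa(W, X, Y, Z):
    """Azimuth/elevation from the sound-intensity vector of FOA STFTs
    (each argument: an array of complex T-F bins)."""
    Ix = np.real(np.conj(W) * X).mean()
    Iy = np.real(np.conj(W) * Y).mean()
    Iz = np.real(np.conj(W) * Z).mean()
    azimuth = np.arctan2(Iy, Ix)
    elevation = np.arctan2(Iz, np.hypot(Ix, Iy))
    return azimuth, elevation
```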
2 code implementations • 9 Aug 2019 • Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Noboru Harada, Keisuke Imoto
To build a large-scale dataset for ADMOS, we collected anomalous operating sounds of miniature machines (toys) by deliberately damaging them.
no code implementations • 19 Jul 2019 • Yuma Koizumi, Shoichiro Saito, Masataka Yamaguchi, Shin Murata, Noboru Harada
The AE is trained to minimize the sample mean of the anomaly score of normal sounds in a mini-batch.
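A sketch of that objective as a PyTorch training step, using squared reconstruction error as the anomaly score; the optimizer and batching details are assumptions:

```python
import torch

def train_step(autoencoder, normal_batch, optimizer):
    """Minimize the sample mean, over a mini-batch of normal sounds,
    of the per-sample anomaly score (squared reconstruction error)."""
    recon = autoencoder(normal_batch)
    score = ((recon - normal_batch) ** 2).mean(dim=1)  # per-sample score
    loss = score.mean()                                # sample mean
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```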
no code implementations • 10 Mar 2019 • Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada
This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN).
no code implementations • 14 Dec 2018 • Masataka Yamaguchi, Yuma Koizumi, Noboru Harada
To address this difficulty, we propose AdaFlow, a new DNN-based density estimator that can be easily adapted to the change of the distribution.
no code implementations • 5 Nov 2018 • Yuma Koizumi, Noboru Harada, Yoichi Haneda
To overcome this problem, we incorporate AWS into the speech enhancement procedure, and the windowing function of each time frame is manipulated using a DNN depending on the input signal.
no code implementations • 22 Oct 2018 • Yuma Koizumi, Kenta Niwa, Yusuke Hioka, Kazunori Kobayashi, Yoichi Haneda
Since OSQA scores are widely used for sound-quality evaluation, training DNNs to increase OSQA scores should produce higher-quality output signals than minimum-MSE training.
1 code implementation • 22 Oct 2018 • Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Yuta Kawachi, Noboru Harada
To calculate the TPR in the objective function, we consider that the set of anomalous sounds is the complementary set of normal sounds and simulate anomalous sounds by using a rejection sampling algorithm.
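A conceptual sketch of that simulation step: draw candidates from a broad proposal distribution and keep only those the normal model deems unlikely, so the accepted samples approximate the complement of the normal set. `normal_logpdf`, `proposal_sampler`, and `threshold` are assumed given:

```python
import numpy as np

def sample_anomalous(normal_logpdf, proposal_sampler, threshold, n=1000):
    """Rejection sampling of pseudo-anomalous feature vectors: accept a
    proposal draw only if its likelihood under the normal model is low."""
    accepted = []
    while len(accepted) < n:
        x = proposal_sampler()
        if normal_logpdf(x) < threshold:  # unlikely to be normal -> keep
            accepted.append(x)
    return np.stack(accepted)
```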