For the refiner, we train a diffusion-based generative model on a dataset consisting of clean speech only.
In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT).
2 code implementations • 4 Jun 2022 • Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen
Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on the differences from the baselines of previous iterations: namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format.
Ranked #1 on Sound Event Localization and Detection on STARSS22
1 code implementation • 16 May 2022 • Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji
In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE).
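As a rough illustration of the stochastic quantization idea (a minimal sketch, not the paper's implementation: the codebook, temperature value, and sampling routine below are illustrative assumptions), a latent vector can be assigned to a codebook entry by sampling from a softmax over negative squared distances, rather than by the deterministic nearest-neighbour lookup of a standard VQ-VAE:

```python
import math
import random

def stochastic_quantize(z, codebook, temperature=1.0, rng=random):
    """Sample a codebook index k with probability softmax(-||z - e_k||^2 / T).

    At low temperature this approaches VQ-VAE's deterministic
    nearest-neighbour assignment; at high temperature it is near-uniform.
    """
    dists = [sum((zi - ei) ** 2 for zi, ei in zip(z, e)) for e in codebook]
    logits = [-d / temperature for d in dists]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Inverse-CDF sampling of the categorical distribution.
    u, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if u <= acc:
            return k, probs
    return len(probs) - 1, probs

# Example: a 2-entry codebook; z is much closer to entry 0.
k, probs = stochastic_quantize(
    [0.1, 0.0], [[0.0, 0.0], [1.0, 1.0]],
    temperature=0.01, rng=random.Random(0))
```

With a small temperature the nearest entry dominates the sampling distribution, which is the limit in which the stochastic scheme recovers hard quantization.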
The multi-ACCDOA format (a class- and track-wise output format) enables the model to handle overlapping events of the same class.
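To make the class- and track-wise layout concrete, here is a minimal decoding sketch (the track/class counts, threshold, and function name are illustrative assumptions, not from the paper): each (track, class) slot holds a 3-D Cartesian vector whose norm encodes activity and whose direction encodes the DOA, so two events of the same class can coexist on different tracks:

```python
import math

N_TRACKS, N_CLASSES = 3, 13  # illustrative sizes only

def decode_multi_accdoa(output, threshold=0.5):
    """Decode a multi-ACCDOA-style output of shape (tracks, classes, 3).

    A (track, class) slot is active when its vector norm exceeds the
    threshold; the normalized vector is the estimated DOA.
    """
    events = []
    for t, track in enumerate(output):
        for c, vec in enumerate(track):
            norm = math.sqrt(sum(v * v for v in vec))
            if norm > threshold:
                events.append((t, c, tuple(v / norm for v in vec)))
    return events

# Two simultaneous events of the SAME class (class 0) on separate tracks:
zero = [0.0, 0.0, 0.0]
out = [[list(zero) for _ in range(N_CLASSES)] for _ in range(N_TRACKS)]
out[0][0] = [1.0, 0.0, 0.0]  # class 0, track 0, pointing along +x
out[1][0] = [0.0, 0.8, 0.0]  # class 0, track 1, pointing along +y

events = decode_multi_accdoa(out)
```

Because the overlap is resolved by tracks rather than by classes, both class-0 events survive decoding, which is exactly the case a single class-wise output head cannot represent.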
Data augmentation methods have shown great importance in diverse supervised learning problems where labeled data is scarce or costly to obtain.
In order to optimize the DNN-based SE model in terms of the character error rate (CER), which is one of the metrics used to evaluate ASR systems and is generally non-differentiable, our method uses two DNNs: one for speech processing and one for mimicking the output CERs derived through an acoustic model (AM).
This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference.
This paper presents a new deep clustering (DC) method called manifold-aware DC (M-DC) that can enhance hyperspace utilization more effectively than the original DC.
Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative.
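Posterior collapse can be stated quantitatively: for a diagonal-Gaussian posterior, the KL term of the VAE objective has a closed form, and collapse corresponds to this term going to zero for every input, i.e. q(z|x) ≈ N(0, I) regardless of x. A minimal sketch of the per-example KL (standard formula, not tied to any specific paper above):

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.

    Per dimension: 0.5 * (exp(logvar) + mu^2 - 1 - logvar).
    """
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

# A collapsed posterior matches the prior exactly, so its KL is zero
# and the latent code carries no information about the input:
collapsed_kl = gaussian_kl([0.0, 0.0], [0.0, 0.0])
informative_kl = gaussian_kl([1.0, 0.0], [0.0, 0.0])
```

A near-zero KL across the whole dataset is therefore a standard diagnostic for an uninformative latent space.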
Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target.
This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes.
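The idea of a multi-domain loss can be sketched as a weighted sum of a time-domain error and a spectral-magnitude error on the same pair of signals (a minimal illustration, assuming MSE in both domains and a naive DFT; the weights and exact loss terms are assumptions, not the paper's definition):

```python
import cmath

def dft_mag(x):
    """Magnitude spectrum via a naive DFT (O(n^2), for illustration only)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def multi_domain_loss(est, ref, w_time=1.0, w_spec=1.0):
    """Combine a waveform-domain MSE and a magnitude-spectrum MSE."""
    time_term = sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref)
    est_m, ref_m = dft_mag(est), dft_mag(ref)
    spec_term = sum((e - r) ** 2 for e, r in zip(est_m, ref_m)) / len(ref_m)
    return w_time * time_term + w_spec * spec_term

ref = [0.0, 1.0, 0.0, -1.0]          # toy reference signal
perfect = multi_domain_loss(ref, ref)
silent = multi_domain_loss([0.0] * 4, ref)
```

Supervising in both domains penalizes errors that one domain alone under-weights: here the all-zero estimate is punished for both its sample-wise deviation and its missing spectral energy.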
Ranked #18 on Music Source Separation on MUSDB18