no code implementations • 29 Jan 2017 • Eita Nakamura, Kazuyoshi Yoshii, Shigeki Sagayama
In a recent conference paper, we reported a rhythm transcription method based on a merged-output hidden Markov model (HMM) that explicitly describes the multiple-voice structure of polyphonic music.
no code implementations • 23 Mar 2017 • Eita Nakamura, Kazuyoshi Yoshii, Simon Dixon
This paper presents a statistical method for use in music transcription that can estimate score times of note onsets and offsets from polyphonic MIDI performance signals.
no code implementations • 7 Aug 2017 • Hiroaki Tsushima, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii
Generative statistical models of chord sequences play crucial roles in music processing.
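For concreteness, here is a minimal sketch of one such generative model, a first-order Markov chain over chord symbols; the vocabulary and transition probabilities below are invented for illustration and are not taken from the paper.

```python
import numpy as np

# A toy first-order Markov model of chord sequences.
# The vocabulary and transition probabilities are illustrative only.
chords = ["C", "F", "G", "Am"]
# transition[i, j] = P(next chord = chords[j] | current chord = chords[i])
transition = np.array([
    [0.1, 0.3, 0.4, 0.2],   # from C
    [0.4, 0.1, 0.3, 0.2],   # from F
    [0.5, 0.2, 0.1, 0.2],   # from G
    [0.3, 0.4, 0.2, 0.1],   # from Am
])

rng = np.random.default_rng(0)
state = 0  # start on C
sequence = [chords[state]]
for _ in range(7):
    state = rng.choice(len(chords), p=transition[state])
    sequence.append(chords[state])
print(" -> ".join(sequence))
```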
no code implementations • 31 Oct 2017 • Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara
This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech.
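A minimal sketch of how a VAE can act as a speech prior in this setting, assuming an additive power-spectrum observation model and a fixed noise PSD; the decoder, shapes, and inference loop are placeholders, and the paper's actual inference procedure may differ.

```python
import torch

# Sketch: a trained decoder maps a latent z to a clean-speech power
# spectrum; noise has a fixed power spectrum. We MAP-estimate z under
# the noisy observation, then apply a Wiener-like mask.
F, T, D = 257, 100, 16            # freq bins, frames, latent dim
decoder = torch.nn.Sequential(    # stand-in for a trained VAE decoder
    torch.nn.Linear(D, 128), torch.nn.Tanh(),
    torch.nn.Linear(128, F), torch.nn.Softplus())

noisy_power = torch.rand(T, F) + 0.1   # |observed STFT|^2 (placeholder)
noise_power = torch.full((T, F), 0.1)  # noise PSD (placeholder model)

z = torch.zeros(T, D, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(200):
    speech_power = decoder(z)                      # prior-constrained speech PSD
    total = speech_power + noise_power             # additive power model
    # negative log-likelihood of the noisy power under a circular-Gaussian
    # observation model, plus a standard-normal prior on z
    nll = (noisy_power / total + torch.log(total)).sum() + 0.5 * (z ** 2).sum()
    opt.zero_grad(); nll.backward(); opt.step()

wiener = decoder(z) / (decoder(z) + noise_power)   # time-frequency mask
enhanced_power = wiener * noisy_power
```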
no code implementations • 15 Aug 2018 • Eita Nakamura, Kazuyoshi Yoshii
We present a statistical-modelling method for piano reduction, i.e., converting an ensemble score into a piano score, that can control performance difficulty.
no code implementations • 8 Mar 2019 • Aditya Arie Nugraha, Kouhei Sekiguchi, Kazuyoshi Yoshii
To improve the consistency of the phase values in the time-frequency domain, we also apply the von Mises distribution to the phase derivatives, i.e., the group delay and the instantaneous frequency.
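As a rough illustration of the quantities involved (not the paper's exact formulation), the group delay and instantaneous frequency can be computed as wrapped phase differences along the frequency and time axes and then scored under von Mises distributions; the phase data, means, and concentration below are placeholders.

```python
import numpy as np
from scipy.stats import vonmises

rng = np.random.default_rng(0)
phase = rng.uniform(-np.pi, np.pi, size=(257, 100))  # STFT phase (placeholder)

def wrap(x):
    """Wrap angles to (-pi, pi]."""
    return np.angle(np.exp(1j * x))

group_delay = wrap(np.diff(phase, axis=0))  # phase difference along frequency
inst_freq   = wrap(np.diff(phase, axis=1))  # phase difference along time

kappa = 5.0                   # concentration (assumed hyperparameter)
mu_gd, mu_if = 0.0, 0.0       # means (placeholders; model-dependent)
loglik = (vonmises.logpdf(group_delay, kappa, loc=mu_gd).sum()
          + vonmises.logpdf(inst_freq, kappa, loc=mu_if).sum())
print(loglik)
```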
2 code implementations • European Association for Signal Processing (EUSIPCO) 2019 • Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
A popular approach to multichannel source separation is to integrate a spatial model with a source model for estimating the spatial covariance matrices (SCMs) and power spectral densities (PSDs) of each sound source in the time-frequency domain.
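A minimal sketch of this covariance model with invented shapes: each source contributes an NMF-parameterized PSD multiplied by a per-frequency SCM, and the per-bin mixture covariance sums over sources.

```python
import numpy as np

N, F, T, M, K = 2, 64, 50, 4, 8   # sources, freqs, frames, mics, NMF bases
rng = np.random.default_rng(0)

# NMF source model: PSD_n(f, t) = sum_k W_n(f, k) H_n(k, t)
W = rng.random((N, F, K))
H = rng.random((N, K, T))
psd = np.einsum("nfk,nkt->nft", W, H)

# Spatial model: one Hermitian positive-semidefinite SCM per source and frequency
A = rng.random((N, F, M, M)) + 1j * rng.random((N, F, M, M))
scm = A @ A.conj().transpose(0, 1, 3, 2)

# Mixture covariance at each bin: Sigma(f, t) = sum_n psd[n, f, t] * scm[n, f]
sigma = np.einsum("nft,nfij->ftij", psd, scm)
print(sigma.shape)  # (F, T, M, M)
```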
no code implementations • 22 Mar 2019 • Kazuki Shimada, Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara
To solve this problem, we take an unsupervised approach that decomposes each TF bin into the sum of speech and noise by using multichannel nonnegative matrix factorization (MNMF).
Automatic Speech Recognition (ASR) +2
no code implementations • 23 Apr 2019 • Eita Nakamura, Yasuyuki Saito, Kazuyoshi Yoshii
We find that the methods based on high-order HMMs outperform the other methods in terms of estimation accuracies.
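A standard way to implement a high-order HMM, and a plausible reading of "high-order" here, is to augment states into tuples so that ordinary first-order algorithms apply; below is a sketch for the second-order case with placeholder probabilities.

```python
import numpy as np
from itertools import product

S = 3                                   # original number of states
rng = np.random.default_rng(0)
# second-order transitions: A2[i, j, k] = P(s_t = k | s_{t-2} = i, s_{t-1} = j)
A2 = rng.dirichlet(np.ones(S), size=(S, S))

pairs = list(product(range(S), repeat=2))   # augmented states (i, j)
A1 = np.zeros((S * S, S * S))               # first-order equivalent
for a, (i, j) in enumerate(pairs):
    for b, (jj, k) in enumerate(pairs):
        if jj == j:                          # pairs must chain: (i,j) -> (j,k)
            A1[a, b] = A2[i, j, k]
print(A1.sum(axis=1))                        # each row sums to 1
```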
no code implementations • 18 Aug 2019 • Eita Nakamura, Kazuyoshi Yoshii
Focusing on rhythm, we formulate several classes of Bayesian Markov models of musical scores that describe repetitions indirectly using the sparse transition probabilities of notes or note patterns.
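A toy sketch of the sparsity mechanism with invented hyperparameters: drawing transition rows from a Dirichlet prior with a small concentration puts most of the mass on a few successors, so generated sequences naturally repeat a handful of patterns.

```python
import numpy as np

V, alpha = 20, 0.05               # note-pattern vocabulary, sparsity level
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.full(V, alpha), size=V)  # one sparse row per pattern

state, seq = 0, [0]
for _ in range(30):
    state = rng.choice(V, p=trans[state])
    seq.append(state)
print(seq)  # only a few distinct patterns recur
```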
no code implementations • 29 Aug 2019 • Yoshiaki Bando, Yoko Sasaki, Kazuyoshi Yoshii
This paper presents an unsupervised method that trains neural source separation by using only multichannel mixture signals.
1 code implementation • IEEE/ACM Transactions on Audio, Speech, and Language Processing 2019 • Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Kazuyoshi Yoshii, Tatsuya Kawahara
To solve this problem, we replace a low-rank speech model with a deep generative speech model, i.e., we formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of the spatial characteristics of speech and noise.
1 code implementation • 12 Nov 2019 • Tristan Carsault, Andrew McLeod, Philippe Esling, Jérôme Nika, Eita Nakamura, Kazuyoshi Yoshii
In this paper, we postulate that this comes from the multi-scale structure of musical information and propose new architectures based on an iterative temporal aggregation of input labels.
no code implementations • 8 Apr 2020 • Takayuki Nakatsuka, Kazuyoshi Yoshii, Yuki Koyama, Satoru Fukayama, Masataka Goto, Shigeo Morishima
Specifically, we formulate a hierarchical generative model of poses and images by integrating a deep generative model of poses from pose features with that of images from poses and image features.
1 code implementation • 14 May 2020 • Yiming Wu, Tristan Carsault, Eita Nakamura, Kazuyoshi Yoshii
In contrast, we propose a unified generative and discriminative approach in the framework of amortized variational inference.
1 code implementation • 27 Aug 2020 • Jeongwoo Woo, Masato Mimura, Kazuyoshi Yoshii, Tatsuya Kawahara
The time-domain separation method outperformed a frequency-domain separation method, which reuses the phase information of the input mixture signal, both in simple cascading and joint training settings.
Audio and Speech Processing
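A minimal sketch of the phase-reuse step in the frequency-domain baseline mentioned above, using placeholder audio and a placeholder mask: the estimated magnitude is combined with the phase of the input mixture before resynthesis.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
mixture = rng.standard_normal(fs)            # 1 s of placeholder audio

f, t, X = stft(mixture, fs=fs, nperseg=512)
mask = rng.uniform(0.0, 1.0, size=X.shape)   # stand-in for an estimated mask
separated_mag = mask * np.abs(X)
# key step: combine the estimated magnitude with the *mixture* phase
Y = separated_mag * np.exp(1j * np.angle(X))
_, y = istft(Y, fs=fs, nperseg=512)
```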
1 code implementation • 30 Sep 2020 • Andrew McLeod, James Owers, Kazuyoshi Yoshii
To that end, MDTK includes a script that measures the distribution of different types of errors in a transcription, and creates a degraded dataset with similar properties.
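A hypothetical sketch of the degradation idea, not MDTK's actual API: perturb a note list with error types drawn at rates that could be matched to a measured error distribution. All function names and probabilities below are invented.

```python
import random

def degrade(notes, p_pitch=0.05, p_shift=0.05, p_delete=0.02, seed=0):
    """Apply transcription-style errors to (pitch, onset, duration) tuples."""
    rng = random.Random(seed)
    out = []
    for pitch, onset, dur in notes:
        if rng.random() < p_delete:
            continue                             # simulate a missed note
        if rng.random() < p_pitch:
            pitch += rng.choice([-1, 1])         # semitone error
        if rng.random() < p_shift:
            onset += rng.uniform(-0.05, 0.05)    # 50 ms timing jitter
        out.append((pitch, onset, dur))
    return out

clean = [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 1.0)]
print(degrade(clean))
```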
no code implementations • 8 Oct 2020 • Ryoto Ishizuka, Ryo Nishikimi, Eita Nakamura, Kazuyoshi Yoshii
This paper describes a neural drum transcription method that detects the onset times of drums in music signals at the tatum level, where tatum times are assumed to be estimated in advance.
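A minimal sketch of the tatum-level assumption with invented times: detected onsets are assigned to the nearest tatum of a precomputed grid.

```python
import numpy as np

tatum_times = np.arange(0.0, 4.0, 0.25)            # a 0.25 s tatum grid
onsets = np.array([0.02, 0.51, 0.98, 1.27, 2.49])  # detected onsets (s)

indices = np.abs(onsets[:, None] - tatum_times[None, :]).argmin(axis=1)
print(tatum_times[indices])  # -> [0. 0.5 1. 1.25 2.5]
```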
no code implementations • 12 May 2021 • Ryoto Ishizuka, Ryo Nishikimi, Kazuyoshi Yoshii
To mitigate the difficulty of training the self-attention-based model from an insufficient amount of paired data, and to improve the musical naturalness of the estimated scores, we propose a regularized training method. It uses a global structure-aware masked language (score) model with a self-attention mechanism, pretrained on an extensive collection of drum scores.
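A rough sketch of the masked-language-model pretraining idea, with an invented vocabulary and a generic Transformer encoder standing in for the paper's architecture: random tatum positions are masked and the model is trained to fill them in.

```python
import torch

V, L, D = 16, 64, 32           # drum-pattern vocabulary, tatums, embedding dim
MASK = V                       # extra token id used for masking
model = torch.nn.Sequential(
    torch.nn.Embedding(V + 1, D),
    torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), 2),
    torch.nn.Linear(D, V))

score = torch.randint(0, V, (1, L))        # a drum score as token ids
mask = torch.rand(1, L) < 0.15             # mask 15% of tatum positions
inputs = score.masked_fill(mask, MASK)
logits = model(inputs)
loss = torch.nn.functional.cross_entropy(
    logits[mask], score[mask])             # predict only masked positions
loss.backward()
```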
no code implementations • 11 May 2022 • Mathieu Fontaine, Kouhei Sekiguchi, Aditya Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
This paper describes heavy-tailed extensions of a state-of-the-art versatile blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) from a unified point of view.
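A one-line illustration of why heavy tails help, with invented values: a Student's t likelihood penalizes outliers far less severely than a Gaussian, which makes the model robust to impulsive sounds; as the degrees of freedom grow, the t distribution recovers the Gaussian.

```python
import numpy as np
from scipy.stats import norm, t

x = np.array([0.1, 0.5, 8.0])   # the last sample is an outlier
print(norm.logpdf(x))            # Gaussian: the outlier is ruinous
print(t.logpdf(x, df=2))         # t with 2 dof: a much milder penalty
```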
1 code implementation • 15 Jul 2022 • Kouhei Sekiguchi, Aditya Arie Nugraha, Yicheng Du, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations held in real noisy, echoic environments (e.g., a cocktail party).
no code implementations • 15 Jul 2022 • Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
Ranked #1 on Speech Enhancement on EasyCom (SDR metric)
Automatic Speech Recognition (ASR) +4
no code implementations • 22 Jul 2022 • Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii
Our DNN-free system leverages the posteriors of the latest source spectrograms given by block-online FastMNMF to derive the current source covariance matrices for frame-online beamforming.
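A minimal sketch of frame-online beamforming from estimated covariance matrices, here an MVDR beamformer with placeholder statistics; the paper derives these statistics from block-online FastMNMF posteriors rather than from random data.

```python
import numpy as np

M = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_noise = A @ A.conj().T + np.eye(M)       # noise SCM (placeholder, PSD)
d = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # steering vector

Rinv_d = np.linalg.solve(R_noise, d)
w = Rinv_d / (d.conj() @ Rinv_d)           # MVDR weights: R^{-1}d / (d^H R^{-1} d)
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # one frame, one bin
y = w.conj() @ x                            # beamformer output
```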
no code implementations • 8 May 2023 • Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii
We address the problem of accurately interpolating measured anechoic steering vectors with a deep learning framework called the neural field.
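A plausible sketch of such a neural field with an assumed parameterization, an MLP from direction and frequency to the real and imaginary parts of an M-channel steering vector; the paper's architecture and input encoding may differ, and the measurements below are placeholders.

```python
import torch

M = 8
field = torch.nn.Sequential(
    torch.nn.Linear(3, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 2 * M))            # real + imaginary parts

# placeholder measurements: (azimuth, elevation, frequency) -> steering vector
coords = torch.rand(1024, 3)
targets = torch.randn(1024, 2 * M)

opt = torch.optim.Adam(field.parameters(), lr=1e-3)
for _ in range(100):
    loss = torch.nn.functional.mse_loss(field(coords), targets)
    opt.zero_grad(); loss.backward(); opt.step()

# query an unmeasured direction
pred = field(torch.tensor([[0.3, 0.1, 0.5]]))
steering = torch.complex(pred[:, :M], pred[:, M:])
```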
no code implementations • 17 Jun 2023 • Yoshiaki Bando, Yoshiki Masuyama, Aditya Arie Nugraha, Kazuyoshi Yoshii
Our neural separation model introduced for AVI alternately performs neural network blocks and single steps of an efficient iterative algorithm called iterative source steering.
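For reference, a sketch of a single iterative source steering (ISS) update for one frequency bin, following the published rank-1 update of Scheibler and Ono (2020); the data and power estimates are placeholders, and in practice the powers r come from the source model and are refreshed between iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 100
# current source estimates (sources x frames) and demixing matrix for one bin
y = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
W = np.eye(N, dtype=complex)
r = np.abs(y) ** 2 + 1e-6                  # source power estimates (placeholder)

for k in range(N):                         # steer against each source in turn
    v = ((y / r * y[k].conj()).sum(axis=1)
         / ((np.abs(y[k]) ** 2) / r).sum(axis=1))
    v[k] = 1.0 - 1.0 / np.sqrt((np.abs(y[k]) ** 2 / r[k]).mean())
    y -= np.outer(v, y[k])                 # rank-1 update of the estimates
    W -= np.outer(v, W[k])                 # same update applied to W
```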