Search Results for author: Yuki Mitsufuji

Found 64 papers, 33 papers with code

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

no code implementations • 28 May 2024 • Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner.

Paper
Add Code

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

1 code implementation • 28 May 2024 • Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Recent advances in text-to-music editing, which employ text queries to modify music (e. g.\ by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation.

Paper
Code

Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

no code implementations • 28 May 2024 • Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

We theoretically show that this guidance can be computed through the gradient of the optimal discriminator distinguishing real audio-video pairs from fake ones independently generated by the base models.

Paper
Add Code

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

no code implementations • 27 May 2024 • Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji

In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models.

Paper
Add Code

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

no code implementations • 23 May 2024 • Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

The recent audio-visual generation methods usually resort to huge large language model or composable diffusion models.

Audio Generation Denoising +2

Paper
Add Code

PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

no code implementations • 23 May 2024 • Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM.

Ranked #1 on Image Generation on ImageNet 32x32

Decoder Image Generation

Paper
Add Code

Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information

no code implementations • 30 Apr 2024 • Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji

Multimodal representation learning to integrate different modalities, such as text, vision, and audio is important for real-world applications.

Classification Contrastive Learning +2

Paper
Add Code

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

no code implementations • 28 Mar 2024 • Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter

Prompt engineering is effective for controlling the output of text-to-image (T2I) generative models, but it is also laborious due to the need for manually crafted prompts.

In-Context Learning Language Modelling +3

Paper
Add Code

MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage

1 code implementation • 15 Mar 2024 • Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji

This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model.

Music Transcription

Paper
Code

DiffuCOMET: Contextual Commonsense Knowledge Diffusion

1 code implementation • 26 Feb 2024 • Silin Gao, Mete Ismayilzada, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

Inferring contextually-relevant and diverse commonsense to understand narratives remains challenging for knowledge models.

Paper
Code

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

no code implementations • 9 Feb 2024 • Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

This paper introduces a novel approach to the editing of music generated by such models, enabling the modification of specific attributes, such as genre, mood and instrument, while maintaining other aspects unchanged.

Music Generation Text-to-Music Generation

Paper
Add Code

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

no code implementations • 31 Dec 2023 • Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations.

Quantization Representation Learning

Paper
Add Code

Manifold Preserving Guided Diffusion

no code implementations • 28 Nov 2023 • Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon

Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training.

Conditional Image Generation

Paper
Add Code

On the Language Encoder of Contrastive Cross-modal Models

no code implementations • 20 Oct 2023 • Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks.

Sentence Sentence Embedding +1

Paper
Add Code

Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

no code implementations • 2 Oct 2023 • Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi Wakaki, Yuki Mitsufuji

In this paper, we focus on the wide existence of reporting bias in visual-language datasets, embodied as the object-attribute association, which can subsequentially degrade models trained on them.

Attribute Object

Paper
Add Code

Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

1 code implementation • 1 Oct 2023 • Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon

Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed.

Ranked #3 on Image Generation on CIFAR-10

Denoising Image Generation

186

Paper
Code

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

no code implementations • 27 Sep 2023 • Frank Cwitkowitz, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama, Wei-Hsiang Liao, Yuki Mitsufuji

Several works have explored multi-instrument transcription as a means to bolster the performance of models on low-resource tasks, but these methods face the same data availability issues.

Music Transcription

Paper
Add Code

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

no code implementations • 13 Sep 2023 • Carlos Hernandez-Olivan, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramirez, Wei-Hsiang Liao, Yuki Mitsufuji

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation.

Bandwidth Extension

Paper
Add Code

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

2 code implementations • 6 Sep 2023 • Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task.

Ranked #2 on Speech Synthesis on LibriTTS

Generative Adversarial Network Speech Synthesis

184

Paper
Code

Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

no code implementations • 5 Sep 2023 • Eleonora Grassucci, Yuki Mitsufuji, Ping Zhang, Danilo Comminiello

Semantic communication is poised to play a pivotal role in shaping the landscape of future AI-driven communication systems.

Paper
Add Code

The Sound Demixing Challenge 2023 $\unicode{x2013}$ Music Demixing Track

2 code implementations • 14 Aug 2023 • Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, WeiHsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, Jiafeng Liu, Yuki Mitsufuji

We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding.

Music Source Separation

524

Paper
Code

The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track

1 code implementation • 14 Aug 2023 • Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

A significant source of this improvement was making the simulated data better match real cinematic audio, which we further investigate in detail.

Paper
Code

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

1 code implementation • 10 Jul 2023 • Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

This is especially helpful when determining the precise onset and offset for each note in the polyphonic piano content.

Decoder Music Transcription

Paper
Code

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

1 code implementation • NeurIPS 2023 • Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e. g., sounds of footsteps come from the feet of a walker.

Sound Event Localization and Detection

Paper
Code

On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization

no code implementations • 1 Jun 2023 • Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji, Stefano Ermon

The emergence of various notions of ``consistency'' in diffusion models has garnered considerable attention and helped achieve improved sample quality, likelihood estimation, and accelerated sampling.

Paper
Add Code

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

no code implementations • 18 May 2023 • Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

At the decoded feature level, we fuse the two decoded features by generative and predictive decoders.

Decoder Speech Enhancement

Paper
Add Code

The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation

1 code implementation • 13 May 2023 • Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

We modify the target network, i. e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information.

Music Source Separation

2,139

Paper
Code

Diffusion-based Signal Refiner for Speech Separation

no code implementations • 10 May 2023 • Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

We experimentally show that our refiner can provide a clearer harmonic structure of speech and improves the reference-free metric of perceptual quality for arbitrary preceding model architectures.

Denoising Speech Enhancement +1

Paper
Add Code

PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives

1 code implementation • 3 May 2023 • Silin Gao, Beatriz Borges, Soyoung Oh, Deniz Bayazit, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

They must also learn to maintain consistent speaker personas for themselves throughout the narrative, so that their counterparts feel involved in a realistic conversation or story.

Knowledge Graphs World Knowledge

Paper
Code

Cross-modal Face- and Voice-style Transfer

no code implementations • 27 Feb 2023 • Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively.

Image-to-Image Translation Open-Ended Question Answering +3

Paper
Add Code

GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

1 code implementation • 30 Jan 2023 • Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements.

Blind Image Deblurring Denoising +1

Paper
Code

SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

1 code implementation • 30 Jan 2023 • Yuhta Takida, Masaaki Imaizumi, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji

Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives.

Ranked #1 on Image Generation on FFHQ 1024 x 1024

Image Generation

Paper
Code

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

1 code implementation • 14 Dec 2022 • Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick

Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence.

Paper
Code

Unsupervised vocal dereverberation with diffusion-based generative models

no code implementations • 8 Nov 2022 • Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui, Yuki Mitsufuji

Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations.

Paper
Add Code

Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

1 code implementation • 4 Nov 2022 • Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Kyogu Lee, Yuki Mitsufuji

We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song.

Contrastive Learning Disentanglement +2

150

Paper
Code

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

1 code implementation • 27 Oct 2022 • Ryosuke Sawata, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs.

Denoising Speech Enhancement

Paper
Code

ComFact: A Benchmark for Linking Contextual Commonsense Knowledge

1 code implementation • 23 Oct 2022 • Silin Gao, Jena D. Hwang, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

Understanding rich narratives, such as dialogues and stories, often requires natural language processing systems to access relevant knowledge from commonsense knowledge graphs.

Knowledge Graphs Response Generation +1

Paper
Code

Robust One-Shot Singing Voice Conversion

no code implementations • 20 Oct 2022 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

We then propose a two-stage training method called Robustify that train the one-shot SVC model in the first stage on clean data to ensure high-quality conversion, and introduces enhancement modules to the encoders of the model in the second stage to enhance the feature extraction from distorted singing voices.

Voice Conversion

Paper
Add Code

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

no code implementations • 14 Oct 2022 • Naoya Takahashi, Mayank Kumar, Singh, Yuki Mitsufuji

Recent progress in deep generative models has improved the quality of neural vocoders in speech domain.

Paper
Add Code

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

no code implementations • 11 Oct 2022 • Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji

In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT).

Music Transcription

Paper
Add Code

FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation

1 code implementation • 9 Oct 2022 • Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

Score-based generative models (SGMs) learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise.

Denoising

Paper
Code

Automatic music mixing with deep learning and out-of-domain data

1 code implementation • 24 Aug 2022 • Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Stefan Uhlich, Chihiro Nagashima, Yuki Mitsufuji

Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge (e. g., a mixing engineer).

Paper
Code

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

2 code implementations • 4 Jun 2022 • Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

Additionally, the report presents the baseline system that accompanies the dataset in the challenge with emphasis on the differences with the baseline of the previous iterations; namely, introduction of the multi-ACCDOA representation to handle multiple simultaneous occurences of events of the same class, and support for additional improved input features for the microphone array format.

Ranked #1 on Sound Event Localization and Detection on STARSS22

Sound Event Localization and Detection

Paper
Code

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

1 code implementation • 16 May 2022 • Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji

In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE).

Quantization

166

Paper
Code

Distortion Audio Effects: Learning How to Recover the Clean Signal

no code implementations • 3 Feb 2022 • Johannes Imort, Giorgio Fabbro, Marco A. Martínez Ramírez, Stefan Uhlich, Yuichiro Koyama, Yuki Mitsufuji

Given the recent advances in music source separation and automatic mixing, removing audio effects in music tracks is a meaningful step toward developing an automated remixing system.

Music Source Separation

Paper
Add Code

Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

2 code implementations • 14 Oct 2021 • Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, Yuki Mitsufuji

The multi- ACCDOA format (a class- and track-wise output format) enables the model to solve the cases with overlaps from the same class.

Sound Event Localization and Detection

Paper
Code

Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks

no code implementations • 13 Oct 2021 • Bo-Yu Chen, Wei-Han Hsu, Wei-Hsiang Liao, Marco A. Martínez Ramírez, Yuki Mitsufuji, Yi-Hsuan Yang

A central task of a Disc Jockey (DJ) is to create a mixset of mu-sic with seamless transitions between adjacent tracks.

Generative Adversarial Network

Paper
Add Code

Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection

1 code implementation • 12 Oct 2021 • Ricardo Falcon-Perez, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

Data augmentation methods have shown great importance in diverse supervised learning problems where labeled data is scarce or costly to obtain.

Data Augmentation Sound Event Localization and Detection

Paper
Code

Music Demixing Challenge 2021

1 code implementation • 31 Aug 2021 • Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter, Alexandre Défossez, Minseok Kim, Woosung Choi, Chin-Yun Yu, Kin-Wai Cheuk

The main differences compared with the past challenges are 1) the competition is designed to more easily allow machine learning practitioners from other disciplines to participate, 2) evaluation is done on a hidden test set created by music professionals dedicated exclusively to the challenge to assure the transparency of the challenge, i. e., the test set is not accessible from anyone except the challenge organizers, and 3) the dataset provides a wider range of music genres and involved a greater number of mixing engineers.

Music Source Separation

Paper
Code

Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

no code implementations • 21 Jun 2021 • Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji

This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference.

Data Augmentation Sound Event Localization and Detection

Paper
Add Code

Densely Connected Multi-Dilated Convolutional Networks for Dense Prediction Tasks

1 code implementation • CVPR 2021 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of a dense simultaneous modeling of multiresolution representation and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).

Audio Source Separation Semantic Segmentation

338

Paper
Code

Training Speech Enhancement Systems with Noisy Speech Datasets

no code implementations • 26 May 2021 • Koichi Saito, Stefan Uhlich, Giorgio Fabbro, Yuki Mitsufuji

Furthermore, we propose a noise augmentation scheme for mixture-invariant training (MixIT), which allows using it also in such scenarios.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Preventing Oversmoothing in VAE via Generalized Variance Parameterization

no code implementations • 17 Feb 2021 • Yuhta Takida, Wei-Hsiang Liao, Chieh-Hsin Lai, Toshimitsu Uesaka, Shusuke Takahashi, Yuki Mitsufuji

Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative.

Decoder

Paper
Add Code

Hierarchical disentangled representation learning for singing voice conversion

no code implementations • 18 Jan 2021 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Conventional singing voice conversion (SVC) methods often suffer from operating in high-resolution audio owing to a high dimensionality of data.

Representation Learning Voice Conversion

Paper
Add Code

AR-ELBO: Preventing Posterior Collapse Induced by Oversmoothing in Gaussian VAE

no code implementations • 1 Jan 2021 • Yuhta Takida, Wei-Hsiang Liao, Toshimitsu Uesaka, Shusuke Takahashi, Yuki Mitsufuji

Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon that the learned latent space becomes uninformative.

Paper
Add Code

Densely connected multidilated convolutional networks for dense prediction tasks

1 code implementation • 21 Nov 2020 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of a dense simultaneous modeling of multiresolution representation and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).

Ranked #47 on Semantic Segmentation on Cityscapes test

Audio Source Separation Music Source Separation +1

338

Paper
Code

ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

2 code implementations • 29 Oct 2020 • Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji

Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target.

Event Detection Sound Event Detection +1

Paper
Code

All for One and One for All: Improving Music Separation by Bridging Networks

5 code implementations • 8 Oct 2020 • Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes.

Ranked #21 on Music Source Separation on MUSDB18

Music Source Separation

2,139

Paper
Code

Adversarial attacks on audio source separation

no code implementations • 7 Oct 2020 • Naoya Takahashi, Shota Inoue, Yuki Mitsufuji

Despite the excellent performance of neural-network-based audio source separation methods and their wide range of applications, their robustness against intentional attacks has been largely neglected.

Adversarial Attack Audio Source Separation

Paper
Add Code

D3Net: Densely connected multidilated DenseNet for music source separation

1 code implementation • 5 Oct 2020 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of a rapid growth of a receptive field and a simultaneous modeling of multi-resolution data in a single convolution layer, and propose a novel CNN architecture called densely connected dilated DenseNet (D3Net).

Ranked #12 on Music Source Separation on MUSDB18 (using extra training data)

Music Source Separation

338

Paper
Code

Improving Voice Separation by Incorporating End-to-end Speech Recognition

1 code implementation • 29 Nov 2019 • Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji

Despite recent advances in voice separation methods, many challenges remain in realistic scenarios such as noisy recording and the limits of available data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3