Search Results for author: Yuki Mitsufuji

Found 105 papers, 53 papers with code

Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

no code implementations • 26 Jun 2025 • Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video.

Audio Generation Audio Synthesis +1

Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry

no code implementations • 16 Jun 2025 • Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim, Yuki Mitsufuji

We eliminate the need for extensive 4D training data through a factorized fine-tuning framework that separately trains spatial and temporal components using multi-view image and video data.

Novel View Synthesis

A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

1 code implementation • 26 May 2025 • Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, Yuki Mitsufuji

We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons.

SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

no code implementations • 22 May 2025 • Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji

ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions.

Audio Synthesis

Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior

1 code implementation • 16 May 2025 • Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas

Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track.

Style Transfer

Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

no code implementations • 14 May 2025 • Julian Tanke, Takashi Shibuya, Kengo Uchida, Koichi Saito, Yuki Mitsufuji

Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths.

Mamba Motion Synthesis +1

Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image

1 code implementation • 27 Apr 2025 • Anubhav Jain, Yuya Kobayashi, Naoki Murata, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji, Niv Cohen, Nasir Memon, Julian Togelius

Based on this intuition, we propose an adversarial attack to forge the watermark by introducing perturbations to the images such that we can enter the region of watermarked images.

Adversarial Attack

D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes

no code implementations • 8 Apr 2025 • Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim

We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction.

3D Reconstruction 3D Scene Reconstruction

CARE: Aligning Language Models for Regional Cultural Awareness

1 code implementation • 7 Apr 2025 • Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu

Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge.

Aligning Text-to-Music Evaluation with Human Preferences

1 code implementation • 20 Mar 2025 • Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue

In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems.

FAD

Cross-Modal Learning for Music-to-Music-Video Description Generation

no code implementations • 14 Mar 2025 • Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning.

Video Description Video Generation

Training Consistency Models with Variational Noise Coupling

1 code implementation • 25 Feb 2025 • Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji

Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks.

Image Generation

Supervised contrastive learning from weakly-labeled audio segments for musical version matching

no code implementations • 24 Feb 2025 • Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

Detecting musical versions (different renditions of the same piece) is a challenging task with important applications.

Contrastive Learning Triplet

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

no code implementations • 18 Feb 2025 • Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements.

HumanGif: Single-View Human Diffusion with Generative Prior

1 code implementation • 17 Feb 2025 • Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji

While previous single-view-based 3D human reconstruction methods made significant progress in novel view synthesis, it remains a challenge to synthesize both view-consistent and pose-consistent results for animatable human avatars from a single image input.

3D Human Reconstruction NeRF +1

30+ Years of Source Separation Research: Achievements and Future Challenges

no code implementations • 21 Jan 2025 • Shoko Araki, Nobutaka Ito, Reinhold Haeb-Umbach, Gordon Wichern, Zhong-Qiu Wang, Yuki Mitsufuji

Source separation (SS) of acoustic signals is a research field that emerged in the mid-1990s and has flourished ever since.

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

no code implementations • 15 Jan 2025 • Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao, Yuki Mitsufuji

To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations.

parameter-efficient fine-tuning Tensor Decomposition

CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

no code implementations • 6 Jan 2025 • Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji

Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information.

Audio Generation Contrastive Learning

TraSCE: Trajectory Steering for Concept Erasure

1 code implementation • 10 Dec 2024 • Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji

To address this issue, we first propose a modification of conventional negative prompting.

Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

no code implementations • 30 Nov 2024 • Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov

Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications.

Image Restoration

OpenMU: Your Swiss Army Knife for Music Understanding

2 code implementations • 21 Oct 2024 • Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji

We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music.

Improving Vector-Quantized Image Modeling with Latent Consistency-Matching Diffusion

no code implementations • 18 Oct 2024 • Bac Nguyen, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji

By embedding discrete representations into a continuous latent space, we can leverage continuous-space latent diffusion models to handle generative modeling of discrete data.

Conditional Image Generation Machine Translation +1

Distillation of Discrete Diffusion through Dimensional Correlations

1 code implementation • 11 Oct 2024 • Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

Diffusion models have demonstrated exceptional performances in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature.

Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

no code implementations • 10 Oct 2024 • Yong-Hyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji

Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables.

Text Generation

G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving

no code implementations • 9 Oct 2024 • Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon, Yuki Mitsufuji

Recent literature has effectively utilized diffusion models trained on continuous variables as priors for solving inverse problems.

Image Generation Motion Generation

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

1 code implementation • 8 Oct 2024 • M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass

In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of text prompts preferred by the downstream VLM.

zero-shot-classification Zero-Shot Learning

Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

no code implementations • 1 Oct 2024 • Saurav Jha, Shiqi Yang, Masato Ishii, Mengjie Zhao, Christian Simon, Muhammad Jehanzeb Mirza, Dong Gong, Lina Yao, Shusuke Takahashi, Yuki Mitsufuji

Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images.

Continual Learning

A Survey on Diffusion Models for Inverse Problems

no code implementations • 30 Sep 2024 • Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Jong Chul Ye, Peyman Milanfar, Alexandros G. Dimakis, Mauricio Delbracio

Diffusion models have become increasingly popular for generative modeling due to their ability to generate high-quality samples.

Image Restoration Survey

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

1 code implementation • 26 Sep 2024 • Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

In CMC-PE, cross-modal information is embedded as if it represents temporal position information, and the embeddings are fed into the model like positional encoding.

Inductive Bias Video Generation

DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

no code implementations • 20 Aug 2024 • Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present.

Attribute Disentanglement

Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio

1 code implementation • 19 Jul 2024 • Roser Batlle-Roca, Wei-Hsiang Liao, Xavier Serra, Yuki Mitsufuji, Emilia Gómez

Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management.

Management Music Generation

ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark

no code implementations • 17 Jun 2024 • Hiromi Wakaki, Yuki Mitsufuji, Yoshinori Maeda, Yukiko Nishimura, Silin Gao, Mengjie Zhao, Keiichi Yamada, Antoine Bosselut

We propose a new benchmark, ComperDial, which facilitates the training and evaluation of evaluation metrics for open-domain dialogue systems.

MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

no code implementations • 4 Jun 2024 • Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Shusuke Takahashi, Yuki Mitsufuji

For high-quality and fast generation, we employ a variational autoencoder and latent diffusion model, and improve the performance with adversarial training.

Motion Generation Motion Synthesis

SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

1 code implementation • 28 May 2024 • Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

To address these issues, we introduce Sound Consistency Trajectory Models (SoundCTM), which allow flexible transitions between high-quality 1-step sound generation and superior sound quality through multi-step deterministic sampling.

AudioCaps Audio Generation +1

MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

1 code implementation • 28 May 2024 • Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module to adjust scores separately estimated by the base models to match the score of joint distribution over audio and video.

Video Generation

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

2 code implementations • 28 May 2024 • Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Recent advances in text-to-music editing, which employ text queries to modify music (e.g., by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation.

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

1 code implementation • 27 May 2024 • Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji

In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models.

Diversity Monocular Depth Estimation +1

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

no code implementations • 28 Mar 2024 • Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter

Prompt engineering is effective for controlling the output of text-to-image (T2I) generative models, but it is also laborious due to the need for manually crafted prompts.

In-Context Learning Language Modeling +5

MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage

1 code implementation • 15 Mar 2024 • Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji

This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model.

Music Transcription

DiffuCOMET: Contextual Commonsense Knowledge Diffusion

1 code implementation • 26 Feb 2024 • Silin Gao, Mete Ismayilzada, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

Inferring contextually-relevant and diverse commonsense to understand narratives remains challenging for knowledge models.

Diversity

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

1 code implementation • 9 Feb 2024 • Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

This paper introduces a novel approach to the editing of music generated by such models, enabling the modification of specific attributes, such as genre, mood and instrument, while maintaining other aspects unchanged.

Music Generation Text-to-Music Generation

Manifold Preserving Guided Diffusion

no code implementations • 28 Nov 2023 • Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon

Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training.

Conditional Image Generation

Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

no code implementations • 2 Oct 2023 • Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi Wakaki, Yuki Mitsufuji

In this paper, we focus on the widespread reporting bias in visual-language datasets, embodied as the object-attribute association, which can subsequently degrade models trained on them.

Attribute Object

Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

2 code implementations • 1 Oct 2023 • Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon

Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed.

 Ranked #1 on Image Generation on ImageNet 64x64 (NFE metric)

Denoising Image Generation

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

no code implementations • 27 Sep 2023 • Frank Cwitkowitz, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama, Wei-Hsiang Liao, Yuki Mitsufuji

Several works have explored multi-instrument transcription as a means to bolster the performance of models on low-resource tasks, but these methods face the same data availability issues.

Music Transcription

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

3 code implementations • 6 Sep 2023 • Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task.

Generative Adversarial Network Speech Synthesis

Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

no code implementations • 5 Sep 2023 • Eleonora Grassucci, Yuki Mitsufuji, Ping Zhang, Danilo Comminiello

Semantic communication is poised to play a pivotal role in shaping the landscape of future AI-driven communication systems.

Semantic Communication

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

1 code implementation • NeurIPS 2023 • Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker.

Sound Event Localization and Detection

On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization

no code implementations • 1 Jun 2023 • Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji, Stefano Ermon

The emergence of various notions of "consistency" in diffusion models has garnered considerable attention and helped achieve improved sample quality, likelihood estimation, and accelerated sampling.

The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network

1 code implementation • 13 May 2023 • Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information.

Music Source Separation

Diffusion-based Signal Refiner for Speech Separation

no code implementations • 10 May 2023 • Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

We experimentally show that our refiner can provide a clearer harmonic structure of speech and improve the reference-free metric of perceptual quality for arbitrary preceding model architectures.

Denoising Speech Enhancement +1

PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives

1 code implementation • 3 May 2023 • Silin Gao, Beatriz Borges, Soyoung Oh, Deniz Bayazit, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

They must also learn to maintain consistent speaker personas for themselves throughout the narrative, so that their counterparts feel involved in a realistic conversation or story.

Knowledge Graphs World Knowledge

Cross-modal Face- and Voice-style Transfer

no code implementations • 27 Feb 2023 • Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively.

Diversity Image-to-Image Translation +4

SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

1 code implementation • 30 Jan 2023 • Yuhta Takida, Masaaki Imaizumi, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji

Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives.

Image Generation

GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

1 code implementation • 30 Jan 2023 • Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements.

Denoising Image Deblurring

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

1 code implementation • 14 Dec 2022 • Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick

Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence.

Unsupervised vocal dereverberation with diffusion-based generative models

no code implementations • 8 Nov 2022 • Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui, Yuki Mitsufuji

Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations.

Diversity

Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

1 code implementation • 4 Nov 2022 • Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Kyogu Lee, Yuki Mitsufuji

We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song.

Contrastive Learning Disentanglement +2

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

1 code implementation • 27 Oct 2022 • Ryosuke Sawata, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs.

Denoising Speech Enhancement

ComFact: A Benchmark for Linking Contextual Commonsense Knowledge

1 code implementation • 23 Oct 2022 • Silin Gao, Jena D. Hwang, Saya Kanno, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut

Understanding rich narratives, such as dialogues and stories, often requires natural language processing systems to access relevant knowledge from commonsense knowledge graphs.

Knowledge Graphs Response Generation +1

Robust One-Shot Singing Voice Conversion

no code implementations • 20 Oct 2022 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

We then propose a two-stage training method called Robustify that trains the one-shot SVC model in the first stage on clean data to ensure high-quality conversion, and introduces enhancement modules to the encoders of the model in the second stage to enhance the feature extraction from distorted singing voices.

Voice Conversion

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

no code implementations • 14 Oct 2022 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain.

FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation

1 code implementation • 9 Oct 2022 • Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

Score-based generative models (SGMs) learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise.

Denoising

Automatic music mixing with deep learning and out-of-domain data

1 code implementation • 24 Aug 2022 • Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Stefan Uhlich, Chihiro Nagashima, Yuki Mitsufuji

Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge (e.g., a mixing engineer).

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

2 code implementations • 4 Jun 2022 • Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

Additionally, the report presents the baseline system that accompanies the dataset in the challenge with emphasis on the differences with the baseline of the previous iterations; namely, introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format.

Sound Event Localization and Detection

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

1 code implementation • 16 May 2022 • Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji

In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE).

Quantization

Distortion Audio Effects: Learning How to Recover the Clean Signal

no code implementations • 3 Feb 2022 • Johannes Imort, Giorgio Fabbro, Marco A. Martínez Ramírez, Stefan Uhlich, Yuichiro Koyama, Yuki Mitsufuji

Given the recent advances in music source separation and automatic mixing, removing audio effects in music tracks is a meaningful step toward developing an automated remixing system.

Music Source Separation

Music Demixing Challenge 2021

1 code implementation • 31 Aug 2021 • Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter, Alexandre Défossez, Minseok Kim, Woosung Choi, Chin-Yun Yu, Kin-Wai Cheuk

The main differences compared with the past challenges are 1) the competition is designed to more easily allow machine learning practitioners from other disciplines to participate, 2) evaluation is done on a hidden test set created by music professionals dedicated exclusively to the challenge to assure the transparency of the challenge, i.e., the test set is not accessible to anyone except the challenge organizers, and 3) the dataset provides a wider range of music genres and involves a greater number of mixing engineers.

Music Source Separation

Densely Connected Multi-Dilated Convolutional Networks for Dense Prediction Tasks

1 code implementation • CVPR 2021 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of a dense simultaneous modeling of multiresolution representation and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).

Audio Source Separation Semantic Segmentation

Training Speech Enhancement Systems with Noisy Speech Datasets

no code implementations • 26 May 2021 • Koichi Saito, Stefan Uhlich, Giorgio Fabbro, Yuki Mitsufuji

Furthermore, we propose a noise augmentation scheme for mixture-invariant training (MixIT), which allows using it also in such scenarios.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Preventing Oversmoothing in VAE via Generalized Variance Parameterization

no code implementations • 17 Feb 2021 • Yuhta Takida, Wei-Hsiang Liao, Chieh-Hsin Lai, Toshimitsu Uesaka, Shusuke Takahashi, Yuki Mitsufuji

Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative.

Decoder

Hierarchical disentangled representation learning for singing voice conversion

no code implementations • 18 Jan 2021 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Conventional singing voice conversion (SVC) methods often struggle to operate on high-resolution audio owing to the high dimensionality of the data.

Representation Learning Voice Conversion

AR-ELBO: Preventing Posterior Collapse Induced by Oversmoothing in Gaussian VAE

no code implementations • 1 Jan 2021 • Yuhta Takida, Wei-Hsiang Liao, Toshimitsu Uesaka, Shusuke Takahashi, Yuki Mitsufuji

Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative.

Densely connected multidilated convolutional networks for dense prediction tasks

1 code implementation • 21 Nov 2020 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of a dense simultaneous modeling of multiresolution representation and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).

Audio Source Separation Music Source Separation +1

All for One and One for All: Improving Music Separation by Bridging Networks

5 code implementations • 8 Oct 2020 • Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes.

All Music Source Separation

Adversarial attacks on audio source separation

no code implementations • 7 Oct 2020 • Naoya Takahashi, Shota Inoue, Yuki Mitsufuji

Despite the excellent performance of neural-network-based audio source separation methods and their wide range of applications, their robustness against intentional attacks has been largely neglected.

Adversarial Attack Audio Source Separation

D3Net: Densely connected multidilated DenseNet for music source separation

1 code implementation • 5 Oct 2020 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of a rapid growth of a receptive field and a simultaneous modeling of multi-resolution data in a single convolution layer, and propose a novel CNN architecture called densely connected dilated DenseNet (D3Net).

Ranked #12 on Music Source Separation on MUSDB18 (using extra training data)

Music Source Separation

Improving Voice Separation by Incorporating End-to-end Speech Recognition

1 code implementation • 29 Nov 2019 • Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji

Despite recent advances in voice separation methods, many challenges remain in realistic scenarios such as noisy recording and the limits of available data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation

1 code implementation • 7 May 2018 • Naoya Takahashi, Nabarun Goswami, Yuki Mitsufuji

Deep neural networks have become an indispensable technique for audio source separation (ASS).

Ranked #17 on Music Source Separation on MUSDB18 (using extra training data)

Music Source Separation Sound Audio and Speech Processing
