Search Results for author: Naoya Takahashi

Found 26 papers, 14 papers with code

The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation

1 code implementation • 13 May 2023 • Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths between the output instruments so that they can share information.
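The bridging idea can be sketched as follows. This is an illustrative toy, not the paper's actual architecture: the function name, the use of plain feature vectors, and the choice of averaging the other branches are all assumptions.

```python
# Hypothetical sketch of "bridging paths" between per-instrument branches:
# each branch's intermediate feature is augmented with the element-wise mean
# of the other branches' features before its output layer.

def bridge(features):
    """features: dict mapping instrument name -> feature vector (list of floats).
    Returns bridged features where each branch also sees the mean of the others."""
    names = list(features)
    bridged = {}
    for name in names:
        others = [features[o] for o in names if o != name]
        # element-wise mean over the other branches
        mean_other = [sum(vals) / len(others) for vals in zip(*others)]
        bridged[name] = [x + m for x, m in zip(features[name], mean_other)]
    return bridged

feats = {"vocals": [1.0, 2.0], "drums": [3.0, 4.0], "bass": [5.0, 6.0]}
out = bridge(feats)
print(out["vocals"])  # → [5.0, 7.0]: vocals plus mean of drums and bass
```

In a real MSS network the bridged quantity would be a learned combination of deep feature maps rather than a raw mean, but the information flow between otherwise independent instrument branches is the same.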

Music Source Separation

MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation

1 code implementation • 7 May 2018 • Naoya Takahashi, Nabarun Goswami, Yuki Mitsufuji

Deep neural networks have become an indispensable technique for audio source separation (ASS).

Ranked #17 on Music Source Separation on MUSDB18 (using extra training data)

Music Source Separation · Sound · Audio and Speech Processing

D3Net: Densely connected multidilated DenseNet for music source separation

1 code implementation • 5 Oct 2020 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of rapidly growing the receptive field and simultaneously modeling multi-resolution data in a single convolution layer, and propose a novel CNN architecture called densely connected dilated DenseNet (D3Net).
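The core idea of combining several resolutions in a single layer can be sketched with a toy 1-D version. This is a minimal pure-Python illustration under my own simplifying assumptions (shared kernel across dilations, summed outputs), not the D3Net implementation:

```python
def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1-D dilated convolution with zero padding (toy sketch)."""
    center = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - center) * dilation  # dilation spreads the taps apart
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def multidilated_layer(x, kernel, dilations=(1, 2, 4)):
    """One 'multidilated' layer: parallel dilated convolutions summed, so a
    single layer sees several resolutions at once (illustrative assumption)."""
    outs = [dilated_conv1d(x, kernel, d) for d in dilations]
    return [sum(vals) for vals in zip(*outs)]

x = [0.0] * 7
x[3] = 1.0  # unit impulse
print(multidilated_layer(x, [1.0, 1.0, 1.0]))  # → [0.0, 1.0, 1.0, 3.0, 1.0, 1.0, 0.0]
```

Feeding an impulse through the layer shows the point: the response covers near taps (dilation 1) and far taps (dilations 2 and 4) at the same time, so the receptive field grows rapidly without stacking layers.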

Ranked #12 on Music Source Separation on MUSDB18 (using extra training data)

Music Source Separation

Densely connected multidilated convolutional networks for dense prediction tasks

1 code implementation • 21 Nov 2020 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of densely and simultaneously modeling multiresolution representations, and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).

Audio Source Separation · Music Source Separation +1

Densely Connected Multi-Dilated Convolutional Networks for Dense Prediction Tasks

1 code implementation • CVPR 2021 • Naoya Takahashi, Yuki Mitsufuji

In this paper, we claim the importance of densely and simultaneously modeling multiresolution representations, and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).

Audio Source Separation · Semantic Segmentation

AENet: Learning Deep Audio Features for Video Analysis

1 code implementation • 3 Jan 2017 • Naoya Takahashi, Michael Gygli, Luc van Gool

Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection.

Action Recognition · Data Augmentation +4

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

2 code implementations • 4 Jun 2022 • Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on the differences from the baselines of the previous iterations; namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format.
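To make the multi-ACCDOA idea concrete: in an ACCDOA output, each class is predicted as a Cartesian vector whose norm encodes activity and whose direction encodes the direction of arrival; multi-ACCDOA predicts several such track vectors per class, so two simultaneous events of the same class decode to two DOAs. The sketch below is an illustrative decoder under my own assumptions (the 0.5 activity threshold and function names are not from the paper):

```python
import math

def decode_accdoa(vec, threshold=0.5):
    """Decode one ACCDOA vector: the vector norm is read as class activity,
    the unit direction as DOA. The threshold value is an assumption."""
    norm = math.sqrt(sum(v * v for v in vec))
    active = norm > threshold
    doa = [v / norm for v in vec] if norm > 0 else [0.0, 0.0, 0.0]
    return active, doa

def decode_multi_accdoa(track_vecs, threshold=0.5):
    """Multi-ACCDOA: several tracks per class; return the DOA of each
    track whose activity exceeds the threshold."""
    events = []
    for vec in track_vecs:
        active, doa = decode_accdoa(vec, threshold)
        if active:
            events.append(doa)
    return events

# Two tracks for one class: only the first is active (norm 1.0 vs 0.1).
print(decode_multi_accdoa([[0.8, 0.0, 0.6], [0.1, 0.0, 0.0]]))
```

The real baseline additionally resolves track permutations during training (auxiliary duplicating permutation-invariant training), which this sketch omits.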

Sound Event Localization and Detection

End-to-end Lyrics Recognition with Voice to Singing Style Transfer

1 code implementation • 17 Feb 2021 • Sakya Basak, Shrutina Agarwal, Sriram Ganapathy, Naoya Takahashi

This approach, called voice to singing (V2S), performs the voice style conversion by modulating the F0 contour of the natural speech with that of a singing voice.
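The F0-modulation step can be sketched in miniature. This is a hand-rolled illustration under stated assumptions (linear resampling of per-frame F0 contours and median-based rescaling; the paper's actual conversion is performed by a neural model, and the function names here are invented):

```python
def resample_contour(contour, n):
    """Linearly resample a per-frame F0 contour to n frames."""
    if n == 1:
        return [contour[0]]
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def v2s_f0_transplant(speech_f0, singing_f0):
    """Toy V2S-style step: impose the singing F0 contour shape on the speech,
    rescaled so the median pitch matches the original speaker's."""
    shape = resample_contour(singing_f0, len(speech_f0))
    med_speech = sorted(speech_f0)[len(speech_f0) // 2]
    med_sing = sorted(shape)[len(shape) // 2]
    scale = med_speech / med_sing
    return [f * scale for f in shape]

# Flat-ish speech contour takes on the singing contour's rise and fall.
print(v2s_f0_transplant([100.0, 110.0, 120.0, 110.0, 100.0],
                        [200.0, 300.0, 200.0]))
```

The point of the sketch is the data-augmentation angle: speech with a singing-style F0 contour can stand in for scarce sung lyrics data when training the recognizer.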

Data Augmentation · Language Modelling +2

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

1 code implementation • 14 Dec 2022 • Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick

Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence.

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

1 code implementation • NeurIPS 2023 • Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker.
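As background for how DOA falls out of multichannel audio: for a two-microphone array, the inter-channel time delay fixes the arrival angle. The sketch below is a minimal stand-in (brute-force cross-correlation instead of the GCC-PHAT typically used, with assumed sample rate and mic spacing), not anything from the dataset's baseline:

```python
import math

def tdoa_samples(sig_a, sig_b, max_lag):
    """Estimate the delay (in samples) of sig_b relative to sig_a by
    brute-force cross-correlation over lags in [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def doa_from_tdoa(lag, fs=16000, mic_dist=0.1, c=343.0):
    """Convert a sample delay into a broadside arrival angle (degrees) for a
    2-mic array; fs, mic_dist, and c are illustrative values."""
    tau = lag / fs
    s = max(-1.0, min(1.0, c * tau / mic_dist))
    return math.degrees(math.asin(s))
```

The audio-visual angle of STARSS23 is then intuitive: the visually perceptible source object (e.g., the walker's feet) offers an independent cue for the same angle that the array geometry recovers acoustically.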

Sound Event Localization and Detection

Improving Voice Separation by Incorporating End-to-end Speech Recognition

1 code implementation • 29 Nov 2019 • Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji

Despite recent advances in voice separation methods, many challenges remain in realistic scenarios such as noisy recording and the limits of available data.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +3

Automatic Pronunciation Generation by Utilizing a Semi-supervised Deep Neural Networks

no code implementations • 15 Jun 2016 • Naoya Takahashi, Tofigh Naghibi, Beat Pfister

Phonemic or phonetic sub-word units are the most commonly used atomic elements to represent speech signals in modern ASRs.

speech-recognition · Speech Recognition

Adversarial attacks on audio source separation

no code implementations • 7 Oct 2020 • Naoya Takahashi, Shota Inoue, Yuki Mitsufuji

Despite the excellent performance of neural-network-based audio source separation methods and their wide range of applications, their robustness against intentional attacks has been largely neglected.
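The attack setting can be sketched with a fast-gradient-sign-style perturbation on a toy "separator". Everything here is an assumption for illustration: the real attacks in the paper differentiate through a DNN separator, whereas this toy applies a fixed element-wise mask so the gradient can be written by hand:

```python
def fgsm_attack(x, mask, target, eps=0.01):
    """FGSM-style attack sketch on a toy separator y = mask * x (element-wise).
    Loss = MSE(mask * x, target); d(loss)/dx_i = 2 * m_i * (m_i*x_i - t_i) / n.
    Perturbing along the gradient sign maximizes the separation error."""
    n = len(x)
    grad = [2.0 * m * (m * xi - t) / n for xi, m, t in zip(x, mask, target)]
    sign = [1.0 if g > 0 else (-1.0 if g < 0 else 0.0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]

x_adv = fgsm_attack([1.0, -1.0], [1.0, 1.0], [0.0, 0.0], eps=0.01)
print(x_adv)  # → [1.01, -1.01]: each sample nudged to increase the loss
```

The perturbation is bounded by eps per sample, which is why such attacks can be nearly inaudible while still degrading the separated output.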

Adversarial Attack · Audio Source Separation

Hierarchical disentangled representation learning for singing voice conversion

no code implementations • 18 Jan 2021 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Conventional singing voice conversion (SVC) methods often struggle to operate on high-resolution audio owing to the high dimensionality of the data.

Representation Learning · Voice Conversion

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

no code implementations • 14 Oct 2022 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain.

Robust One-Shot Singing Voice Conversion

no code implementations • 20 Oct 2022 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

We then propose a two-stage training method called Robustify, which trains the one-shot SVC model on clean data in the first stage to ensure high-quality conversion, and in the second stage introduces enhancement modules into the model's encoders to improve feature extraction from distorted singing voices.

Voice Conversion

Nonparallel Emotional Voice Conversion For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing

no code implementations • 21 Feb 2023 • Nirmesh Shah, Mayank Kumar Singh, Naoya Takahashi, Naoyuki Onoe

The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another without modifying the linguistic content of the signal.

Voice Conversion

Cross-modal Face- and Voice-style Transfer

no code implementations • 27 Feb 2023 • Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics, such as the pose in an image and the linguistic content in audio, respectively.

Image-to-Image Translation · Open-Ended Question Answering +3

Iteratively Improving Speech Recognition and Voice Conversion

no code implementations • 24 May 2023 • Mayank Kumar Singh, Naoya Takahashi, Naoyuki Onoe

Many existing works on voice conversion (VC) tasks use automatic speech recognition (ASR) models for ensuring linguistic consistency between source and converted samples.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +3
