1 code implementation • NeurIPS 2023 • Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji
While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded by a microphone array, sound events usually originate from visually perceptible source objects, e.g., the sound of footsteps comes from the feet of a walker.
no code implementations • 24 May 2023 • Mayank Kumar Singh, Naoya Takahashi, Onoe Naoyuki
Many existing works on voice conversion (VC) tasks use automatic speech recognition (ASR) models for ensuring linguistic consistency between source and converted samples.
Automatic Speech Recognition (ASR) +3
1 code implementation • 13 May 2023 • Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji
We modify the target network, i.e., the architecture of the original DNN-based MSS model, by adding bridging paths for each output instrument to share information across instruments.
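The bridging idea can be illustrated with a minimal sketch (my own toy formulation, not the paper's exact architecture): each instrument branch produces an intermediate feature map, and a shared path feeds aggregated information from all branches back into each branch.

```python
import numpy as np

def bridged_features(features_per_source):
    """Toy sketch of inter-instrument bridging: average the intermediate
    features of all instrument branches and add the shared context back
    into each branch.  The aggregation choice (mean) is an assumption."""
    feats = [np.asarray(f, dtype=float) for f in features_per_source]
    bridge = np.mean(feats, axis=0)        # information shared across instruments
    return [f + bridge for f in feats]     # each branch refined with shared context

# Example: two instrument branches with 4-dimensional features
refined = bridged_features([np.ones(4), np.zeros(4)])
```

In a real MSS network the bridge would connect learned feature maps inside the architecture; here it only shows how every output instrument ends up seeing the others' information.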
no code implementations • 27 Feb 2023 • Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji
Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics, such as the pose in an image and the linguistic content in audio, respectively.
no code implementations • 21 Feb 2023 • Nirmesh Shah, Mayank Kumar Singh, Naoya Takahashi, Naoyuki Onoe
The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another without modifying the linguistic content of the signal.
1 code implementation • 14 Dec 2022 • Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence.
no code implementations • 20 Oct 2022 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji
We then propose a two-stage training method called Robustify that trains the one-shot SVC model on clean data in the first stage to ensure high-quality conversion, and in the second stage introduces enhancement modules into the model's encoders to improve feature extraction from distorted singing voices.
no code implementations • 14 Oct 2022 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji
Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain.
no code implementations • 11 Oct 2022 • Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji
In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT).
no code implementations • 26 Aug 2022 • Shrutina Agarwal, Sriram Ganapathy, Naoya Takahashi
In this paper, we propose a model to perform style transfer of speech to singing voice.
2 code implementations • 4 Jun 2022 • Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen
Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on the differences from the baselines of previous iterations: namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format.
Ranked #1 on Sound Event Localization and Detection on STARSS22
2 code implementations • 14 Oct 2021 • Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, Yuki Mitsufuji
The multi-ACCDOA format (a class- and track-wise output format) enables the model to handle cases with overlapping events from the same class.
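A hedged sketch of how such an output could be decoded (the tensor layout and threshold are my assumptions, not the paper's exact values): each class and track gets a Cartesian vector whose norm acts as the event activity and whose direction gives the DOA, so two tracks can carry two simultaneous events of the same class.

```python
import numpy as np

def decode_multi_accdoa(output, threshold=0.5):
    """Decode a (tracks, classes, 3) multi-ACCDOA-style tensor: a vector
    norm above the threshold means the event is active, and the normalized
    vector is its DOA.  Illustrative sketch only."""
    tracks, classes, _ = output.shape
    detections = []
    for t in range(tracks):
        for c in range(classes):
            vec = output[t, c]
            norm = np.linalg.norm(vec)
            if norm > threshold:
                detections.append((t, c, vec / norm))  # unit DOA vector
    return detections

# Two tracks carrying the same class from different directions ->
# two detections of class 0, which a single-track format could not represent
out = np.zeros((2, 2, 3))
out[0, 0] = [1.0, 0.0, 0.0]
out[1, 0] = [0.0, 0.8, 0.0]
dets = decode_multi_accdoa(out)
```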
no code implementations • 21 Jun 2021 • Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji
This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference.
1 code implementation • CVPR 2021 • Naoya Takahashi, Yuki Mitsufuji
In this paper, we claim the importance of dense, simultaneous modeling of multiresolution representations and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).
1 code implementation • 17 Feb 2021 • Sakya Basak, Shrutina Agarwal, Sriram Ganapathy, Naoya Takahashi
This approach, called voice to singing (V2S), performs the voice style conversion by modulating the F0 contour of the natural speech with that of a singing voice.
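The F0-modulation step can be sketched as follows (a simplified illustration, not the paper's exact signal-processing pipeline): the singing F0 contour is time-stretched to the speech contour's length and substituted on voiced frames, while unvoiced frames (F0 = 0) stay unvoiced.

```python
import numpy as np

def modulate_f0(speech_f0, singing_f0):
    """Replace the F0 contour of speech with a time-aligned singing F0
    contour.  Linear interpolation for alignment is an assumption."""
    speech_f0 = np.asarray(speech_f0, dtype=float)
    singing_f0 = np.asarray(singing_f0, dtype=float)
    # Stretch the singing contour onto the speech contour's time axis
    x_new = np.linspace(0.0, 1.0, len(speech_f0))
    x_old = np.linspace(0.0, 1.0, len(singing_f0))
    aligned = np.interp(x_new, x_old, singing_f0)
    voiced = speech_f0 > 0
    return np.where(voiced, aligned, 0.0)

# Speech contour with one unvoiced frame, modulated by a rising singing contour
f0 = modulate_f0([120, 118, 0, 121], [220, 230, 240])
```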
no code implementations • 18 Jan 2021 • Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji
Conventional singing voice conversion (SVC) methods often struggle to operate on high-resolution audio owing to the high dimensionality of the data.
1 code implementation • 21 Nov 2020 • Naoya Takahashi, Yuki Mitsufuji
In this paper, we claim the importance of dense, simultaneous modeling of multiresolution representations and propose a novel CNN architecture called densely connected multidilated DenseNet (D3Net).
Ranked #48 on Semantic Segmentation on Cityscapes test
2 code implementations • 29 Oct 2020 • Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target.
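The unified alternative to the two-branch design can be sketched in one line (my own minimal illustration of the activity-coupled idea): the regression target per class is the unit DOA vector scaled by the event activity, so a single output carries both the detection and the localization.

```python
import numpy as np

def accdoa_target(activity, doa_unit):
    """Sketch of an activity-coupled Cartesian DOA target: one vector per
    class replaces separate SED and DOA branches."""
    return float(activity) * np.asarray(doa_unit, dtype=float)

# An active event (activity 1) arriving from the x axis
target = accdoa_target(1.0, [1.0, 0.0, 0.0])
# An inactive event collapses to the zero vector
silent = accdoa_target(0.0, [0.0, 1.0, 0.0])
```

The vector norm then recovers the activity and the direction recovers the DOA, which is what lets one network head serve both targets.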
no code implementations • 7 Oct 2020 • Naoya Takahashi, Shota Inoue, Yuki Mitsufuji
Despite the excellent performance of neural-network-based audio source separation methods and their wide range of applications, their robustness against intentional attacks has been largely neglected.
1 code implementation • 5 Oct 2020 • Naoya Takahashi, Yuki Mitsufuji
In this paper, we claim the importance of rapid growth of the receptive field and simultaneous modeling of multi-resolution data in a single convolution layer, and propose a novel CNN architecture called densely connected dilated DenseNet (D3Net).
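A toy 1-D sketch of the multidilation idea (shapes, zero-padding, and the per-branch layout are illustrative assumptions, not the paper's exact design): applying the same layer's kernels with different dilation factors lets one layer cover several resolutions at once, growing the receptive field quickly.

```python
import numpy as np

def multidilated_conv1d(x, kernels, dilations):
    """Apply each kernel with its own dilation and stack the results, so a
    single layer produces per-dilation (multi-resolution) feature maps.
    Zero-padding keeps the output the same length as the input."""
    x = np.asarray(x, dtype=float)
    outs = []
    for k, d in zip(kernels, dilations):
        k = np.asarray(k, dtype=float)
        y = np.zeros_like(x)
        half = (len(k) // 2) * d           # padding grows with the dilation
        padded = np.pad(x, half)
        for i in range(len(x)):
            for j, w in enumerate(k):
                y[i] += w * padded[i + j * d]
        outs.append(y)
    return np.stack(outs)                  # (branches, time)

# The same difference kernel at dilations 1 and 2: the second branch sees
# a twice-as-wide neighborhood from the same layer
out = multidilated_conv1d(np.arange(8.0), [[1, 0, -1], [1, 0, -1]], [1, 2])
```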
Ranked #12 on Music Source Separation on MUSDB18 (using extra training data)
1 code implementation • 29 Nov 2019 • Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji
Despite recent advances in voice separation methods, many challenges remain in realistic scenarios, such as noisy recordings and the limited availability of data.
Automatic Speech Recognition (ASR) +3
1 code implementation • 7 May 2018 • Naoya Takahashi, Nabarun Goswami, Yuki Mitsufuji
Deep neural networks have become an indispensable technique for audio source separation (ASS).
Ranked #17 on Music Source Separation on MUSDB18 (using extra training data)
Music Source Separation Sound Audio and Speech Processing
5 code implementations • 29 Jun 2017 • Naoya Takahashi, Yuki Mitsufuji
This paper deals with the problem of audio source separation.
1 code implementation • 3 Jan 2017 • Naoya Takahashi, Michael Gygli, Luc van Gool
Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection.
no code implementations • 15 Jun 2016 • Naoya Takahashi, Tofigh Naghibi, Beat Pfister
Phonemic or phonetic sub-word units are the most commonly used atomic elements to represent speech signals in modern ASR systems.
no code implementations • 25 Apr 2016 • Naoya Takahashi, Michael Gygli, Beat Pfister, Luc van Gool
We propose a novel method for Acoustic Event Detection (AED).
Sound Multimedia