no code implementations • 28 Feb 2024 • Chang-Bin Jeon, Gordon Wichern, François G. Germain, Jonathan Le Roux
In music source separation, a standard training data augmentation procedure is to create new training samples by randomly combining instrument stems from different songs.
1 code implementation • 27 Feb 2024 • Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux
Existing NF-based methods focused on estimating the magnitude of the HRTF from a given sound source direction, and the magnitude is converted to a finite impulse response (FIR) filter.
no code implementations • 9 Feb 2024 • Haocheng Liu, Teysir Baoueb, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard
Diffusion models are receiving growing interest for a variety of signal generation tasks such as speech or music synthesis.
no code implementations • 30 Jan 2024 • Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard
Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation.
no code implementations • 12 Dec 2023 • Zexu Pan, Gordon Wichern, François G. Germain, Sameer Khurana, Jonathan Le Roux
Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker speech signal, in which the attention is derived from the cortical activity.
no code implementations • 30 Oct 2023 • Zexu Pan, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux
Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers.
no code implementations • 16 Oct 2023 • Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux
The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio.
1 code implementation • 14 Aug 2023 • Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji
A significant source of this improvement was making the simulated data better match real cinematic audio, which we further investigate in detail.
no code implementations • 27 Jun 2023 • Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.
no code implementations • 4 Apr 2023 • Ke Chen, Gordon Wichern, François G. Germain, Jonathan Le Roux
In this paper, we propose a self-supervised learning framework for music source separation inspired by the HuBERT speech representation model.
no code implementations • 7 Mar 2023 • Christoph Boeddeker, Aswin Shanmugam Subramanian, Gordon Wichern, Reinhold Haeb-Umbach, Jonathan Le Roux
Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly.
Ranked #1 on Speech Recognition on LibriCSS (using extra training data)
no code implementations • 14 Dec 2022 • Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux
In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX - understood to include ambient noise and natural sound events).
no code implementations • 9 Dec 2022 • Darius Petermann, Gordon Wichern, Aswin Subramanian, Jonathan Le Roux
We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features.
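The natural distance for embeddings on a hyperbolic manifold is the geodesic distance of the Poincaré ball, the standard model for this kind of hierarchical embedding. The NumPy sketch below is illustrative only and is not the paper's implementation; the `eps` guard is an assumption for numerical safety.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball.

    Points near the boundary are exponentially far apart, which is what
    makes the space well suited to embedding hierarchies.
    """
    sq = np.sum((u - v) ** 2)
    nu, nv = np.sum(u ** 2), np.sum(v ** 2)
    # arccosh argument grows as either point approaches the boundary
    x = 1 + 2 * sq / ((1 - nu) * (1 - nv) + eps)
    return np.arccosh(x)
```

In such a geometry, coarse categories can sit near the origin and fine-grained sources near the boundary, so radius encodes depth in the hierarchy.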
1 code implementation • 22 Nov 2022 • Dimitrios Bralios, Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux
During inference, we can dynamically adjust how many processing blocks and iterations of a specific block an input signal needs using a gating module.
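The gating idea above can be caricatured as an early-exit loop: keep applying a block while a gate judges another pass worthwhile. Here `block` and `gate` are hypothetical callables standing in for the paper's learned modules, and the threshold is an arbitrary assumption.

```python
import numpy as np

def iterate_with_gate(x, block, gate, max_iters=8, threshold=0.5):
    """Apply `block` repeatedly, stopping early when the (learned) gate
    estimates that further iterations are no longer beneficial."""
    for _ in range(max_iters):
        x = block(x)
        if gate(x) < threshold:  # gate scores the benefit of another pass
            break
    return x
```

The same pattern extends to skipping entire blocks of a stack, trading compute for quality per input.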
no code implementations • 15 Nov 2022 • Rohith Aralikatti, Christoph Boeddeker, Gordon Wichern, Aswin Shanmugam Subramanian, Jonathan Le Roux
This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation.
1 code implementation • 11 Nov 2022 • Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux
Recent research has shown remarkable performance in leveraging multiple extraneous conditional and non-mutually exclusive semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries.
no code implementations • 4 Nov 2022 • Hao Yen, François G. Germain, Gordon Wichern, Jonathan Le Roux
Diffusion models have recently shown promising results for difficult enhancement tasks such as the conditional and unconditional restoration of natural images and audio signals.
no code implementations • 2 Nov 2022 • Zexu Pan, Gordon Wichern, François G. Germain, Aswin Subramanian, Jonathan Le Roux
Speaker diarization is well studied for constrained audio but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers.
no code implementations • 7 Apr 2022 • Efthymios Tzinis, Gordon Wichern, Aswin Subramanian, Paris Smaragdis, Jonathan Le Roux
We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location, etc.).
no code implementations • 8 Mar 2022 • Olga Slizovskaia, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux
Existing systems for sound event localization and detection (SELD) typically operate by estimating a source location for all classes at every time instant.
no code implementations • 1 Mar 2022 • Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
Automatic Speech Recognition (ASR)
no code implementations • 18 Feb 2022 • Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.
Ranked #23 on Video Question Answering on NExT-QA
no code implementations • 1 Nov 2021 • Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux
The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production.
Automatic Speech Recognition (ASR)
3 code implementations • 19 Oct 2021 • Darius Petermann, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux
The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research.
no code implementations • 13 Oct 2021 • Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori
In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8).
no code implementations • 11 Oct 2021 • Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
Pseudo-labeling (PL), a semi-supervised learning (SSL) method where a seed model performs self-training using pseudo-labels generated from untranscribed speech, has been shown to enhance the performance of end-to-end automatic speech recognition (ASR).
Automatic Speech Recognition (ASR)
no code implementations • ICCV 2021 • Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian
At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal sub-graph embeddings of the visual graph using multi-head attention.
no code implementations • 4 Aug 2021 • Chiori Hori, Takaaki Hori, Jonathan Le Roux
A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to each other.
no code implementations • 2 Jul 2021 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
Automatic Speech Recognition (ASR)
no code implementations • 16 Jun 2021 • Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.
Automatic Speech Recognition (ASR)
no code implementations • 19 Apr 2021 • Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux
In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention.
Automatic Speech Recognition (ASR)
no code implementations • 7 Apr 2021 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
Automatic Speech Recognition (ASR)
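The restricted/dilated attention pattern described above can be visualized as a boolean mask: a dense window around each query frame plus a subsampled set of distant frames. All window and dilation parameters below are hypothetical choices for illustration, not the paper's configuration.

```python
import numpy as np

def time_restricted_mask(T, look_back=4, look_ahead=2, dilation=4, dilated_span=16):
    """Boolean (T, T) attention mask: full-resolution local neighborhood
    plus dilated (subsampled) access to more distant past frames."""
    mask = np.zeros((T, T), dtype=bool)
    for q in range(T):
        # dense window at full temporal resolution
        lo, hi = max(0, q - look_back), min(T, q + look_ahead + 1)
        mask[q, lo:hi] = True
        # distant context, summarized by attending only every `dilation` frames
        for k in range(q - look_back - 1, max(-1, q - dilated_span - 1), -dilation):
            if k >= 0:
                mask[q, k] = True
    return mask
```

The mask keeps per-frame cost roughly constant while still letting each query reach far-away context at reduced resolution.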
no code implementations • 26 Nov 2020 • Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux
The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched.
Automatic Speech Recognition (ASR)
no code implementations • 29 Oct 2020 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model.
Automatic Speech Recognition (ASR)
no code implementations • 22 Oct 2020 • Yun-Ning Hung, Gordon Wichern, Jonathan Le Roux
Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain.
no code implementations • 23 Sep 2020 • Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux
In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers.
1 code implementation • 25 Jul 2020 • Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan Le Roux
Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter.
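One way to remove the hand-selected threshold, in the spirit of this work, is to set it adaptively from the running history of observed gradient norms. The class below is a rough NumPy sketch under that assumption; the percentile value and interface are arbitrary, not the authors' code.

```python
import numpy as np

class AdaptiveClipper:
    """Clip gradients to a threshold chosen as a percentile of the
    gradient norms seen so far, instead of a fixed hand-tuned value."""

    def __init__(self, percentile=10.0):
        self.percentile = percentile
        self.history = []

    def clip(self, grad):
        norm = np.linalg.norm(grad)
        self.history.append(norm)
        threshold = np.percentile(self.history, self.percentile)
        if norm > threshold > 0:
            grad = grad * (threshold / norm)  # rescale to the threshold norm
        return grad
```

Because the threshold tracks the optimization run itself, the same percentile setting can transfer across models and datasets where a fixed threshold would not.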
no code implementations • 8 Jul 2020 • Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content.
no code implementations • 2 Jun 2020 • Tejas Jayashankar, Jonathan Le Roux, Pierre Moulin
Various adversarial audio attacks have recently been developed to fool automatic speech recognition (ASR) systems.
Automatic Speech Recognition (ASR)
no code implementations • 14 Feb 2020 • Leda Sari, Niko Moritz, Takaaki Hori, Jonathan Le Roux
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
Automatic Speech Recognition (ASR)
no code implementations • 10 Feb 2020 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios.
no code implementations • 8 Jan 2020 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR).
Automatic Speech Recognition (ASR)
no code implementations • 6 Nov 2019 • Fatemeh Pishdadian, Gordon Wichern, Jonathan Le Roux
In this scenario, weak labels are defined in contrast with strong time-frequency (TF) labels such as those obtained from isolated sources. They refer either to frame-level weak labels, where one only has access to the time periods when different sources are active in an audio mixture, or to clip-level weak labels that only indicate the presence or absence of sounds in an entire audio clip.
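Both weak-label granularities can be derived mechanically from strong TF labels by collapsing axes. The helper below is a hypothetical illustration, assuming binary labels stored as a `(sources, frames, frequencies)` array; it is not the paper's data pipeline.

```python
import numpy as np

def tf_to_weak_labels(strong):
    """Collapse strong time-frequency labels into the two weak variants.

    strong: boolean array of shape (sources, frames, frequencies).
    Returns (frame_level, clip_level):
      frame_level[s, t] -- source s is active anywhere in frame t
      clip_level[s]     -- source s occurs anywhere in the clip
    """
    frame_level = strong.any(axis=2)
    clip_level = strong.any(axis=(1, 2))
    return frame_level, clip_level
```

Going the other way (weak to strong) is the hard learning problem the paper addresses; this direction only shows how much information each weak-label type discards.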
no code implementations • 23 Oct 2019 • Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo
They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known.
no code implementations • 22 Oct 2019 • Matthew Maciejewski, Gordon Wichern, Emmett McQuinn, Jonathan Le Roux
While significant advances have been made with respect to the separation of overlapping speech signals, studies have been largely constrained to mixtures of clean, near anechoic speech, not representative of many real-world scenarios.
Sound • Audio and Speech Processing
no code implementations • 15 Oct 2019 • Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.
no code implementations • 18 Sep 2019 • Ethan Manilow, Gordon Wichern, Prem Seetharaman, Jonathan Le Roux
In this paper, we present the synthesized Lakh dataset (Slakh) as a new tool for music source separation research.
1 code implementation • 2 Jul 2019 • Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, Jonathan Le Roux
Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem.
Ranked #15 on Speech Separation on WHAMR!
no code implementations • 8 May 2019 • Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, John R. Hershey
For learnable bases, shorter windows (2.5 ms) work best on all tasks.
no code implementations • 7 Nov 2018 • Prem Seetharaman, Gordon Wichern, Shrikant Venkataramani, Jonathan Le Roux
Isolating individual instruments in a musical mixture has a myriad of potential applications, and seems imminently achievable given the levels of performance reached by recent deep learning methods.
no code implementations • 6 Nov 2018 • Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo
These estimates, together with a weighting scheme in the time-frequency domain, based on confidence in the separation quality, are used to train a deep learning model that can be used for single-channel separation, where no source direction information is available.
1 code implementation • 6 Nov 2018 • Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, John R. Hershey
In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality.
Sound • Audio and Speech Processing
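The scale-invariant variant of this measure, discussed in this line of work, first projects the estimate onto the reference so that the score ignores overall gain. The following minimal NumPy sketch follows the standard definition and is not the authors' reference implementation; the `eps` guard is an added assumption.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    The optimal scaling of the reference is found by projection, so the
    measure is invariant to rescaling of the estimate.
    """
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled reference = "true" component
    noise = estimate - target           # everything else counts as distortion
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))
```

Used as a training objective, the scale invariance prevents a model from gaming the score by simply shrinking its output.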
no code implementations • 2 Nov 2018 • Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux
To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal.
Automatic Speech Recognition (ASR)
no code implementations • 2 Oct 2018 • Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey
Here, we propose "magbook", "phasebook", and "combook", three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks.
no code implementations • 27 Sep 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey
Several multi-lingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework.
Automatic Speech Recognition (ASR)
no code implementations • ACL 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey
In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner.
no code implementations • 26 Apr 2018 • Zhong-Qiu Wang, Jonathan Le Roux, DeLiang Wang, John R. Hershey
In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers.
no code implementations • 18 Nov 2016 • Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, Nima Mesgarani
Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks.
2 code implementations • NeurIPS 2016 • Scott Wisdom, Thomas Powers, John R. Hershey, Jonathan Le Roux, Les Atlas
To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix.
Ranked #25 on Sequential Image Classification on Sequential MNIST
Open-Ended Question Answering • Sequential Image Classification
2 code implementations • 7 Jul 2016 • Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey
In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task.
Automatic Speech Recognition (ASR)
8 code implementations • 18 Aug 2015 • John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe
The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources.
Ranked #30 on Speech Separation on WSJ0-2mix
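The deep clustering objective compares the affinity matrix of learned time-frequency embeddings with that of the ideal source assignments; a well-known algebraic expansion avoids ever forming the full affinity matrices. The NumPy sketch below follows that expansion and is illustrative, not the authors' code.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """||V V^T - Y Y^T||_F^2 computed without the T x T affinity matrices.

    V: (T, D) embeddings, one per time-frequency bin.
    Y: (T, C) one-hot source-assignment indicators.
    The expansion uses only D x D, D x C, and C x C products.
    """
    return (np.sum((V.T @ V) ** 2)
            - 2 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))
```

Because only pairwise affinities are compared, the loss is invariant to permuting the source labels, which is what lets the method handle an arbitrary number of sources of the same type.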
no code implementations • 9 Sep 2014 • John R. Hershey, Jonathan Le Roux, Felix Weninger
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm.
1 code implementation • 15 Jan 2013 • Vamsi K. Potluru, Sergey M. Plis, Jonathan Le Roux, Barak A. Pearlmutter, Vince D. Calhoun, Thomas P. Hayes
However, present algorithms designed for optimizing the mixed norm $L_1/L_2$ are slow, and other formulations for sparse NMF have been proposed, such as those based on the $L_1$ and $L_0$ norms.