no code implementations • 2 Feb 2024 • Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer, Jon P. Barker
In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results.
no code implementations • 21 Aug 2023 • Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey
The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs.
Automatic Speech Recognition (ASR) +3
no code implementations • 9 May 2023 • Pradyumna Reddy, Scott Wisdom, Klaus Greff, John R. Hershey, Thomas Kipf
We discuss the results and limitations of our approach in detail, and further outline potential ways to overcome the limitations and directions for future work.
no code implementations • 20 Jul 2022 • Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey
We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in training and evaluation data, and failure to account for the trade-off between preservation of on-screen sounds and suppression of off-screen sounds.
no code implementations • 29 Mar 2022 • Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey
Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance.
no code implementations • 7 Oct 2021 • Tom Denton, Scott Wisdom, John R. Hershey
This paper addresses the problem of species classification in bird song recordings.
no code implementations • 30 Jun 2021 • Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani
To make the model computationally feasible, we extend the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers.
no code implementations • 17 Jun 2021 • Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey
We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos.
no code implementations • 1 Jun 2021 • Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey
The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation.
1 code implementation • 5 May 2021 • Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings.
no code implementations • 5 May 2021 • Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John R. Hershey
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
no code implementations • 17 Dec 2020 • Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen
Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years.
no code implementations • 3 Nov 2020 • Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation.
Automatic Speech Recognition (ASR) +4
no code implementations • ICLR 2021 • Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey
For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
no code implementations • NeurIPS 2020 • Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey
In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources.
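The supervised setup described in this excerpt can be illustrated with a minimal sketch: isolated sources are summed sample by sample to form the synthetic mixture, and the sources themselves serve as training targets. The function name `mix_sources` and the toy signals are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def mix_sources(sources):
    """Create a synthetic mixture by summing equal-length source signals.

    `sources` is a list of lists of float samples (hypothetical layout).
    """
    if not sources:
        raise ValueError("at least one source is required")
    length = len(sources[0])
    if any(len(src) != length for src in sources):
        raise ValueError("all sources must have the same length")
    # zip(*sources) groups the samples that share a time index.
    return [sum(samples) for samples in zip(*sources)]


# Toy example: two isolated "ground-truth" sources.
speech = [0.5, -0.25, 0.0]
noise = [0.25, 0.25, -0.5]

mixture = mix_sources([speech, noise])
# A supervised training pair would be (input=mixture, targets=[speech, noise]).
```

In this setup the model only ever sees mixtures whose components are known, which is the property the excerpt attributes to supervised approaches.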
no code implementations • 18 Nov 2019 • Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey
This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation.
no code implementations • 18 Nov 2019 • Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, Daniel P. W. Ellis
Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification.
no code implementations • 8 May 2019 • Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, John R. Hershey
For learnable bases, shorter windows (2.5 ms) work best on all tasks.
no code implementations • 20 Nov 2018 • Scott Wisdom, John R. Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, Rif A. Saurous
Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.
Sound Audio and Speech Processing
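As a rough illustration of what mixture consistency means, the sketch below adjusts a set of separated source estimates so that they sum exactly to the observed mixture. It assumes the simplest scheme, in which the residual is split equally among the sources; the function name `project_mixture_consistent` is hypothetical, and the paper's own formulation may differ (for example, by weighting the sources). Python's built-in complex numbers are used in the usage example to show that the same projection applies to complex-valued coefficients.

```python
def project_mixture_consistent(mixture, estimates):
    """Return adjusted estimates that sum to `mixture` at every index.

    Assumption: the residual (mixture minus sum of estimates) is shared
    equally among the sources. Samples may be real or complex.
    """
    num_sources = len(estimates)
    if num_sources == 0:
        raise ValueError("at least one estimate is required")
    if any(len(src) != len(mixture) for src in estimates):
        raise ValueError("estimates must match the mixture length")
    projected = [list(src) for src in estimates]
    for t, x in enumerate(mixture):
        residual = x - sum(src[t] for src in estimates)
        for src in projected:
            src[t] += residual / num_sources
    return projected


# Toy complex-valued example: one coefficient, two sources.
mix = [1 + 1j]
est = [[0.5 + 0j], [0 + 0.5j]]
consistent = project_mixture_consistent(mix, est)
```

After the projection, summing the adjusted estimates reproduces the mixture up to floating-point error.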
1 code implementation • 6 Nov 2018 • Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, John R. Hershey
In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality.
Sound Audio and Speech Processing
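For reference, the plain signal-to-noise ratio mentioned in this excerpt can be computed as the energy of the reference signal divided by the energy of the error, expressed in decibels. The sketch below is a minimal stdlib-only version; the function name `snr_db` and the handling of zero-energy edge cases are assumptions made for illustration, not details from the paper.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    noise_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_energy == 0.0:
        return math.inf  # perfect reconstruction (assumed convention)
    if signal_energy == 0.0:
        return -math.inf  # silent reference (assumed convention)
    return 10.0 * math.log10(signal_energy / noise_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
score = snr_db(reference, estimate)  # error is 10% of the amplitude, about 20 dB
```

Note that this basic measure changes if the estimate is rescaled, so the scores depend on how the signals are normalized.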
no code implementations • 2 Oct 2018 • Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey
Here, we propose "magbook", "phasebook", and "combook", three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks.
no code implementations • 27 Sep 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey
Several multi-lingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework.
Automatic Speech Recognition (ASR) +1
no code implementations • ACL 2018 • Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey
In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner.
no code implementations • 26 Apr 2018 • Zhong-Qiu Wang, Jonathan Le Roux, DeLiang Wang, John R. Hershey
In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers.
no code implementations • 21 Nov 2017 • Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan
Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients.
no code implementations • ICML 2017 • Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey
The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology.
no code implementations • ICCV 2017 • Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks
Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs).
no code implementations • 18 Nov 2016 • Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, Nima Mesgarani
Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks.
2 code implementations • NeurIPS 2016 • Scott Wisdom, Thomas Powers, John R. Hershey, Jonathan Le Roux, Les Atlas
To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix.
Ranked #25 on Sequential Image Classification on Sequential MNIST
Open-Ended Question Answering Sequential Image Classification
2 code implementations • 7 Jul 2016 • Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey
In this paper, we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task.
Automatic Speech Recognition (ASR) +5
no code implementations • 23 Mar 2016 • Oncel Tuzel, Yuichi Taguchi, John R. Hershey
In our deep network architecture the global and local constraints that define a face can be efficiently modeled and learned end-to-end using training data.
8 code implementations • 18 Aug 2015 • John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe
The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources.
Ranked #30 on Speech Separation on WSJ0-2mix
no code implementations • 9 Sep 2014 • John R. Hershey, Jonathan Le Roux, Felix Weninger
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm.