Search Results for author: John R. Hershey

Found 34 papers, 5 papers with code

AudioSlots: A slot-centric generative model for audio separation

no code implementations9 May 2023 Pradyumna Reddy, Scott Wisdom, Klaus Greff, John R. Hershey, Thomas Kipf

We discuss the results and limitations of our approach in detail, and further outline potential ways to overcome the limitations and directions for future work.

blind source separation · Speech Separation

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

no code implementations20 Jul 2022 Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in training and evaluation data, and failure to account for the trade-off between preservation of on-screen sounds and suppression of off-screen sounds.

CycleGAN-Based Unpaired Speech Dereverberation

no code implementations29 Mar 2022 Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey

Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance.

Speech Dereverberation
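For intuition on how training without paired data can work here: below is a minimal sketch of a generic CycleGAN-style cycle-consistency term over the dry and reverberant domains. The generator names (`dereverb`, `reverb`) are illustrative placeholders, and the adversarial terms of a full CycleGAN are omitted; this is a sketch of the general technique, not the paper's exact objective.

```python
import numpy as np

def cycle_consistency_loss(x_dry, x_rev, dereverb, reverb):
    """Generic cycle-consistency term for unpaired dereverberation.

    dereverb and reverb are the two generators (callables on feature
    arrays); x_dry and x_rev are unpaired examples from each domain.
    Returns the L1 reconstruction error after a round trip through
    both generators.
    """
    loss_rev = np.mean(np.abs(reverb(dereverb(x_rev)) - x_rev))
    loss_dry = np.mean(np.abs(dereverb(reverb(x_dry)) - x_dry))
    return loss_rev + loss_dry
```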

Improving Bird Classification with Unsupervised Sound Separation

no code implementations7 Oct 2021 Tom Denton, Scott Wisdom, John R. Hershey

This paper addresses the problem of species classification in bird song recordings.

Classification

Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention

no code implementations17 Jun 2021 Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos.

Unsupervised Pre-training

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

no code implementations1 Jun 2021 Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation.
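The "sparsity losses" mentioned above invite a small illustration: one plausible penalty is an L1-over-L2 ratio of per-source energies, which is smallest when energy concentrates in few outputs. This is a hypothetical sketch of the idea, not the paper's exact formulation (the companion MixIT loss is sketched under the NeurIPS 2020 entry below).

```python
import numpy as np

def sparsity_penalty(est_sources, eps=1e-8):
    # Hypothetical over-separation penalty: L1/L2 ratio of per-source
    # RMS energies, ranging from 1 (one active source) to sqrt(M)
    # (energy spread evenly). Adding it to a separation loss
    # discourages spreading one source across many outputs.
    rms = np.sqrt(np.mean(est_sources ** 2, axis=1) + eps)  # (M,)
    return np.sum(rms) / (np.sqrt(np.sum(rms ** 2)) + eps)
```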

Self-Supervised Learning from Automatically Separated Sound Scenes

1 code implementation5 May 2021 Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings.

Contrastive Learning · Self-Supervised Learning

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

no code implementations ICLR 2021 Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.

Scene Understanding

Unsupervised Sound Separation Using Mixture Invariant Training

no code implementations NeurIPS 2020 Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey

In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources.

Speech Enhancement · Speech Separation +1
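MixIT trains without ground-truth sources: the model receives the sum of two reference mixtures and must output sources that can be partitioned to reconstruct each mixture. A minimal NumPy sketch of the loss, using negative SNR as the per-mixture term and exhaustive search over assignments (the "efficient MixIT" entry above reduces this 2^M cost):

```python
import itertools
import numpy as np

def snr_loss(ref, est, eps=1e-8):
    """Negative SNR in dB between a reference and an estimate."""
    return -10.0 * np.log10(
        np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps) + eps)

def mixit_loss(mix1, mix2, est_sources):
    """Mixture invariant training loss, minimal sketch.

    The model is fed mix1 + mix2 and predicts est_sources of shape
    (M, T). Each estimated source is assigned to one of the two
    mixtures; the loss is the best total over all 2**M assignments.
    """
    M = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        a = np.asarray(assign)
        remix1 = est_sources[a == 0].sum(axis=0)
        remix2 = est_sources[a == 1].sum(axis=0)
        best = min(best, snr_loss(mix1, remix1) + snr_loss(mix2, remix2))
    return best
```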

Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

no code implementations18 Nov 2019 Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey

This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation.

Speaker Separation · Speech Enhancement +3
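The spatial half of such an alternation is typically a mask-based beamformer. Below is a minimal sketch of the generic mask-based MVDR recipe (reference microphone 0); it illustrates the standard building block, not the paper's exact system.

```python
import numpy as np

def mvdr_beamform(mix_stft, speech_mask, noise_mask, eps=1e-8):
    """Mask-based MVDR beamforming, minimal sketch.

    mix_stft: (C, F, T) complex multichannel STFT; the masks are
    (F, T) outputs of a separation network. Returns an (F, T) STFT.
    """
    C, F, T = mix_stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = mix_stft[:, f, :]  # (C, T)
        # Mask-weighted spatial covariance estimates for speech/noise.
        phi_s = (speech_mask[f] * X) @ X.conj().T / (speech_mask[f].sum() + eps)
        phi_n = (noise_mask[f] * X) @ X.conj().T / (noise_mask[f].sum() + eps)
        phi_n += eps * np.eye(C)
        num = np.linalg.solve(phi_n, phi_s)      # phi_n^{-1} phi_s
        w = num[:, 0] / (np.trace(num) + eps)    # reference mic 0
        out[f] = w.conj() @ X
    return out
```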

Improving Universal Sound Separation Using Sound Classification

no code implementations18 Nov 2019 Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, Daniel P. W. Ellis

Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification.

Audio Source Separation · Classification +2

Differentiable Consistency Constraints for Improved Deep Speech Enhancement

no code implementations20 Nov 2018 Scott Wisdom, John R. Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, Rif A. Saurous

Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.

Sound · Audio and Speech Processing
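Mixture consistency constrains the estimated sources to sum to the input mixture. The simplest version, a uniform projection of the residual, is nearly a one-liner (a sketch of the constraint; the paper also studies weighted variants):

```python
import numpy as np

def mixture_consistency(est_sources, mixture):
    # Spread the residual uniformly so estimates sum to the mixture.
    # est_sources: (N, T); mixture: (T,).
    residual = mixture - est_sources.sum(axis=0)
    return est_sources + residual / est_sources.shape[0]
```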

SDR - half-baked or well done?

1 code implementation6 Nov 2018 Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, John R. Hershey

In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality.

Sound · Audio and Speech Processing
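The scale-invariant SDR (SI-SDR) advocated in this paper compares the estimate against an optimally scaled reference, making the metric insensitive to overall gain:

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D signals."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # optimally scaled reference
    noise = estimate - target
    return 10.0 * np.log10(
        (np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```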

Phasebook and Friends: Leveraging Discrete Representations for Source Separation

no code implementations2 Oct 2018 Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey

Here, we propose "magbook", "phasebook", and "combook", three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks.

Speaker Separation · Speech Enhancement
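A phasebook-style layer can be sketched as a softmax over a small codebook of candidate phases, combined by expectation into a complex phase factor per time-frequency bin. The uniform codebook below is an illustrative assumption; in the paper the codebook entries can also be learned.

```python
import numpy as np

def phasebook_layer(logits, K=8):
    """Softmax over K discrete phase candidates -> complex phase factor.

    logits: (F, T, K) network outputs. Returns an (F, T) complex array
    whose entries are expected values of e^{j*phi_k} under the softmax.
    Codebook values here are a fixed uniform grid, an assumption.
    """
    codebook = np.exp(1j * 2 * np.pi * np.arange(K) / K)   # (K,)
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)                     # softmax
    return p @ codebook
```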

End-to-End Multi-Lingual Multi-Speaker Speech Recognition

no code implementations27 Sep 2018 Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

Several multi-lingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +1

A Purely End-to-end System for Multi-speaker Speech Recognition

no code implementations ACL 2018 Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner.

speech-recognition · Speech Recognition

End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

no code implementations26 Apr 2018 Zhong-Qiu Wang, Jonathan Le Roux, DeLiang Wang, John R. Hershey

In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers.

Speech Separation
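The unfolded iterations are Griffin-Lim-style alternations between the time and time-frequency domains. Below is a fixed (untrained) SciPy analogue of such unfolded STFT/iSTFT layers, assuming frame counts roughly round-trip; the paper's layers are trained end-to-end.

```python
import numpy as np
from scipy.signal import stft, istft

def unfolded_phase_reconstruction(magnitude, init_phase, n_iter=3,
                                  nperseg=512, noverlap=256):
    """Unrolled Griffin-Lim-style iterations over STFT/iSTFT.

    magnitude: (F, T) estimated source magnitude; init_phase: (F, T)
    initial phase (e.g., taken from the mixture).
    """
    spec = magnitude * np.exp(1j * init_phase)
    for _ in range(n_iter):
        _, signal = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, rebuilt = stft(signal, nperseg=nperseg, noverlap=noverlap)
        T = min(magnitude.shape[1], rebuilt.shape[1])
        # Keep the estimated magnitude; update only the phase.
        spec = magnitude[:, :T] * np.exp(1j * np.angle(rebuilt[:, :T]))
    _, signal = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return signal
```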

Multichannel End-to-end Speech Recognition

no code implementations ICML 2017 Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology.

Language Modelling · Speech Enhancement +2

Attention-Based Multimodal Fusion for Video Description

no code implementations ICCV 2017 Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks

Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs).

Sentence · Video Description

Deep Clustering and Conventional Networks for Music Separation: Stronger Together

no code implementations18 Nov 2016 Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, Nima Mesgarani

Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks.

Clustering · Deep Clustering +3

Full-Capacity Unitary Recurrent Neural Networks

2 code implementations NeurIPS 2016 Scott Wisdom, Thomas Powers, John R. Hershey, Jonathan Le Roux, Les Atlas

To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix.

Open-Ended Question Answering · Sequential Image Classification
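Optimizing over all unitary matrices can be done with a Cayley-transform update built from a skew-Hermitian combination of the gradient, which keeps the recurrence matrix exactly unitary. A minimal NumPy sketch of such a unitarity-preserving update, following the Cayley-transform approach this paper builds on:

```python
import numpy as np

def unitary_update(W, G, lr=0.01):
    """One unitarity-preserving recurrence-matrix update.

    W: (N, N) unitary recurrence matrix; G: gradient of the loss with
    respect to W. A = G W^H - W G^H is skew-Hermitian, so the Cayley
    transform below maps W to another exactly unitary matrix.
    """
    A = G @ W.conj().T - W @ G.conj().T
    I = np.eye(W.shape[0], dtype=W.dtype)
    return np.linalg.solve(I + (lr / 2) * A, (I - (lr / 2) * A) @ W)
```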

Single-Channel Multi-Speaker Separation using Deep Clustering

2 code implementations7 Jul 2016 Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey

In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +5

Global-Local Face Upsampling Network

no code implementations23 Mar 2016 Oncel Tuzel, Yuichi Taguchi, John R. Hershey

In our deep network architecture, the global and local constraints that define a face can be efficiently modeled and learned end-to-end using training data.

Face Hallucination · Face Reconstruction +2

Deep clustering: Discriminative embeddings for segmentation and separation

8 code implementations18 Aug 2015 John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe

The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources.

Clustering · Deep Clustering +3
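The deep clustering objective compares the pairwise affinities of learned embeddings V (TF x D) with those of ideal assignments Y (TF x C): |VV^T - YY^T|_F^2. A minimal NumPy rendering, computed without forming the huge TF x TF affinity matrices:

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """|V V^T - Y Y^T|_F^2 via the low-rank identity
    |V^T V|_F^2 - 2 |V^T Y|_F^2 + |Y^T Y|_F^2.

    V: (TF, D) unit-norm embeddings; Y: (TF, C) one-hot assignments.
    """
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)
```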

Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures

no code implementations9 Sep 2014 John R. Hershey, Jonathan Le Roux, Felix Weninger

Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm.

Speech Enhancement
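Deep unfolding turns the iterations of a model-based inference algorithm into network layers with untied parameters. As a concrete illustration, unrolling multiplicative NMF updates with one dictionary per layer yields the kind of non-negative network the abstract describes; the specifics below (Euclidean NMF, per-layer dictionaries) are an illustrative choice, not the paper's exact model.

```python
import numpy as np

def unfolded_nmf_layers(X, dictionaries, H0, eps=1e-8):
    """K unrolled multiplicative NMF updates, one per "layer".

    X: (F, T) nonnegative spectrogram; dictionaries: list of (F, R)
    nonnegative matrices, one per unfolded layer; H0: (R, T) init.
    With tied dictionaries this is plain NMF inference; untying and
    training them turns the iterations into a deep network.
    """
    H = H0
    for W in dictionaries:
        H = H * (W.T @ X) / (W.T @ (W @ H) + eps)  # multiplicative update
    return H
```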
