Search Results for author: Jonathan Le Roux

Found 39 papers, 7 papers with code

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

no code implementations13 Oct 2021 Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8).

Region Proposal

Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy

no code implementations11 Oct 2021 Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

Pseudo-labeling (PL), a semi-supervised learning (SSL) method where a seed model performs self-training using pseudo-labels generated from untranscribed speech, has been shown to enhance the performance of end-to-end automatic speech recognition (ASR).

automatic-speech-recognition Language Modelling +1
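
The self-training recipe the abstract describes can be sketched as a confidence-filtered pseudo-labeling step (a minimal, hedged sketch: the callable model interface, the threshold value, and the toy data are illustrative, not the paper's Conformer-based setup):

```python
def pseudo_label(seed_model, unlabeled, threshold=0.9):
    """Self-training step: keep only pseudo-labels the seed model is
    confident about, to be mixed back into training."""
    selected = []
    for utt in unlabeled:
        label, confidence = seed_model(utt)
        if confidence >= threshold:
            selected.append((utt, label))
    return selected

# toy "model": maps an utterance id to a (transcript, confidence) pair
toy = {"u1": ("hello", 0.95), "u2": ("world", 0.5)}.__getitem__
kept = pseudo_label(toy, ["u1", "u2"])
```

The paper's contributions (the Conformer encoder and the initialization strategy) sit on top of a loop like this and are not shown here.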

Visual Scene Graphs for Audio Source Separation

no code implementations ICCV 2021 Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian

At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal sub-graph embeddings of the visual graph using multi-head attention.

Audio Source Separation

Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

no code implementations4 Aug 2021 Chiori Hori, Takaaki Hori, Jonathan Le Roux

A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to each other.

Video Captioning

Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition

no code implementations2 Jul 2021 Niko Moritz, Takaaki Hori, Jonathan Le Roux

Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.

automatic-speech-recognition End-To-End Speech Recognition +1

Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition

no code implementations16 Jun 2021 Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.

automatic-speech-recognition End-To-End Speech Recognition +1
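
The mean-teacher-style interaction between the online and offline models comes down to a momentum (exponential moving average) update of the offline model's parameters; a minimal sketch, assuming a plain dict of parameters and an illustrative momentum value `alpha`:

```python
def momentum_update(offline, online, alpha=0.999):
    """Update offline (teacher) parameters as an exponential moving
    average of the online (student) parameters."""
    return {name: alpha * w + (1.0 - alpha) * online[name]
            for name, w in offline.items()}

# toy example with a single scalar "parameter"
offline = {"w": 0.0}
online = {"w": 1.0}
offline = momentum_update(offline, online, alpha=0.9)
```

The slowly-moving offline model then generates the pseudo-labels that train the online model.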

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

no code implementations19 Apr 2021 Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention.

automatic-speech-recognition End-To-End Speech Recognition +1

Capturing Multi-Resolution Context by Dilated Self-Attention

no code implementations7 Apr 2021 Niko Moritz, Takaaki Hori, Jonathan Le Roux

The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.

automatic-speech-recognition Machine Translation +2
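
The two-resolution idea in the abstract, full-resolution keys near the query plus pooled summaries of distant frames, can be sketched as follows (a hedged sketch: the window size, block size, mean-pooling, and scalar "frames" are illustrative simplifications of the paper's dilation mechanism):

```python
def dilated_keys(x, t, window=2, block=3):
    """Keys visible to query frame t: nearby frames at full resolution,
    plus mean-pooled summaries of distant frames at lower resolution."""
    T = len(x)
    lo, hi = max(0, t - window), min(T, t + window + 1)
    local = x[lo:hi]
    pool = lambda seg: sum(seg) / len(seg)
    summaries = [pool(x[s:min(s + block, lo)]) for s in range(0, lo, block)]
    summaries += [pool(x[s:min(s + block, T)]) for s in range(hi, T, block)]
    return local + summaries

frames = [float(i) for i in range(12)]   # 12 scalar "frames"
keys = dilated_keys(frames, t=6)         # 5 local + 3 pooled = 8 keys
```

Self-attention would then score the query against this shortened key list instead of all 12 frames.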

Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training

no code implementations26 Nov 2020 Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux

The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched.

automatic-speech-recognition Speech Recognition +1

Semi-Supervised Speech Recognition via Graph-based Temporal Classification

no code implementations29 Oct 2020 Niko Moritz, Takaaki Hori, Jonathan Le Roux

However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model.

automatic-speech-recognition Classification +2

Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision

no code implementations22 Oct 2020 Yun-Ning Hung, Gordon Wichern, Jonathan Le Roux

Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain.

Music Source Separation

Multi-Pass Transformer for Machine Translation

no code implementations23 Sep 2020 Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux

In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers.

Machine Translation Neural Architecture Search +1

AutoClip: Adaptive Gradient Clipping for Source Separation Networks

1 code implementation25 Jul 2020 Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan Le Roux

Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter.

Audio Source Separation
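
AutoClip's answer to the hand-selected threshold is to clip each gradient at a percentile of the gradient norms observed so far. A minimal sketch (the nearest-rank percentile and the percentile values here are illustrative; the released implementation may compute the percentile differently):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a list of values (0 < p <= 100)."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100.0 * len(s)) - 1)
    return s[k]

def autoclip(grad_norm_history, new_norm, p=10):
    """Record the new gradient norm, then return the clipping threshold:
    the p-th percentile of all norms observed so far."""
    grad_norm_history.append(new_norm)
    return percentile(grad_norm_history, p)

history = []
for norm in [5.0, 1.0, 8.0, 2.0]:
    clip_at = autoclip(history, norm, p=50)
# the optimizer would then rescale gradients whose norm exceeds clip_at
```

The threshold adapts automatically as training progresses, which is the point of the method.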

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

no code implementations8 Jul 2020 Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian

Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to indulge in a question-answer dialog with a human about the audio-visual content.

Graph Representation Learning

Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR

no code implementations14 Feb 2020 Leda Sari, Niko Moritz, Takaaki Hori, Jonathan Le Roux

We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).

automatic-speech-recognition End-To-End Speech Recognition +1

End-to-End Multi-speaker Speech Recognition with Transformer

no code implementations10 Feb 2020 Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios.

Speech Recognition

Streaming automatic speech recognition with the transformer model

no code implementations8 Jan 2020 Niko Moritz, Takaaki Hori, Jonathan Le Roux

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR).

automatic-speech-recognition End-To-End Speech Recognition +1

Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision

no code implementations6 Nov 2019 Fatemeh Pishdadian, Gordon Wichern, Jonathan Le Roux

In this scenario, weak labels are defined in contrast with strong time-frequency (TF) labels such as those obtained from isolated sources. They refer either to frame-level weak labels, where one only has access to the time periods when different sources are active in an audio mixture, or to clip-level weak labels, which only indicate the presence or absence of sounds in an entire audio clip.

Audio Source Separation

Bootstrapping deep music separation from primitive auditory grouping principles

no code implementations23 Oct 2019 Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

They are trained on synthetic mixtures of audio made from isolated sound source recordings so that ground truth for the separation is known.

Music Source Separation

WHAMR!: Noisy and Reverberant Single-Channel Speech Separation

no code implementations22 Oct 2019 Matthew Maciejewski, Gordon Wichern, Emmett McQuinn, Jonathan Le Roux

While significant advances have been made with respect to the separation of overlapping speech signals, studies have been largely constrained to mixtures of clean, near anechoic speech, not representative of many real-world scenarios.

Sound Audio and Speech Processing

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

no code implementations15 Oct 2019 Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.

Curriculum Learning Speech Recognition +1

WHAM!: Extending Speech Separation to Noisy Environments

1 code implementation2 Jul 2019 Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, Jonathan Le Roux

Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem.

Speech Separation

Class-conditional embeddings for music source separation

no code implementations7 Nov 2018 Prem Seetharaman, Gordon Wichern, Shrikant Venkataramani, Jonathan Le Roux

Isolating individual instruments in a musical mixture has a myriad of potential applications, and seems imminently achievable given the levels of performance reached by recent deep learning methods.

Deep Clustering Music Source Separation

Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures

no code implementations6 Nov 2018 Prem Seetharaman, Gordon Wichern, Jonathan Le Roux, Bryan Pardo

These estimates, together with a weighting scheme in the time-frequency domain, based on confidence in the separation quality, are used to train a deep learning model that can be used for single-channel separation, where no source direction information is available.

Semantic Segmentation Unsupervised Spatial Clustering

SDR - half-baked or well done?

1 code implementation6 Nov 2018 Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, John R. Hershey

In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality.

Sound Audio and Speech Processing
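
The scale-invariant SDR (SI-SDR) advocated in this paper projects the estimate onto the reference before comparing energies, so that rescaling the estimate does not change the score. A minimal sketch on plain Python lists:

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB: energy of the projection of the
    estimate onto the reference, over the residual energy."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy
    target = [alpha * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    return 10.0 * math.log10(
        sum(t * t for t in target) / sum(n * n for n in noise))

ref = [1.0, 2.0, 3.0, 4.0]
est = [1.1, 1.9, 3.2, 3.8]
score = si_sdr(est, ref)
```

By construction, `si_sdr([2 * x for x in est], ref)` returns the same value as `si_sdr(est, ref)`.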

Cycle-consistency training for end-to-end speech recognition

no code implementations2 Nov 2018 Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux

To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal.

automatic-speech-recognition End-To-End Speech Recognition +2

Phasebook and Friends: Leveraging Discrete Representations for Source Separation

no code implementations2 Oct 2018 Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey

Here, we propose "magbook", "phasebook", and "combook", three new types of layers based on discrete representations that can be used to estimate complex time-frequency masks.

Speaker Separation Speech Enhancement
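
One plausible reading of these codebook layers: the network outputs a softmax distribution over a small set of discrete mask values, and the mask is the expectation under that distribution. A hedged sketch (the codebook values and the expectation-based combination here are illustrative, not necessarily the paper's exact scheme):

```python
import math

def expected_mask(logits, codebook):
    """Mask value as the expectation of discrete codebook entries
    under a softmax distribution over the logits."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return sum(p / z * v for p, v in zip(exps, codebook))

codebook = [0.0, 0.5, 1.0]                         # illustrative magnitude-mask values
mask = expected_mask([0.0, 0.0, 0.0], codebook)    # uniform weights -> mean of entries
```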

A Purely End-to-end System for Multi-speaker Speech Recognition

no code implementations ACL 2018 Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner.

Speech Recognition

End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

no code implementations26 Apr 2018 Zhong-Qiu Wang, Jonathan Le Roux, DeLiang Wang, John R. Hershey

In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers.

Speech Separation

Deep Clustering and Conventional Networks for Music Separation: Stronger Together

no code implementations18 Nov 2016 Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, Nima Mesgarani

Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks.

Deep Clustering Multi-Task Learning +2

Full-Capacity Unitary Recurrent Neural Networks

2 code implementations NeurIPS 2016 Scott Wisdom, Thomas Powers, John R. Hershey, Jonathan Le Roux, Les Atlas

To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix.

Sequential Image Classification
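
"Full capacity" means the recurrence matrix ranges over all unitary matrices rather than a structured subset. The paper optimizes along the unitary manifold itself; a simpler, commonly used alternative shown here only to illustrate the idea is to take an unconstrained gradient step and then project back to the nearest unitary (here real orthogonal) matrix via the SVD:

```python
import numpy as np

def project_to_unitary(w):
    """Nearest unitary matrix to w (polar decomposition via SVD)."""
    u, _, vh = np.linalg.svd(w)
    return u @ vh

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))   # arbitrary full-capacity recurrence matrix
q = project_to_unitary(w)         # q @ q.T is (numerically) the identity
```

This projection is not the paper's update rule, but it shows what an unrestricted unitary parameterization buys: any rotation is reachable.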

Single-Channel Multi-Speaker Separation using Deep Clustering

2 code implementations7 Jul 2016 Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey

In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task.

automatic-speech-recognition Deep Clustering +3

Deep clustering: Discriminative embeddings for segmentation and separation

7 code implementations18 Aug 2015 John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe

The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources.

Deep Clustering Semantic Segmentation +1
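
The deep clustering objective matches the pairwise affinities of the learned embeddings V to those of the ideal source assignments Y, i.e. it minimizes the squared Frobenius norm of VVᵀ - YYᵀ. A small sketch (the toy matrices are illustrative):

```python
import numpy as np

def dc_loss(v, y):
    """Deep clustering objective: the pairwise affinities of the
    embeddings V should match those of the ideal assignments Y."""
    return float(np.linalg.norm(v @ v.T - y @ y.T, "fro") ** 2)

# three T-F bins, two sources; embeddings equal to the one-hot
# assignments give zero loss
y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Because the loss only involves affinities, no class labels are needed, which is what lets the method generalize to novel source types.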

Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures

no code implementations9 Sep 2014 John R. Hershey, Jonathan Le Roux, Felix Weninger

Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm.

Speech Enhancement
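
The kind of multiplicative step that deep unfolding turns into a network layer can be sketched with the classic Euclidean-cost NMF update (a hedged sketch: the update shown is the standard Lee-Seung rule, serving here only to illustrate unfolding an iterative model into layers):

```python
import numpy as np

def multiplicative_update(W, H, V, eps=1e-12):
    """One multiplicative NMF update of H; unfolding a fixed number of
    such steps yields a non-negative deep network."""
    return H * (W.T @ V) / (W.T @ W @ H + eps)

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = W @ np.array([[1.0, 2.0], [3.0, 1.0]])   # a mixture W can explain
H = np.ones((2, 2))
H = multiplicative_update(W, H, V)           # stays non-negative
```

Each unfolded iteration becomes a layer whose parameters can then be trained by backpropagation.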

Block Coordinate Descent for Sparse NMF

1 code implementation15 Jan 2013 Vamsi K. Potluru, Sergey M. Plis, Jonathan Le Roux, Barak A. Pearlmutter, Vince D. Calhoun, Thomas P. Hayes

However, present algorithms designed for optimizing the mixed norm L1/L2 are slow, and other formulations for sparse NMF have been proposed, such as those based on L1 and L0 norms.
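
The mixed norm mentioned above measures sparsity as the ratio of L1 to L2 norms; a minimal sketch of the measure itself (not the paper's block coordinate descent algorithm):

```python
import math

def l1_l2_ratio(x):
    """Mixed-norm sparsity measure ||x||_1 / ||x||_2: it equals 1.0 for
    a vector with a single nonzero entry and grows as mass spreads out."""
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return l1 / l2

sparse = [0.0, 0.0, 3.0, 0.0]
dense = [1.0, 1.0, 1.0, 1.0]
```

Constraining this ratio on the NMF factors is what makes the optimization harder than plain NMF.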
