no code implementations • 10 Nov 2021 • Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed
With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition.
no code implementations • 5 Apr 2021 • Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer
How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area.
no code implementations • 16 Nov 2020 • Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael L. Seltzer
End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks.
no code implementations • 5 Nov 2020 • Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer
There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications.
no code implementations • 4 Jun 2020 • Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf
By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.
1 code implementation • 16 Nov 2019 • Shuo Liu, Gil Keren, Björn Schuller
N-HANS is a Python toolkit for in-the-wild audio enhancement, including speech, music, and general audio denoising, separation, and selective noise or source suppression.
Sound • Audio and Speech Processing
no code implementations • 24 Jun 2019 • Shuo Liu, Gil Keren, Björn Schuller
We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers.
no code implementations • 26 Oct 2018 • Gil Keren, Jing Han, Björn Schuller
We address the problem of speech enhancement generalisation to unseen environments by performing two manipulations.
1 code implementation • 26 Mar 2018 • Gil Keren, Nicholas Cummins, Björn Schuller
Despite their aforementioned advantage in accuracy, contemporary neural networks are generally poorly calibrated and therefore do not produce reliable output probability estimates.
no code implementations • 10 Jan 2018 • Gil Keren, Maximilian Schmitt, Thomas Kehrenberg, Björn Schuller
Neural network models that are not conditioned on class identities have been shown to facilitate knowledge transfer between classes and to be well-suited for one-shot learning tasks.
no code implementations • ICLR 2018 • Gil Keren, Sivan Sabato, Björn Schuller
In contrast, there are known loss functions, as well as novel batch loss functions that we propose, which are aligned with this principle.
2 code implementations • 29 May 2017 • Gil Keren, Sivan Sabato, Björn Schuller
Our experiments show that indeed in almost all cases, losses that are aligned with the Principle of Logit Separation obtain at least 20% relative accuracy improvement in the SLC task compared to losses that are not aligned with it, and sometimes considerably more.
no code implementations • 23 Nov 2016 • Gil Keren, Sivan Sabato, Björn Schuller
We propose incorporating this idea of tunable sensitivity for hard examples in neural network learning, using a new generalization of the cross-entropy gradient step, which can be used in place of the gradient in any gradient-based training method.
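The baseline being generalized here is the standard cross-entropy gradient with respect to the logits, which equals softmax(z) minus the one-hot target; its magnitude grows with the size of the prediction error, which is exactly the sensitivity the paper proposes to make tunable. A minimal sketch of that standard quantity (not the paper's generalized step, whose exact form is not given in this snippet):

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy_grad(logits, target_index):
    """Gradient of cross-entropy loss w.r.t. the logits:
    softmax(z) - one_hot(target). The entry for the target class
    becomes more negative the more wrong the prediction is."""
    p = softmax(logits)
    return [pi - (1.0 if i == target_index else 0.0) for i, pi in enumerate(p)]

# A confident but correct prediction yields a small gradient on the target:
g = cross_entropy_grad([2.0, 0.0, 0.0], target_index=0)
```

The components of this gradient always sum to zero, and a gradient-based optimizer would take a step against it; the proposed method replaces this fixed step with a tunable-sensitivity variant.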
3 code implementations • 18 Feb 2016 • Gil Keren, Björn Schuller
Traditional convolutional layers extract features from patches of data by applying a non-linearity on an affine function of the input.
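As a minimal illustration of that standard patch-wise computation (not the layer this paper proposes), a 1-D convolutional layer applies a non-linearity to an affine function of each input patch; the signal and kernel values below are arbitrary placeholders:

```python
def conv1d(signal, weights, bias):
    """Traditional convolutional layer over a 1-D signal: each output
    is a non-linearity (ReLU here) applied to an affine function
    (dot product plus bias) of one input patch."""
    k = len(weights)
    out = []
    for i in range(len(signal) - k + 1):
        patch = signal[i:i + k]
        affine = sum(w * x for w, x in zip(weights, patch)) + bias
        out.append(max(affine, 0.0))  # ReLU non-linearity
    return out

# Example: the kernel [-1, 1] responds to increases in the signal.
conv1d([0.0, 1.0, 3.0, 2.0], [-1.0, 1.0], 0.0)  # → [1.0, 2.0, 0.0]
```

Each output position is computed independently from its patch, which is the locality property the paper contrasts against its alternative layer design.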