no code implementations • 3 Nov 2022 • Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic
The audio and the visual encoder neural networks are both based on the Conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution (see the sketch below the task tags).
Tasks: Audio-Visual Speech Recognition • Automatic Speech Recognition • +4
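As a rough illustration of these two streaming ingredients (not the paper's code; the chunk size, mask convention, and layer shapes are assumptions), a chunk-wise self-attention mask and a causal 1-D convolution can be sketched in PyTorch as follows:

```python
import torch

def chunkwise_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask (True = may attend): each query frame sees only frames in
    its own chunk and in earlier chunks, so latency is bounded by the chunk."""
    chunk_id = torch.arange(num_frames) // chunk_size
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)

class CausalConv1d(torch.nn.Module):
    """1-D convolution padded only on the left, so no future frames are used."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = torch.nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(torch.nn.functional.pad(x, (self.left_pad, 0)))

print(chunkwise_attention_mask(num_frames=8, chunk_size=4).int())
```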
no code implementations • 20 Oct 2022 • Desh Raj, Junteng Jia, Jay Mahadeokar, Chunyang Wu, Niko Moritz, Xiaohui Zhang, Ozlem Kalinli
Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed speech while ignoring interfering background speech/noise.
no code implementations • 19 Apr 2022 • Niko Moritz, Frank Seide, Duc Le, Jay Mahadeokar, Christian Fuegen
The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC).
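For readers unfamiliar with the two objectives, the snippet below scores frame-wise encoder log-probabilities with the standard CTC criterion in PyTorch (a generic sketch, not the paper's training setup; the shapes and blank index are assumptions). An RNN-T loss can be applied analogously, e.g. via torchaudio.

```python
import torch

T, N, C = 50, 2, 30                                   # frames, batch size, vocab size (0 = blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # encoder output
targets = torch.randint(1, C, (N, 12))                # reference label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                       # gradients flow back into the encoder
print(float(loss))
```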
no code implementations • 1 Mar 2022 • Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
no code implementations • 1 Nov 2021 • Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux
The recurrent neural network transducer (RNN-T) objective plays a major role in building today's best automatic speech recognition (ASR) systems for production.
no code implementations • 11 Oct 2021 • Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
Pseudo-labeling (PL), a semi-supervised learning (SSL) method where a seed model performs self-training using pseudo-labels generated from untranscribed speech, has been shown to enhance the performance of end-to-end automatic speech recognition (ASR).
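To make the PL recipe concrete, here is a generic self-training loop of the kind the abstract describes (all model, function, and dataset names are placeholders, not the authors' implementation):

```python
# Generic pseudo-labeling (self-training) sketch; every name here is a placeholder.
def pseudo_label_training(seed_model, student_model, labeled_set, unlabeled_audio,
                          optimizer, loss_fn, num_epochs=1):
    # 1) The fixed seed model transcribes the untranscribed speech.
    pseudo_labeled_set = [(audio, seed_model.transcribe(audio)) for audio in unlabeled_audio]

    # 2) The student is then trained on real transcripts plus pseudo-labels.
    for _ in range(num_epochs):
        for audio, text in labeled_set + pseudo_labeled_set:
            optimizer.zero_grad()
            loss = loss_fn(student_model(audio), text)
            loss.backward()
            optimizer.step()
    return student_model
```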
no code implementations • 2 Jul 2021 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
no code implementations • 16 Jun 2021 • Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
Momentum pseudo-labeling (MPL) consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.
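The mean-teacher-style interaction is typically realized by letting the offline model track an exponential moving average of the online model's weights; a minimal PyTorch version of such a momentum update (the decay value is an assumption) could look like this:

```python
import torch

@torch.no_grad()
def momentum_update(offline_model, online_model, decay=0.999):
    """Move the offline (teacher) weights toward an exponential moving average
    of the online (student) weights, as in mean-teacher training."""
    for p_off, p_on in zip(offline_model.parameters(), online_model.parameters()):
        p_off.mul_(decay).add_(p_on, alpha=1.0 - decay)
```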
no code implementations • 19 Apr 2021 • Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux
In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention.
no code implementations • 7 Apr 2021 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
The restricted self-attention allows attention to neighboring frames of the query at high resolution, while the dilation mechanism summarizes distant information so that it can still be attended to at a lower resolution.
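The exact mechanism is not reproduced here; the mask below only sketches the access pattern implied by the sentence above: dense attention inside a local window around the query, plus sparse attention to a strided subsample of distant frames as a stand-in for the low-resolution summary (window and stride values are assumptions).

```python
import torch

def restricted_dilated_mask(num_frames: int, window: int = 4, stride: int = 8) -> torch.Tensor:
    """Boolean mask (True = may attend): full attention within +/- `window`
    frames of the query, plus every `stride`-th frame as coarse distant context."""
    q = torch.arange(num_frames).unsqueeze(1)   # query positions (column)
    k = torch.arange(num_frames).unsqueeze(0)   # key positions (row)
    local = (q - k).abs() <= window             # high-resolution neighborhood
    dilated = (k % stride) == 0                 # low-resolution distant frames
    return local | dilated

print(restricted_dilated_mask(12, window=2, stride=4).int())
```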
no code implementations • 26 Nov 2020 • Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux
The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched.
no code implementations • 29 Oct 2020 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model.
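One simple way to exploit an N-best list in this spirit (a generic sketch with assumed names, not necessarily the paper's method) is to weight the loss of each hypothesis by its normalized score from the seed model:

```python
import torch

def nbest_pseudo_label_loss(model, audio, nbest_hyps, nbest_scores, loss_fn):
    """Weight each N-best hypothesis by its softmax-normalized seed-model score,
    so that the seed model's uncertainty is reflected in the training signal."""
    weights = torch.softmax(torch.tensor(nbest_scores, dtype=torch.float), dim=0)
    losses = torch.stack([loss_fn(model(audio), hyp) for hyp in nbest_hyps])
    return (weights * losses).sum()
```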
no code implementations • 14 Feb 2020 • Leda Sari, Niko Moritz, Takaaki Hori, Jonathan Le Roux
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
no code implementations • 8 Jan 2020 • Niko Moritz, Takaaki Hori, Jonathan Le Roux
Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR).