no code implementations • 20 Nov 2017 • Chung-Cheng Chiu, Anshuman Tripathi, Katherine Chou, Chris Co, Navdeep Jaitly, Diana Jaunzeikare, Anjuli Kannan, Patrick Nguyen, Hasim Sak, Ananth Sankar, Justin Tansuwan, Nathan Wan, Yonghui Wu, Xuedong Zhang
We explored both CTC and LAS systems for building speech recognition models.
Automatic Speech Recognition (ASR) +1
no code implementations • 31 Oct 2016 • Hagen Soltau, Hank Liao, Hasim Sak
We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units.
no code implementations • 10 Mar 2016 • Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Francoise Beaufays, Carolina Parada
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone.
no code implementations • ICLR 2019 • Brendan Shillingford, Yannis Assael, Matthew W. Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett, Marie Mulville, Ben Coppin, Ben Laurie, Andrew Senior, Nando de Freitas
To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video).
Ranked #11 on Lipreading on LRS3-TED (using extra training data)
no code implementations • 17 Jun 2019 • Ke Hu, Hasim Sak, Hank Liao
In this work, we apply the domain adversarial network to encourage the shared layers of a multilingual model to learn language-invariant features.
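A common way to realize this kind of domain adversarial training is a gradient-reversal layer: the shared encoder feeds a language classifier, but the classifier's gradient is flipped before it reaches the shared layers, pushing them toward language-invariant features. Below is a minimal, framework-free sketch of the gradient-reversal idea; the function names and the `lam` scaling parameter are illustrative, not taken from the paper.

```python
import numpy as np

def grad_reverse_forward(x):
    # Forward pass: the identity, so downstream layers see the
    # shared features unchanged.
    return x

def grad_reverse_backward(grad, lam=1.0):
    # Backward pass: flip the sign of the incoming gradient (scaled
    # by lam), so minimizing the language classifier's loss
    # *maximizes* it with respect to the shared encoder.
    return -lam * grad
```

In a real system this pair would be wrapped as a custom autograd op in the training framework, sitting between the shared encoder and the adversarial language classifier.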
Automatic Speech Recognition (ASR) +2
no code implementations • 26 Feb 2020 • Erik McDermott, Hasim Sak, Ehsan Variani
The proposed approach is evaluated in cross-domain and limited-data scenarios, for which a significant amount of target domain text data is used for LM training, but only limited (or no) {audio, transcript} training data pairs are used to train the RNN-T.
Automatic Speech Recognition (ASR) +2
no code implementations • 7 Oct 2020 • Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, Hasim Sak
In this paper, we present a Transformer-Transducer model architecture and a training technique that unify streaming and non-streaming speech recognition into a single model.
no code implementations • 6 May 2021 • Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak
On LibriSpeech evaluation, self-alignment outperformed existing schemes, with 25% and 56% less delay than FastEmit and constrained alignment, respectively, at a similar word error rate.
no code implementations • 27 May 2022 • Soheil Khorram, Jaeyoung Kim, Anshuman Tripathi, Han Lu, Qian Zhang, Hasim Sak
This paper introduces the contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition.
5 code implementations • 7 Feb 2020 • Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy.
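Limiting the left context amounts to masking self-attention so each frame attends only to a bounded window of past positions, which keeps per-step decoding cost constant for streaming. A minimal sketch of such a mask, with illustrative names (not the paper's implementation):

```python
import numpy as np

def streaming_attention_mask(seq_len, left_context):
    # mask[i, j] is True when position i may attend to position j:
    # causal (no future frames) and at most `left_context` steps back.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j >= i - left_context)
```

The resulting boolean matrix would typically be applied to the attention logits (masked positions set to -inf) before the softmax in each Transformer layer.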
1 code implementation • 23 Sep 2021 • Wei Xia, Han Lu, Quan Wang, Anshuman Tripathi, Yiling Huang, Ignacio Lopez Moreno, Hasim Sak
In this paper, we present a novel speaker diarization system for streaming on-device applications.