no code implementations • 16 Jan 2025 • Takaaki Hori, Martin Kocour, Adnan Haider, Erik McDermott, Xiaodan Zhuang
We propose "delayed fusion," which applies LLM scores to ASR hypotheses with a delay during decoding and enables easier use of pre-trained LLMs in ASR tasks.
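As a rough illustration of the idea (the interface and variable names below are hypothetical, not the paper's implementation), a beam-search step can carry first-pass scores at every step and fold LLM scores in only at periodic fusion points, so the expensive LLM is queried far less often:

```python
def delayed_fusion_step(hyps, cand_scores, llm_score_fn, step, delay=4, lam=0.3):
    """One beam-search step with delayed LLM fusion (hypothetical interface).

    hyps:         list of dicts {"tokens": [...], "score": float, "pending": [...]}
    cand_scores:  cand_scores[i] maps candidate token -> first-pass log-prob
    llm_score_fn: returns the LLM log-prob of the pending tokens given the prefix
    """
    beam = []
    for i, h in enumerate(hyps):
        for tok, lp in cand_scores[i].items():
            beam.append({
                "tokens": h["tokens"] + [tok],
                "score": h["score"] + lp,         # first-pass score only, for now
                "pending": h["pending"] + [tok],  # tokens not yet seen by the LLM
            })
    # Fuse LLM scores only every `delay` steps; between fusion points the
    # hypotheses keep decoding on first-pass scores alone.
    if step % delay == 0:
        for h in beam:
            prefix = h["tokens"][: len(h["tokens"]) - len(h["pending"])]
            h["score"] += lam * llm_score_fn(prefix, h["pending"])
            h["pending"] = []
    return sorted(beam, key=lambda h: h["score"], reverse=True)[: len(hyps)]
```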
no code implementations • 1 Nov 2024 • Nikolaos Flemotomos, Roger Hsiao, Pawel Swietojanski, Takaaki Hori, Dogan Can, Xiaodan Zhuang
However, the biasing mechanism is typically a cross-attention module between the audio and a catalogue of biasing entries, so its computational complexity places severe practical limits on the size of the biasing catalogue and, consequently, on the achievable accuracy gains.
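A minimal PyTorch sketch of such a cross-attention biasing layer (shapes and names are illustrative, not the paper's code) makes the cost argument concrete: attention is computed between every audio frame and every catalogue entry, so compute grows linearly with the catalogue size.

```python
import torch
import torch.nn as nn

class BiasingCrossAttention(nn.Module):
    """Audio frames attend over a catalogue of biasing-entry embeddings.

    Illustrative only: the attention costs O(T * N) for T audio frames and
    N catalogue entries, which is why large catalogues become expensive.
    """
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, catalogue: torch.Tensor) -> torch.Tensor:
        # audio:     (B, T, d_model) encoder states
        # catalogue: (B, N, d_model) embedded biasing phrases
        biased, _ = self.attn(query=audio, key=catalogue, value=catalogue)
        return audio + biased  # residual connection keeps the unbiased path

# Usage: 8 s of 100 Hz frames against a 500-entry catalogue.
layer = BiasingCrossAttention(d_model=256)
out = layer(torch.randn(1, 800, 256), torch.randn(1, 500, 256))
```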
no code implementations • 23 Aug 2024 • Adnan Haider, Xingyu Na, Erik McDermott, Tim Ng, Zhen Huang, Xiaodan Zhuang
This paper introduces a novel training framework called Focused Discriminative Training (FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models trained with either a CTC loss or an interpolation of CTC and attention-based encoder-decoder (AED) losses.
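FDT itself is not specified in this snippet, but the interpolated CTC/AED baseline objective it builds on can be sketched as follows (variable names and the interpolation weight are assumptions; padding handling is omitted for brevity):

```python
import torch.nn.functional as F

def ctc_aed_loss(log_probs, dec_logits, targets, input_lens, target_lens, alpha=0.3):
    """Interpolated loss: alpha * CTC + (1 - alpha) * AED cross-entropy.

    log_probs:  (T, B, V) log-softmax outputs of the CTC head
    dec_logits: (B, U, V) decoder logits of the AED branch
    targets:    (B, U) gold token ids (blank id 0 assumed for CTC)
    """
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    aed = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return alpha * ctc + (1 - alpha) * aed
```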
no code implementations • 14 Jun 2024 • Roger Hsiao, Liuhui Deng, Erik McDermott, Ruchir Travadi, Xiaodan Zhuang
Byte-level representations are often used by large-scale multilingual ASR systems when the combined character set of the supported languages is large.
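The core appeal of a byte-level representation is that the output inventory is fixed at 256 symbols regardless of script, as this minimal sketch shows:

```python
def text_to_bytes(text: str) -> list[int]:
    """Map any string to UTF-8 byte ids: the output vocabulary is always
    256 symbols, however many scripts the multilingual system supports."""
    return list(text.encode("utf-8"))

def bytes_to_text(ids: list[int]) -> str:
    # errors="replace" guards against invalid byte sequences a model may emit
    return bytes(ids).decode("utf-8", errors="replace")

assert bytes_to_text(text_to_bytes("语音識別 ASR")) == "语音識別 ASR"
```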
no code implementations • 18 Apr 2023 • Maurits Bleeker, Pawel Swietojanski, Stefan Braun, Xiaodan Zhuang
By including approximate nearest neighbour phrases (ANN-P) in the context list, we encourage the learned representation to disambiguate between similar, but not identical, biasing phrases.
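A simplified sketch of mining such neighbour phrases (brute-force cosine similarity standing in for a real approximate-nearest-neighbour index; names are illustrative):

```python
import numpy as np

def ann_phrases(target_vec, phrase_vecs, phrases, k=5):
    """Return the k catalogue phrases most similar to the target phrase, to
    serve as hard negatives in the biasing context list (assumes the target
    itself is not in `phrases`)."""
    sims = phrase_vecs @ target_vec / (
        np.linalg.norm(phrase_vecs, axis=1) * np.linalg.norm(target_vec) + 1e-9
    )
    order = np.argsort(-sims)          # most similar first
    return [phrases[i] for i in order[:k]]
```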
no code implementations • 2 Nov 2022 • Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang
This work studies the use of attention masking in transformer-transducer-based speech recognition to build a single configurable model for different deployment scenarios.
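The masking idea can be sketched in a few lines (parameters are illustrative): one set of weights, with the left/right attention context chosen per deployment scenario at run time.

```python
import torch

def context_mask(T: int, left: int, right: int) -> torch.Tensor:
    """Boolean self-attention mask: frame t may attend to [t-left, t+right].

    right=0 yields a streaming model; large left/right approach an offline
    model, so one trained model serves several deployment scenarios.
    """
    idx = torch.arange(T)
    rel = idx[None, :] - idx[:, None]       # rel[t, s] = s - t
    return (rel >= -left) & (rel <= right)  # True = attention allowed

streaming_mask = context_mask(8, left=4, right=0)
offline_mask = context_mask(8, left=8, right=8)
```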
no code implementations • 27 Aug 2021 • Zhen Huang, Xiaodan Zhuang, Daben Liu, Xiaoqiang Xiao, Yuchen Zhang, Sabato Marco Siniscalchi
To achieve such an ambitious goal, new mechanisms for foreign pronunciation generation and language model (LM) enrichment have been devised.
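One common recipe for foreign pronunciation generation, shown purely as an illustration of the general idea rather than as the paper's mechanism, maps source-language phones onto the closest phones in the target inventory and adds the resulting entries to the lexicon so the enriched LM and decoder can hypothesize them:

```python
# Hypothetical phone-mapping table from a source inventory to the target one.
PHONE_MAP = {"aa": "a", "ih": "i", "zh": "j"}

def foreign_pron(source_phones: list[str]) -> list[str]:
    """Map each source phone to its nearest target-inventory phone."""
    return [PHONE_MAP.get(p, p) for p in source_phones]

# New lexicon entry for a foreign word, built from its source pronunciation.
lexicon = {"croissant": foreign_pron(["k", "r", "w", "aa", "s", "aa", "n"])}
```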
no code implementations • 7 Dec 2020 • Xinwei Li, Yuanyuan Zhang, Xiaodan Zhuang, Daben Liu
We demonstrate that f-SpecAugment is more effective than utterance-level SpecAugment for deep CNN-based hybrid models.
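For reference, the utterance-level SpecAugment baseline mentioned here masks random frequency and time bands across the whole utterance; a minimal NumPy sketch (mask counts and sizes are illustrative):

```python
import numpy as np

def spec_augment(feats: np.ndarray, n_freq_masks=2, F=8, n_time_masks=2, T=20):
    """Utterance-level SpecAugment: zero random frequency and time bands.

    feats: (frames, freq_bins) log-mel features; returns a masked copy.
    """
    out = feats.copy()
    frames, bins = out.shape
    rng = np.random.default_rng()
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)               # mask width in bins
        f0 = rng.integers(0, max(bins - f, 1))   # mask start
        out[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)               # mask width in frames
        t0 = rng.integers(0, max(frames - t, 1))
        out[t0:t0 + t, :] = 0.0
    return out
```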
no code implementations • 4 Oct 2019 • Zhen Huang, Tim Ng, Leo Liu, Henry Mason, Xiaodan Zhuang, Daben Liu
The most popular way to train very deep CNNs is to use shortcut connections (SC) together with batch normalization (BN).
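The SC + BN recipe is the standard residual block; a compact PyTorch sketch:

```python
import torch
import torch.nn as nn

class SCBlock(nn.Module):
    """Conv block with a shortcut connection (SC) and batch normalization
    (BN), the standard recipe for training very deep CNNs."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + x)  # the shortcut lets gradients bypass the convs
```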
no code implementations • CVPR 2014 • Shuang Wu, Sravanthi Bondugula, Florian Luisier, Xiaodan Zhuang, Pradeep Natarajan
Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples.