no code implementations • 20 Feb 2024 • Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe
Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).
Automatic Speech Recognition (ASR)
no code implementations • 30 Jan 2024 • Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
In this work, we aim to improve the performance and efficiency of OWSM without extra training data.
no code implementations • 19 Jan 2024 • Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe
The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data.
Automatic Speech Recognition (ASR)
1 code implementation • 25 Sep 2023 • Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe
Pre-training speech models on large volumes of data has achieved remarkable success.
no code implementations • 29 May 2023 • Yui Sudo, Kazuya Hata, Kazuhiro Nakadai
End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but a specific issue that needs to be addressed is the difficulty it has in handling enharmonic words: named entities (NEs) with the same pronunciation and part of speech that are spelled differently.
1 code implementation • 28 May 2023 • Yifan Peng, Yui Sudo, Shakeel Muhammad, Shinji Watanabe
Knowledge distillation trains a small student model to mimic the behavior of a large teacher model.
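As a rough illustration of the student-mimics-teacher idea, the sketch below computes a generic temperature-softened KL divergence between teacher and student output distributions. This is one common textbook form of distillation and is not necessarily the objective used in the paper listed here; the names `softmax`, `distillation_loss`, and `temperature` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature (assumed > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Hypothetical helper: a generic logit-matching loss, shown only to make
    the "mimic the teacher" idea concrete.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.5, 0.7, -0.5]
    print(distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pushes the student toward the teacher's behavior.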
no code implementations • 21 Dec 2022 • Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe
The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models.
Automatic Speech Recognition (ASR)
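For readers unfamiliar with connectionist temporal classification (CTC), which several entries in this list build on, the minimal sketch below shows the standard CTC collapsing rule applied to a per-frame label sequence: merge consecutive repeats, then drop blanks. The function name `ctc_collapse` and the choice of `0` as the blank index are assumptions made for illustration, not details taken from any of the listed papers.

```python
def ctc_collapse(frame_labels, blank=0):
    """Collapse a per-frame label sequence into an output sequence.

    Consecutive repeated labels are merged, and blank labels are removed.
    A blank between two identical labels keeps them as separate outputs.
    """
    output = []
    previous = None
    for label in frame_labels:
        # Emit only when the label changes and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


if __name__ == "__main__":
    # Frames: blank, 1, 1, blank, 1, 2, 2, blank
    print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # prints [1, 1, 2]
```

Greedy CTC decoding typically takes the most likely label at each frame and then applies this rule, which is part of why encoder-only CTC models can decode without an autoregressive loop.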