Search Results for author: Hirofumi Inaguma

Found 21 papers, 6 papers with code

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

no code implementations11 Oct 2021 Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

Non-autoregressive (NAR) models generate multiple outputs in a sequence simultaneously, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive baselines.

Automatic Speech Recognition, Speech Recognition +2
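
Since several entries on this page contrast AR and NAR decoding, here is a minimal, hypothetical sketch of the difference: the AR loop emits one token per forward pass, while the NAR pass predicts every position at once. All model components are random stand-ins, not anything from the paper.

```python
# Toy contrast between autoregressive (AR) and non-autoregressive (NAR)
# decoding. Model internals are stand-ins; only the generation loop differs.
import torch

vocab, max_len, hidden = 32, 8, 16
enc = torch.randn(1, 10, hidden)          # fake encoder output (B, T, H)
proj = torch.nn.Linear(hidden, vocab)

def ar_decode():
    """AR: one token per step; each step depends on the previous outputs."""
    tokens = []
    state = enc.mean(dim=1)                # crude running summary
    for _ in range(max_len):
        logits = proj(state)
        tok = logits.argmax(-1)            # step t waits for step t-1
        tokens.append(tok.item())
        state = state + 0.1 * torch.randn_like(state)  # stand-in state update
    return tokens

def nar_decode():
    """NAR: all positions predicted in one parallel pass."""
    frames = enc[:, :max_len, :]           # one hidden vector per position
    return proj(frames).argmax(-1).squeeze(0).tolist()

print("AR :", ar_decode())   # max_len sequential forward passes
print("NAR:", nar_decode())  # a single forward pass
```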

ASR Rescoring and Confidence Estimation with ELECTRA

no code implementations5 Oct 2021 Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

We propose an ASR rescoring method that directly detects errors with ELECTRA, originally a pre-training method for NLP tasks.

Automatic Speech Recognition, Fine-tuning +2
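
As a rough illustration of rescoring with a replaced-token-detection head, the sketch below scores N-best hypotheses with an off-the-shelf ELECTRA discriminator from the HuggingFace transformers package. This is an assumption-laden stand-in, not the authors' fine-tuned model or training setup.

```python
# Rescoring N-best ASR hypotheses with an ELECTRA-style discriminator:
# the replaced-token-detection head scores how "corrupted" each token looks,
# so hypotheses whose tokens all look genuine score better.
# Off-the-shelf checkpoint, NOT the authors' fine-tuned model.
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

name = "google/electra-small-discriminator"
tok = ElectraTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name).eval()

def fakeness(hyp: str) -> float:
    """Mean 'this token was replaced' logit; lower = more natural text."""
    batch = tok(hyp, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits      # (1, seq_len), >0 means 'fake'
    return logits.mean().item()

nbest = ["the cat sat on the mat", "the cat sad on the mat"]
best = min(nbest, key=fakeness)             # rescored 1-best
print(best)
```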

Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates

no code implementations27 Sep 2021 Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, Shinji Watanabe

We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.

Automatic Speech Recognition, Language Modelling +3
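
The NAR generation of hidden intermediates rests on CTC decoding. Below is a minimal sketch of CTC greedy decoding, the parallel per-frame argmax followed by collapsing repeats and removing blanks; shapes and the blank index are invented.

```python
# The NAR step in a multi-decoder pipeline, sketched as CTC greedy decoding:
# take per-frame CTC posteriors, collapse repeats, drop blanks. All frames
# are decided in parallel, with no autoregressive ASR beam search.
import torch

BLANK = 0
frames = torch.randn(50, 100)                 # (T, vocab) fake CTC logits

def ctc_greedy(logits: torch.Tensor) -> list:
    ids = logits.argmax(-1).tolist()          # best label per frame, in parallel
    out, prev = [], None
    for i in ids:
        if i != prev and i != BLANK:          # collapse repeats, strip blanks
            out.append(i)
        prev = i
    return out

hidden_intermediate_ids = ctc_greedy(frames)  # fed to the downstream decoder
print(hidden_intermediate_ids[:10])
```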

Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

no code implementations9 Sep 2021 Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.

Language Modelling, Translation
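
A toy rendering of the dual-decoder idea, under invented shapes and stand-in modules: the NAR decoder proposes candidates of several lengths in one pass each, and the shallow AR decoder is used only for parallel scoring via teacher forcing, which is far cheaper than AR generation.

```python
# Orthros-style dual decoding, as a toy: a NAR decoder proposes candidates
# of several lengths in parallel, and a small AR decoder only *scores* them.
import torch, torch.nn as nn

V, H = 50, 32
enc_out = torch.randn(1, 20, H)                  # shared encoder states
nar_head = nn.Linear(H, V)                       # stand-in NAR decoder
ar_scorer = nn.Sequential(nn.Embedding(V, H), nn.Linear(H, V))  # stand-in AR decoder

def nar_propose(length: int) -> torch.Tensor:
    return nar_head(enc_out[:, :length, :]).argmax(-1)   # (1, length)

def ar_score(cand: torch.Tensor) -> float:
    """Sum of log-probs of each token given its prefix (teacher forcing)."""
    logits = ar_scorer(cand[:, :-1])                      # one parallel pass
    logp = torch.log_softmax(logits, -1)
    return logp.gather(-1, cand[:, 1:, None]).sum().item()

cands = [nar_propose(L) for L in (6, 8, 10)]              # length candidates
best = max(cands, key=ar_score)
print(best.tolist())
```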

VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

no code implementations15 Jul 2021 Hirofumi Inaguma, Tatsuya Kawahara

In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective.

Action Detection, Activity Detection +2
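
One way to picture VAD-free operation, sketched under made-up thresholds (this is not the paper's exact algorithm): runs of CTC blank frames stand in for a separate voice activity detector when carving an unsegmented stream into utterances.

```python
# A hedged sketch of VAD-free streaming: consume an unsegmented stream in
# frame order and use runs of CTC blank frames (instead of a separate VAD)
# as a cue to finalize the current segment. Thresholds are made up.
import torch

BLANK, BLANK_RUN = 0, 20                       # frames of blank ending a segment
stream = torch.randn(1000, 30)                 # fake per-frame CTC logits

segment, blanks, segments = [], 0, []
for frame in stream:                           # frames arrive one at a time
    label = int(frame.argmax())
    if label == BLANK:
        blanks += 1
        if blanks >= BLANK_RUN and segment:    # long silence-like blank run
            segments.append(segment)           # emit and reset, no VAD used
            segment = []
    else:
        blanks = 0
        if not segment or segment[-1] != label:
            segment.append(label)              # CTC-style collapse
print(f"{len(segments)} segments from one unsegmented recording")
```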

Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

no code implementations28 Feb 2021 Hirofumi Inaguma, Tatsuya Kawahara

We compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that CTC-ST achieves a comparable accuracy-latency tradeoff without relying on external alignment information.

Automatic Speech Recognition, Knowledge Distillation +1
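
A toy form of distilling alignment knowledge, with an invented loss (not the paper's exact formulation): pull the student's expected attention boundary for each token toward teacher boundary frames, e.g. from a hybrid system's forced alignment.

```python
# Toy alignment knowledge distillation: penalize the distance between the
# student's expected attention boundary per token and the teacher's boundary
# frames. The loss form is illustrative, not the paper's exact one.
import torch

T, U = 100, 8                                   # frames, output tokens
alpha = torch.softmax(torch.randn(U, T), -1)    # student boundary probs per token
teacher = torch.tensor([5., 17., 30., 42., 55., 68., 80., 92.])  # made-up frames

frames = torch.arange(T, dtype=torch.float)
expected = (alpha * frames).sum(-1)             # expected boundary frame per token
loss = (expected - teacher).abs().mean()        # pull student toward teacher
print(float(loss))
```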

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

no code implementations26 Oct 2020 Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems.

Automatic Speech Recognition, End-To-End Speech Recognition +2
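
For context on the speed/accuracy trade-off mentioned above, here is a toy version of the Mask-CTC inference loop: mask low-confidence CTC tokens, then refill them in parallel with a masked-prediction decoder. The decoder is a random stand-in; the threshold and vocabulary are invented.

```python
# Mask-CTC-style inference, toy version: take the CTC greedy output, mask
# unreliable tokens, and let a masked-prediction decoder refill them in
# parallel over a few refinement passes.
import torch, torch.nn as nn

V, MASK, THRESH = 40, 39, 0.5
ctc_probs = torch.softmax(torch.randn(12, V), -1)   # per-position CTC posteriors
mlm = nn.Sequential(nn.Embedding(V, 16), nn.Linear(16, V))  # stand-in MLM decoder

conf, tokens = ctc_probs.max(-1)
tokens[conf < THRESH] = MASK                        # mask low-confidence positions

for _ in range(2):                                  # a few refinement passes
    masked = tokens == MASK
    if not masked.any():
        break
    logits = mlm(tokens.unsqueeze(0)).squeeze(0)    # predict all positions at once
    tokens[masked] = logits.argmax(-1)[masked]      # refill only the masked slots
print(tokens.tolist())
```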

Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

no code implementations25 Oct 2020 Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems.

Translation

Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

1 code implementation9 Aug 2020 Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

Experimental evaluations show that our method significantly improves the ASR performance from the seq2seq baseline on the Corpus of Spontaneous Japanese (CSJ).

Automatic Speech Recognition, Knowledge Distillation +2

Enhancing Monotonic Multihead Attention for Streaming ASR

1 code implementation19 May 2020 Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

For streaming inference, all monotonic attention (MA) heads should learn proper alignments because the next token is not generated until all heads detect the corresponding token boundaries.

Automatic Speech Recognition, Boundary Detection +1
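
The emission rule in the snippet above can be made concrete with a toy example: under hard per-head boundary decisions at test time, a token's emission frame is set by the slowest monotonic attention head. Probabilities here are random placeholders.

```python
# Toy monotonic multihead attention emission: a token is only emitted once
# *every* MA head has detected its boundary, so the slowest head sets the
# emission latency. Boundary probabilities are random placeholders.
import torch

heads, T = 4, 50
p_fire = torch.sigmoid(torch.randn(heads, T))        # per-head boundary probs

fired = p_fire > 0.5                                  # hard decisions at test time
first_fire = fired.float().argmax(dim=1)              # first firing frame per head
emit_frame = int(first_fire.max())                    # wait for the slowest head
print("per-head boundaries:", first_fire.tolist())    # (toy: 0 if a head never fires)
print("token emitted at frame", emit_frame)
```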

CTC-synchronous Training for Monotonic Attention Model

1 code implementation10 May 2020 Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

Monotonic chunkwise attention (MoChA) has been studied for online streaming automatic speech recognition (ASR) based on a sequence-to-sequence framework.

Automatic Speech Recognition, Speech Recognition

End-to-end speech-to-dialog-act recognition

no code implementations23 Apr 2020 Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara

In the proposed model, the dialog act recognition network is combined with an acoustic-to-word ASR model at its latent layer, before the softmax layer, which provides a distributed representation of word-level ASR decoding information.

Automatic Speech Recognition, Language Understanding +2
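
A stand-in sketch of the architecture described above, with invented sizes: the dialog-act branch reads the decoder's latent states from just before the word softmax.

```python
# The joint model, as a toy: the dialog-act classifier consumes the ASR
# decoder's hidden states from the layer *before* the word softmax, i.e.,
# a distributed representation of the decoding. Sizes are invented.
import torch, torch.nn as nn

H, V_words, N_acts = 64, 1000, 10
dec_hidden = torch.randn(1, 15, H)         # ASR decoder states (B, U, H)

word_softmax = nn.Linear(H, V_words)       # acoustic-to-word output layer
da_head = nn.Linear(H, N_acts)             # dialog-act branch off the latent layer

word_ids = word_softmax(dec_hidden).argmax(-1)          # ASR hypothesis
da_logits = da_head(dec_hidden.mean(dim=1))             # utterance-level pooling
print("words:", word_ids.shape, "dialog act:", int(da_logits.argmax(-1)))
```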

Multilingual End-to-End Speech Translation

1 code implementation1 Oct 2019 Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture.

Automatic Speech Recognition, Machine Translation +3
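
One standard way to steer a single universal seq2seq model toward a desired target language, commonly used in multilingual NMT and ST (the vocabulary layout below is an assumption, not the paper's exact setup), is to prime the decoder with a target-language tag:

```python
# Making one seq2seq model multilingual: prepend a target-language tag to the
# decoder's start-of-sequence, so one set of parameters serves every direction.
import torch, torch.nn as nn

vocab = {"<sos>": 0, "<2en>": 1, "<2de>": 2, "<2fr>": 3}  # invented tag layout
H = 32
embed = nn.Embedding(len(vocab) + 100, H)   # shared subword + tag embeddings

def decoder_prefix(target_lang: str) -> torch.Tensor:
    """Decoder is primed with <sos> <2xx>; the rest of decoding is unchanged."""
    ids = [vocab["<sos>"], vocab[f"<2{target_lang}>"]]
    return embed(torch.tensor([ids]))       # (1, 2, H) initial decoder input

print(decoder_prefix("de").shape)           # same model, any target language
```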

Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR

no code implementations22 Sep 2019 Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

Moreover, the A2C model can be used to recover out-of-vocabulary (OOV) words that are not covered by the A2W model, but this requires accurate detection of OOV words.

Automatic Speech Recognition, Speech Recognition
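
The A2W/A2C interplay can be pictured with a toy back-off: when the word-level hypothesis contains the OOV symbol, splice in a character-level decode for that slot. Both "models" below are lookup stand-ins; only the recovery logic is the point.

```python
# Toy OOV resolution: the acoustic-to-word (A2W) model emits an OOV symbol,
# and an acoustic-to-character (A2C) hypothesis fills in that position.
OOV = "<oov>"
a2w_hyp = ["we", "visited", OOV, "yesterday"]        # word-level output
a2c_spans = {2: "kinkakuji"}                         # chars decoded for OOV slots

resolved = [a2c_spans.get(i, w) if w == OOV else w   # splice characters back in
            for i, w in enumerate(a2w_hyp)]
print(" ".join(resolved))                            # we visited kinkakuji yesterday
```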

Transfer learning of language-independent end-to-end ASR with language model fusion

no code implementations6 Nov 2018 Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, Shinji Watanabe

This work explores better adaptation methods to low-resource languages using an external language model (LM) under the framework of transfer learning.

End-To-End Speech Recognition, Language Modelling +1
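
The simplest instance of external LM fusion is shallow fusion, sketched below with random placeholder distributions: interpolate the seq2seq model's log-probs with the LM's log-probs at each decoding step (the weight λ is made up).

```python
# Shallow fusion: at each decoding step, combine the seq2seq model's
# next-token log-probs with an external LM's log-probs.
import torch

V, LAMBDA = 100, 0.3
asr_logp = torch.log_softmax(torch.randn(V), -1)   # seq2seq next-token log-probs
lm_logp = torch.log_softmax(torch.randn(V), -1)    # external LM log-probs

fused = asr_logp + LAMBDA * lm_logp                # per-step fused score
next_token = int(fused.argmax())                   # would feed the next beam step
print(next_token)
```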
