no code implementations • 16 Sep 2023 • Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe
Because the decoder architecture is the same as that of an autoregressive LM, the model can easily be enhanced by leveraging external text data through LM training (a minimal sketch of this idea follows below).
Automatic Speech Recognition (ASR)
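Since the listing only hints at how the text data is used, the following is a minimal, hypothetical sketch of training an encoder-decoder's decoder as an autoregressive LM on text-only batches with a next-token prediction loss; the `decoder` interface (in particular, running it without encoder states) and all hyperparameters are assumptions, not the paper's exact recipe.

```python
import torch.nn.functional as F

def lm_training_step(decoder, text_batch, optimizer, pad_id=0):
    """One next-token prediction step on text-only data.

    `decoder` is assumed to be the ASR decoder run in LM mode,
    i.e. without (or with zeroed) encoder cross-attention states.
    text_batch: LongTensor of token ids, shape (batch, length).
    """
    inputs, targets = text_batch[:, :-1], text_batch[:, 1:]
    logits = decoder(inputs, encoder_out=None)   # hypothetical LM-mode call
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```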
no code implementations • 24 Jul 2023 • Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe
Although frame-based models such as CTC and transducers are well suited to streaming automatic speech recognition, their decoding uses no future knowledge, which can lead to incorrect pruning.
no code implementations • 20 Jul 2023 • Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe
There has been increasing interest in integrating pretrained automatic speech recognition (ASR) models and language models (LMs) into the SLU framework.
no code implementations • 2 Jun 2023 • Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe
We reduce the model size by applying tensor decomposition to the Conformer and E-Branchformer architectures used in our E2E SLU models.
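As a rough illustration of compression by decomposition (not necessarily the decomposition used in the paper), the sketch below replaces one linear layer of a Conformer/E-Branchformer block with a truncated-SVD low-rank pair of smaller layers; the rank and layer choice are illustrative.

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace Linear(d_in, d_out) with Linear(d_in, rank) -> Linear(rank, d_out),
    initialized from a truncated SVD of the original weight matrix."""
    W = linear.weight.data                               # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = torch.diag(S_r) @ Vh_r           # (rank, d_in)
    second.weight.data = U_r                             # (d_out, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)

# Parameters drop from d_in * d_out to rank * (d_in + d_out) when rank is small.
```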
no code implementations • 2 May 2023 • Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe
Recently, there have been efforts to introduce new benchmark tasks for spoken language understanding (SLU), such as semantic parsing.
Automatic Speech Recognition (ASR)
no code implementations • 2 May 2023 • Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe
In this track, we adopt a pipeline approach that combines ASR and NLU.
no code implementations • 1 May 2023 • Siddhant Arora, Hayato Futami, Emiru Tsunoo, Brian Yan, Shinji Watanabe
Most human interactions occur in the form of spoken conversations where the semantic meaning of a given utterance depends on the context.
1 code implementation • 16 Nov 2022 • Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe
In this study, we propose Transformer-based encoder-decoder models that jointly perform speech recognition and disfluency detection and operate in a streaming manner.
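One common way to pose recognition and disfluency detection as a single sequence task is to mark disfluent spans with special tags in the target token sequence, so that one encoder-decoder model emits both jointly; the tag names below are illustrative, not necessarily those used in the paper.

```python
# Build a tag-augmented target sequence from a transcript and per-word
# disfluency labels (1 = disfluent), wrapping contiguous disfluent spans.
words = ["i", "uh", "i", "want", "want", "a", "ticket"]
disfl = [1, 1, 1, 1, 0, 0, 0]

target, in_span = [], False
for w, d in zip(words, disfl):
    if d and not in_span:
        target.append("<disfl>")
        in_span = True
    elif not d and in_span:
        target.append("</disfl>")
        in_span = False
    target.append(w)
if in_span:
    target.append("</disfl>")

print(" ".join(target))
# -> <disfl> i uh i want </disfl> want a ticket
```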
no code implementations • 15 Jun 2022 • Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Narisetty, Shinji Watanabe
In this paper, we propose a simple external LM fusion method for domain adaptation that takes the internal LM estimate into account during training (a generic fusion sketch is given below).
Automatic Speech Recognition (ASR)
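Because the entry only names the idea, here is a hedged sketch of the generic internal-LM-estimation style of shallow fusion at decoding time, in which the external LM score is added and an estimate of the E2E model's internal LM is subtracted; the weights and the way the internal LM score is obtained (e.g., running the decoder with zeroed encoder context) are assumptions, and the paper's specific contribution concerns handling this estimate during training.

```python
def fused_score(log_p_e2e, log_p_ext_lm, log_p_ilm,
                ext_weight=0.5, ilm_weight=0.3):
    """Score of one candidate token under shallow fusion with ILM subtraction.

    log_p_e2e    : log-prob from the E2E ASR model
    log_p_ext_lm : log-prob from the external (target-domain) LM
    log_p_ilm    : estimated log-prob of the E2E model's internal LM,
                   e.g. from the decoder run without encoder context
    """
    return log_p_e2e + ext_weight * log_p_ext_lm - ilm_weight * log_p_ilm
```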
no code implementations • 3 Feb 2022 • Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe
A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions.
no code implementations • 25 Jan 2022 • Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe
To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination.
Automatic Speech Recognition (ASR)
no code implementations • 24 Jan 2022 • Rem Hida, Masaki Hamada, Chie Kamada, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura
Although end-to-end text-to-speech (TTS) models can generate natural speech, challenges remain in estimating sentence-level phonetic and prosodic information from raw text in Japanese TTS systems.
2 code implementations • 14 Oct 2021 • Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, Yuki Mitsufuji
The multi-ACCDOA format (a class- and track-wise output format) enables the model to handle cases where overlapping events belong to the same class.
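To make the output format concrete, the following is a rough decoding sketch assuming the usual ACCDOA convention, in which each (track, class) slot is a 3-D Cartesian vector whose norm encodes event activity and whose direction encodes the DOA, so same-class overlaps land on different tracks; the tensor shape and threshold are illustrative.

```python
import numpy as np

def decode_multi_accdoa(output, threshold=0.5):
    """output: array of shape (frames, tracks, classes, 3).

    Returns (frame, track, class, unit_doa_vector) tuples for active events:
    an event is active when its vector norm exceeds the threshold, and its
    direction of arrival is the normalized vector.
    """
    events = []
    norms = np.linalg.norm(output, axis=-1)          # (frames, tracks, classes)
    for f, t, c in zip(*np.where(norms > threshold)):
        events.append((f, t, c, output[f, t, c] / norms[f, t, c]))
    return events
```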
no code implementations • 21 Jun 2021 • Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji
This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference.
no code implementations • 7 Jun 2021 • Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe
Although end-to-end automatic speech recognition (E2E ASR) has achieved strong performance on tasks with abundant paired data, it remains challenging to make E2E ASR robust to noisy and low-resource conditions.
Automatic Speech Recognition (ASR)
no code implementations • 18 Feb 2021 • Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability.
Automatic Speech Recognition (ASR)
no code implementations • 25 Jun 2020 • Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe
In this paper, we extend block processing towards an entire streaming E2E ASR system without additional training, by introducing a blockwise synchronous decoding process inspired by a neural transducer into the Transformer decoder.
Automatic Speech Recognition (ASR)
no code implementations • 25 Oct 2019 • Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe
In this paper, we extend it towards an entire online E2E ASR system by introducing an online decoding process inspired by monotonic chunkwise attention (MoChA) into the Transformer decoder (a generic MoChA sketch follows below).
Automatic Speech Recognition (ASR)
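As background for the MoChA-inspired decoder, the sketch below shows generic test-time monotonic chunkwise attention: a hard monotonic head scans encoder frames left to right from the previously selected position and, once a frame is selected, soft attention is computed over a fixed-width chunk ending at that frame. The energy functions, chunk size, and greedy selection rule are placeholders, not the paper's exact streaming decoder.

```python
import numpy as np

def mocha_attend(enc, query, prev_pos, select_energy, chunk_energy, chunk_size=4):
    """One generic MoChA inference step (greedy hard monotonic selection).

    enc           : (T, d) encoder states seen so far
    query         : (d,) decoder query for the current output step
    prev_pos      : frame selected at the previous output step
    select_energy : fn(enc_frame, query) -> scalar selection energy
    chunk_energy  : fn(enc_frame, query) -> scalar chunk-attention energy
    """
    pos = None
    for t in range(prev_pos, enc.shape[0]):
        p_select = 1.0 / (1.0 + np.exp(-select_energy(enc[t], query)))
        if p_select >= 0.5:                      # hard selection at test time
            pos = t
            break
    if pos is None:                              # nothing selected: wait for more frames
        return None, prev_pos

    start = max(0, pos - chunk_size + 1)
    energies = np.array([chunk_energy(enc[t], query) for t in range(start, pos + 1)])
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()
    context = weights @ enc[start:pos + 1]       # chunkwise soft attention context
    return context, pos
```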
no code implementations • 16 Oct 2019 • Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe
In this paper, we propose a new block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism (sketched loosely below).
Automatic Speech Recognition (ASR)
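Only the mechanism's name appears in the listing, so the following is a loose, hypothetical sketch of blockwise encoding with an inherited context embedding: each block is encoded together with a context vector, and the output at the context position is handed to the next block so information flows forward without full-utterance attention; the module interface and block size are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

def blockwise_encode(layer: nn.TransformerEncoderLayer, frames, ctx, block_size=16):
    """frames: (T, d) input features; ctx: (1, d) context embedding.

    Each block is encoded jointly with the current context embedding; the
    output at the context position becomes the context for the next block.
    """
    outputs = []
    for start in range(0, frames.size(0), block_size):
        block = frames[start:start + block_size]           # (B_t, d)
        x = torch.cat([block, ctx], dim=0).unsqueeze(1)     # (B_t + 1, 1, d)
        y = layer(x).squeeze(1)                             # attention within block + context
        outputs.append(y[:-1])                              # encoded frames of this block
        ctx = y[-1:]                                        # inherit context to the next block
    return torch.cat(outputs, dim=0), ctx
```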
no code implementations • 17 May 2019 • Emiru Tsunoo, Yosuke Kashiwagi, Satoshi Asakawa, Toshiyuki Kumakura
We convert a pretrained WFST into a trainable neural network and adapt the system to target environments and vocabularies through E2E joint training with an acoustic model (AM).