Streaming Transformer ASR with Blockwise Synchronous Inference

25 Jun 2020 · Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has the drawback that the entire input sequence is required to compute self-attention. We have previously proposed a block processing method for the Transformer encoder that introduces a context-aware inheritance mechanism: an additional context embedding vector handed over from the previously processed block helps encode not only local acoustic information but also global linguistic, channel, and speaker attributes. In this paper, we extend block processing toward an entire streaming E2E ASR system without additional training, by introducing a blockwise synchronous decoding process, inspired by the neural transducer, into the Transformer decoder. We further apply a knowledge distillation technique in which training of the streaming Transformer is guided by an ordinary batch Transformer model. Evaluations on the HKUST and AISHELL-1 Mandarin tasks and the LibriSpeech English task show that our proposed streaming Transformer outperforms conventional online approaches, including monotonic chunkwise attention (MoChA). We also confirm that the knowledge distillation technique further improves accuracy. Our streaming ASR models achieve performance comparable or superior to the batch models and to other streaming-based Transformer methods on all the tasks.
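To illustrate the context-aware inheritance mechanism described above, here is a minimal sketch (not the authors' implementation) of blockwise encoding in PyTorch: each fixed-size block of acoustic frames is encoded together with a context embedding inherited from the previous block, and the last position of each encoded block is handed over as the context for the next block. Names such as `BlockwiseEncoder`, `block_size`, and `ctx` are illustrative assumptions, and the single `TransformerEncoderLayer` stands in for the full encoder stack.

```python
# Minimal sketch of context-aware block processing (illustrative only).
import torch
import torch.nn as nn

class BlockwiseEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, block_size=16):
        super().__init__()
        self.block_size = block_size
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        # Learnable initial context embedding for the first block (assumption).
        self.init_ctx = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x):
        # x: (batch, time, d_model) acoustic features, already subsampled.
        batch = x.size(0)
        ctx = self.init_ctx.expand(batch, 1, -1)   # inherited context vector
        outputs = []
        for start in range(0, x.size(1), self.block_size):
            block = x[:, start:start + self.block_size]
            # Encode the current block together with the inherited context.
            h = self.layer(torch.cat([block, ctx], dim=1))
            outputs.append(h[:, :block.size(1)])   # encoded acoustic frames
            ctx = h[:, -1:]                        # context for the next block
        return torch.cat(outputs, dim=1)

enc = BlockwiseEncoder()
feats = torch.randn(2, 100, 256)    # dummy batch of feature frames
print(enc(feats).shape)             # torch.Size([2, 100, 256])
```

In this sketch the decoder is omitted; in the paper's full system, decoding proceeds blockwise and synchronously with the encoder output, so hypotheses can be emitted as each block is encoded.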
