The SpeechTransformer for Large-scale Mandarin Chinese Speech Recognition

Attention-based sequence-to-sequence architectures have made great progress on the speech recognition task. The SpeechTransformer, a recurrence-free encoder-decoder architecture, has shown promising results on small-scale speech recognition datasets in previous works. In this paper, we focus on a large-scale Mandarin Chinese speech recognition task and propose three optimization strategies to further improve the performance and efficiency of the SpeechTransformer. Our first improvement is to use a much lower frame rate, which proves beneficial not only to computational efficiency but also to model performance. The other two strategies are scheduled sampling and focal loss, both of which are very effective in reducing the character error rate (CER). On an 8,000-hour task, the proposed improvements yield a 10.8%-26.1% relative improvement in CER on four different test sets. Compared to a strong hybrid TDNN-LSTM system, which is trained with the LF-MMI criterion and decoded with a large 4-gram LM, the final optimized SpeechTransformer gives a 12.2%-19.1% relative CER reduction without any explicit language model.
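As a rough illustration of two of the strategies described above, the minimal PyTorch sketch below shows (a) lowering the input frame rate by stacking and skipping consecutive filterbank frames and (b) a character-level focal loss in place of plain cross-entropy. The function names, the stack/skip factors, the focusing parameter gamma, and the padding index are illustrative assumptions, not the exact settings used in the paper.

```python
import torch
import torch.nn.functional as F


def low_frame_rate(feats: torch.Tensor, stack: int = 4, skip: int = 3) -> torch.Tensor:
    """Stack `stack` consecutive frames and keep every `skip`-th position.

    feats: (num_frames, feat_dim), e.g. 80-dim filterbanks at a 10 ms shift.
    Returns (num_frames', stack * feat_dim) features at a lower frame rate.
    Stack/skip values here are illustrative, not taken from the paper.
    """
    num_frames, _ = feats.shape
    starts = range(0, num_frames - stack + 1, skip)
    return torch.stack([feats[t:t + stack].reshape(-1) for t in starts])


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, pad_id: int = 0) -> torch.Tensor:
    """Focal loss over decoder outputs: down-weights easy, well-classified characters.

    logits:  (batch, seq_len, vocab) decoder scores.
    targets: (batch, seq_len) gold character ids, padded with pad_id.
    gamma and pad_id are assumed values for this sketch.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # log-probability of the gold character at each position: log p_t
    target_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # focal modulation (1 - p_t)^gamma suppresses loss on confident predictions
    modulator = (1.0 - target_logp.exp()) ** gamma
    loss = -modulator * target_logp
    # average only over non-padding positions
    mask = (targets != pad_id).float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```

In training, `low_frame_rate` would be applied to the acoustic features before the encoder, and `focal_loss` would replace the standard cross-entropy on the decoder's character predictions; scheduled sampling, the third strategy, would additionally mix model predictions into the decoder inputs during training.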
