The Microsoft 2016 Conversational Speech Recognition System

12 Sep 2016  ·  W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig ·

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper
Task Dataset Model Metric Name Metric Value Global Rank Result Benchmark
Speech Recognition swb_hub_500 WER fullSWBCH VGG/Resnet/LACE/BiLSTM acoustic model trained on SWB+Fisher+CH, N-gram + RNNLM language model trained on Switchboard+Fisher+Gigaword+Broadcast Percentage error 11.9 # 4
Speech Recognition Switchboard + Hub500 RNNLM Percentage error 6.9 # 9
Speech Recognition Switchboard + Hub500 VGG/Resnet/LACE/BiLSTM acoustic model trained on SWB+Fisher+CH, N-gram + RNNLM language model trained on Switchboard+Fisher+Gigaword+Broadcast Percentage error 6.3 # 6
Speech Recognition Switchboard + Hub500 Microsoft 2016 Percentage error 6.2 # 5

Methods