Speech recognition is the task of recognising speech within audio and converting it into text.
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages.
SOTA for Speech Recognition on WSJ eval93
We present a state-of-the-art speech recognition system developed using end-to-end deep learning.
#2 best model for Noisy Speech Recognition on CHiME clean
We present our work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models.
In this paper we describe an extension of the Kaldi software toolkit to support neural-based language modeling, intended for use in automatic speech recognition (ASR) and related tasks.
#2 best model for Speech Recognition on LibriSpeech test-other
This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge, intended to promote the development of noisy ASR in the speech processing community by providing 1) a state-of-the-art system, simplified into a single system comparable to the complicated top systems in the challenge, and 2) a publicly available, reproducible recipe in the main repository of the Kaldi speech recognition toolkit.
Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and of ∼8% over those trained with cross-entropy and sMBR objective functions.
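As a reminder of how such figures are derived, here is a small sketch (not taken from the paper) of how word error rate (WER) and the relative reductions quoted above are computed:

```python
def wer(ref_words, hyp_words):
    """Word-level Levenshtein distance, normalised by reference length."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[-1][-1] / len(ref_words)

def relative_wer_reduction(baseline, improved):
    """E.g. a drop from 10.0% to 8.85% WER is a ~11.5% relative reduction."""
    return (baseline - improved) / baseline
```

Note that relative reduction is computed against the baseline's WER, so the same absolute improvement looks larger against a weaker baseline.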
SOTA for Speech Recognition on WSJ eval92
This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework.
This approach to decoding enables first-pass speech recognition with a language model, completely unaided by the cumbersome infrastructure of HMM-based systems.
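To illustrate the HMM-free decoding idea, here is a minimal sketch of CTC-style greedy first-pass decoding, assuming a hypothetical character vocabulary where index 0 is the CTC blank. (The paper's actual decoder combines a beam search with an external language model; this shows only the collapse step that replaces HMM alignment machinery.)

```python
import numpy as np

# Hypothetical alphabet; "_" (index 0) denotes the CTC blank symbol.
ALPHABET = ["_", "a", "b", "c"]

def ctc_greedy_decode(log_probs):
    """Take the per-frame argmax path, merge repeats, then drop blanks.
    log_probs: (T, V) array of per-frame log-probabilities."""
    best_path = np.argmax(log_probs, axis=1)
    chars, prev = [], None
    for idx in best_path:
        if idx != prev and idx != 0:  # skip repeated labels and blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)

# Per-frame probabilities over the toy alphabet:
probs = np.array([
    [0.10, 0.80, 0.05, 0.05],  # "a"
    [0.10, 0.80, 0.05, 0.05],  # "a" (repeat, collapsed)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "b"
    [0.10, 0.10, 0.70, 0.10],  # "b" (repeat, collapsed)
    [0.05, 0.05, 0.10, 0.80],  # "c"
])
text = ctc_greedy_decode(np.log(probs))  # -> "abc"
```

In a full first-pass decoder, the same collapse rule is applied inside a beam search whose hypotheses are rescored by a language model at each word boundary.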
This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices.
#7 best model for Speech Recognition on LibriSpeech test-other
We propose the use of a coupled 3D Convolutional Neural Network (3D-CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio-visual streams using the learned multimodal features.
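The coupled-encoder idea can be sketched as follows: two 3D-CNN encoders map a video clip and a stacked audio representation into a shared embedding space, and correspondence is scored by cosine similarity. All shapes, kernels, and projections below are toy assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3d_valid(x, k):
    """Naive single-filter 3D convolution with 'valid' padding."""
    T, H, W = x.shape
    t, h, w = k.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(x[i:i + t, j:j + h, l:l + w] * k)
    return out

def encode(x, kernel, proj):
    """One 3D-conv layer + ReLU, flattened and linearly projected."""
    feat = np.maximum(conv3d_valid(x, kernel), 0.0)
    return proj @ feat.ravel()

def correspondence(a, b):
    """Cosine similarity between the two modality embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Toy inputs: a stack of lip frames and a stacked audio-feature "cube".
video = rng.standard_normal((8, 16, 16))
audio = rng.standard_normal((8, 16, 16))

k_v = rng.standard_normal((3, 3, 3))  # video encoder kernel (untrained)
k_a = rng.standard_normal((3, 3, 3))  # audio encoder kernel (untrained)
flat = 6 * 14 * 14  # feature size after the 3x3x3 valid convolution
P_v = rng.standard_normal((64, flat)) / np.sqrt(flat)
P_a = rng.standard_normal((64, flat)) / np.sqrt(flat)

score = correspondence(encode(audio, k_a, P_a), encode(video, k_v, P_v))
```

In the actual approach, both encoders are trained jointly so that matching audio-visual pairs score higher than mismatched ones; the random weights here only demonstrate the data flow.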