Improving the accuracy of single-channel automatic speech recognition (ASR) in noisy conditions is challenging.
In this paper, we explore an improved framework for training a monaural neural enhancement model for robust speech recognition.
In this paper, we propose an online attention mechanism, termed cumulative attention (CA), for streaming Transformer-based automatic speech recognition (ASR).
Impressive progress in neural network-based single-channel speech source separation has been made in recent years.
The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model.
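The MixIT criterion trains a separator on sums of existing mixtures: the model outputs several estimated sources, and the loss is minimized over all ways of assigning those estimates back to the original input mixtures. The sketch below is a minimal, illustrative NumPy implementation of that idea using a negative-SNR reconstruction loss; the function names and the exact loss choice are assumptions, not the paper's implementation.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mixtures):
    """Illustrative mixture invariant training (MixIT) loss.

    est_sources: (M, T) array of separated outputs for the mixture-of-mixtures.
    mixtures:    (K, T) array of the K reference mixtures that were summed to
                 form the model input (K = 2 in the usual MixIT setup).

    Enumerates every binary assignment matrix A in {0,1}^{K x M} whose columns
    sum to 1 (each estimated source is routed to exactly one mixture) and
    returns the lowest total negative SNR between A @ est_sources and mixtures.
    """
    M, _ = est_sources.shape
    K, _ = mixtures.shape

    def neg_snr(ref, est, eps=1e-8):
        # Negative signal-to-noise ratio in dB (lower is better).
        err = ref - est
        return -10.0 * np.log10(np.sum(ref ** 2) / (np.sum(err ** 2) + eps) + eps)

    best = np.inf
    # Each of the M estimated sources is assigned to one of the K mixtures.
    for assign in itertools.product(range(K), repeat=M):
        A = np.zeros((K, M))
        A[list(assign), range(M)] = 1.0
        remixed = A @ est_sources  # (K, T): estimates remixed per assignment
        loss = sum(neg_snr(mixtures[k], remixed[k]) for k in range(K))
        best = min(best, loss)
    return best
```

Because the loss only asks the remixed estimates to reconstruct the input mixtures, MixIT needs no isolated ground-truth sources, which is what makes it usable for training a teacher on unseparated data.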
Online Transformer-based automatic speech recognition (ASR) systems have been extensively studied due to the increasing demand for streaming applications.
In this paper, we present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture in noisy and reverberant environments.
To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing step is applied, which further improves separation performance.
Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate (WER) reductions when multi-channel data are available.
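The simplest way to see why multi-channel data helps is delay-and-sum beamforming: aligning the channels on the target direction makes the speech add coherently while diffuse noise averages out. The sketch below is a minimal illustration of that principle, not the enhancement front-end any of the systems above actually use; the delays are assumed known.

```python
import numpy as np

def delay_and_sum(channels, delays, sr):
    """Illustrative delay-and-sum beamformer.

    channels: (C, T) multi-channel recording.
    delays:   per-channel steering delays in seconds (relative to channel 0).
    sr:       sample rate in Hz.

    Compensates each channel's integer-sample delay and averages, so the
    target signal adds coherently while uncorrelated noise is attenuated.
    """
    C, T = channels.shape
    out = np.zeros(T)
    for c in range(C):
        shift = int(round(delays[c] * sr))
        out += np.roll(channels[c], -shift)  # undo the propagation delay
    return out / C
```

With C uncorrelated noise channels, averaging reduces the noise power by roughly a factor of C while leaving the aligned target untouched, which is the extra gain single-channel models cannot obtain.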