The multiple-hypothesis approach yields a relative WER reduction of 3.3% on the CHiME-4 single-channel real noisy evaluation set when compared with the single-hypothesis approach.
Improving the accuracy of single-channel automatic speech recognition (ASR) in noisy conditions is challenging.
In this paper, we explore an improved framework to train a monaural neural enhancement model for robust speech recognition.
A major bottleneck for building statistical spoken dialogue systems for new domains and applications is the need for large amounts of training data.
In this paper, we propose an online attention mechanism, known as cumulative attention (CA), for streaming Transformer-based automatic speech recognition (ASR).
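The precise CA formulation is defined in the paper itself; purely as a rough illustration of accumulation-based triggering in streaming attention, the toy PyTorch sketch below consumes encoder frames left to right and fires once a running sum of per-frame halting scores crosses a budget. The function name, sigmoid scoring, and threshold are all illustrative assumptions, not the paper's mechanism.

```python
import torch

def cumulative_halting(scores, threshold=1.0):
    """Toy accumulation-based trigger for streaming attention.
    scores: (T,) raw per-frame scores from an encoder chunk.
    Returns the index of the frame where attention is triggered.
    NOTE: a simplified stand-in, not the CA rule from the paper.
    """
    p = torch.sigmoid(scores)          # per-frame halting probabilities
    cum = torch.cumsum(p, dim=0)       # accumulated evidence over time
    trigger = (cum >= threshold).nonzero()
    return int(trigger[0]) if trigger.numel() else len(scores) - 1

# Example: random encoder scores for a 20-frame chunk
torch.manual_seed(0)
print(cumulative_halting(torch.randn(20)))
```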
Models trained on mixed corpora can be more stable in mismatched conditions, with performance reductions ranging from 1% to 8% relative to single-corpus models in matched conditions.
Impressive progress in neural network-based single-channel speech source separation has been made in recent years.
A user's input to a schema-driven information-navigation dialogue system, such as venue search, is typically constrained by the underlying database, which restricts the user to specifying a predefined set of preferences, or slots, corresponding to the database fields.
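For illustration only, a hypothetical venue-search schema might look like the sketch below, where each slot mirrors a database field and any preference outside the schema is rejected; the slot names and values are invented, not taken from any particular system.

```python
# Hypothetical slot schema for a venue-search domain: the database
# fields below define the only preferences a user can specify.
VENUE_SEARCH_SLOTS = {
    "area":       ["north", "south", "centre", "east", "west"],
    "food":       ["italian", "chinese", "indian", "british"],
    "pricerange": ["cheap", "moderate", "expensive"],
}

def is_in_schema(slot, value):
    """Reject any user preference outside the predefined schema."""
    return value in VENUE_SEARCH_SLOTS.get(slot, [])
```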
The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model.
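MixIT selects, among all binary assignments of the model's estimated sources to the two input mixtures, the assignment that best reconstructs both mixtures. A minimal sketch follows; it uses MSE in place of the SNR-based loss typically used in practice, and enumerates assignments exhaustively, which is only feasible for a small number of sources.

```python
import itertools
import torch

def mixit_loss(est_sources, mix1, mix2):
    """Mixture invariant training (MixIT) loss: each estimated source
    is assigned to one of the two reference mixtures, and the binary
    assignment with the lowest total reconstruction error is used.
    est_sources: (M, T); mix1, mix2: (T,).
    """
    M = est_sources.shape[0]
    best = None
    for mask in itertools.product([0, 1], repeat=M):
        a = torch.tensor(mask, dtype=est_sources.dtype).unsqueeze(1)  # (M, 1)
        rec1 = (a * est_sources).sum(dim=0)        # sources assigned to mix1
        rec2 = ((1 - a) * est_sources).sum(dim=0)  # remainder goes to mix2
        loss = ((rec1 - mix1) ** 2).mean() + ((rec2 - mix2) ** 2).mean()
        best = loss if best is None or loss < best else best
    return best
```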
Online Transformer-based automatic speech recognition (ASR) systems have been extensively studied due to the increasing demand for streaming applications.
In this method, multiple automatic speech recognition (ASR) 1-best hypotheses are integrated in the computation of the connectionist temporal classification (CTC) loss function.
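The exact integration rule is specific to the paper; as one plausible form, the sketch below combines per-hypothesis CTC losses with a weighted sum over the 1-best hypotheses used as pseudo-labels. The function name and the weighting scheme are illustrative assumptions; with a single hypothesis of weight one, it reduces to the standard CTC loss.

```python
import torch
import torch.nn.functional as F

def multi_hypothesis_ctc(log_probs, hypotheses, input_len, weights):
    """Weighted sum of CTC losses over several ASR 1-best hypotheses.
    log_probs: (T, 1, C) log-softmax output for a single utterance.
    hypotheses: list of 1-D label tensors (labels in 1..C-1, blank=0).
    weights: list of floats, one per hypothesis.
    """
    total = 0.0
    for hyp, w in zip(hypotheses, weights):
        total = total + w * F.ctc_loss(
            log_probs,
            hyp.unsqueeze(0),                      # (1, L) target labels
            input_lengths=torch.tensor([input_len]),
            target_lengths=torch.tensor([len(hyp)]),
            blank=0,
        )
    return total
```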
Although the lower layers of a deep neural network learn features which are transferable across datasets, these layers are not transferable within the same dataset.
In this paper, we present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture in noisy and reverberant environments.
To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing step is applied, further improving the separation performance.
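The specific dereverberation method is not named here; a common choice for STFT-domain multi-channel dereverberation is weighted prediction error (WPE), shown below using the open-source nara_wpe package. Treating WPE as the pre-processing step is an assumption, and the array shapes are illustrative.

```python
import numpy as np
from nara_wpe.wpe import wpe

# STFT of D-channel reverberant speech, shape (F, D, T). WPE is used
# here as an illustrative stand-in for the unspecified dereverberation
# pre-processing; random data stands in for a real recording.
Y = np.random.randn(257, 6, 100) + 1j * np.random.randn(257, 6, 100)
Z = wpe(Y, taps=10, delay=3, iterations=3)  # dereverberated STFT, same shape
```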
Utterance interpretation is one of the main functions of a dialogue manager, which is the key component of a dialogue system.
Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate improvements if multi-channel data is available.
Interpreting the top layers as a classifier and the lower layers as a feature extractor, one can hypothesize that unwanted network convergence may occur when the classifier has overfit with respect to the feature extractor.
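As one concrete reading of this hypothesis, the PyTorch sketch below splits a toy network into a frozen feature extractor and a classifier that is re-initialized once it has overfit relative to the features. The architecture and layer sizes are arbitrary illustrations, not the paper's model or remedy.

```python
import torch.nn as nn

# Toy network viewed as feature extractor (lower layers) + classifier
# (top layers); sizes are arbitrary.
features = nn.Sequential(nn.Linear(40, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU())
classifier = nn.Linear(256, 1000)

for p in features.parameters():
    p.requires_grad = False          # keep the feature extractor fixed

classifier.reset_parameters()        # restart the overfit classifier
```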
The USFD primary system incorporates state-of-the-art ASR and MT techniques and achieves BLEU scores of 23.45 and 14.75 on the English-to-French and English-to-German speech-to-text translation tasks with the IWSLT 2014 data.