Diverse promising datasets, such as the ASVspoof databases, have been designed to advance the development of fake audio detection.
Predicting the blank tokens takes considerable computation and time, yet only the non-blank tokens appear in the final output sequence.
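A minimal sketch of why this is wasteful (NumPy, names illustrative): greedy CTC decoding scores every frame, including the many blank frames, yet blanks and repeats are collapsed away before the output is produced:

```python
import numpy as np

BLANK = 0  # conventional CTC blank index

def ctc_greedy_collapse(log_probs: np.ndarray) -> list:
    """Greedy CTC decoding: argmax per frame, then collapse.

    log_probs: (T, V) frame-level log-probabilities, where V includes
    the blank. Every frame is scored, blanks included, but blanks and
    repeated tokens never reach the final output sequence.
    """
    path = log_probs.argmax(axis=-1)       # best token per frame
    out, prev = [], None
    for tok in path:
        if tok != BLANK and tok != prev:   # drop blanks and repeats
            out.append(int(tok))
        prev = tok
    return out

# Toy example: 6 frames over a 4-token vocabulary (token 0 = blank).
rng = np.random.default_rng(0)
print(ctc_greedy_collapse(np.log(rng.dirichlet(np.ones(4), size=6))))
```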
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model.
Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
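TDNNs are commonly realized as dilated 1-D convolutions over time; the block below is a generic PyTorch sketch of that pattern, not the paper's exact architecture, showing how growing dilation widens the temporal context:

```python
import torch
import torch.nn as nn

class TDNNBlock(nn.Module):
    """A generic TDNN layer: a dilated 1-D convolution over time."""
    def __init__(self, in_dim, out_dim, context=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context,
                              dilation=dilation,
                              padding=(context - 1) // 2 * dilation)
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):              # x: (batch, feat, time)
        return self.norm(self.act(self.conv(x)))

# Stacking blocks with increasing dilation enlarges the receptive field.
net = nn.Sequential(TDNNBlock(40, 256, dilation=1),
                    TDNNBlock(256, 256, dilation=2),
                    TDNNBlock(256, 256, dilation=4))
feats = torch.randn(8, 40, 200)        # (batch, mel bins, frames)
print(net(feats).shape)                # torch.Size([8, 256, 200])
```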
The joint training framework for speech enhancement and recognition has achieved good performance for robust end-to-end automatic speech recognition (ASR).
Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder and a two-stage inference method into the streaming CTC model.
In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data.
To address this problem and improve the inference speed, we propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate the convergence.
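As a rough illustration of the spike idea (NumPy, illustrative rather than the authors' implementation): CTC posteriors spike at non-blank tokens, so counting spikes on the greedy path estimates the target length, and the encoder states at those frames can seed the non-autoregressive decoder:

```python
import numpy as np

BLANK = 0

def spike_frames(ctc_log_probs: np.ndarray) -> np.ndarray:
    """Return indices of frames where the greedy CTC path spikes.

    A spike is a frame whose argmax is non-blank and differs from the
    previous frame's argmax, i.e. one collapsed output token. The spike
    count estimates the target-sequence length.
    """
    path = ctc_log_probs.argmax(axis=-1)
    keep = (path != BLANK) & (np.concatenate(([BLANK], path[:-1])) != path)
    return np.flatnonzero(keep)

enc = np.random.randn(50, 256)                       # (frames, hidden)
post = np.log(np.random.dirichlet(np.ones(30), 50))  # fake CTC posteriors
idx = spike_frames(post)
print("predicted length:", len(idx))
triggers = enc[idx]   # one decoder input per predicted output token
```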
Without beam search, the one-pass propagation of LASO greatly reduces inference time.
In this paper, we propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features, which is based on deep clustering (DC).
Second, to pay more attention to the outputs of the pre-separation stage, an attention module is applied to acquire deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech.
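A minimal sketch of one plausible reading of this fusion step (shapes and the function name are illustrative assumptions, not the paper's code): attention weights derived from mixture/pre-separated similarity re-weight the pre-separated stream before fusion:

```python
import torch
import torch.nn.functional as F

def attention_fusion(mix: torch.Tensor, sep: torch.Tensor) -> torch.Tensor:
    """Fuse mixture and pre-separated features with attention.

    mix: (T, D) mixture features; sep: (T, D) pre-separated features.
    Similarity between the two streams yields attention weights that
    re-weight the pre-separated stream before concatenation.
    """
    scores = mix @ sep.T / sep.shape[-1] ** 0.5   # (T, T) similarity
    attn = F.softmax(scores, dim=-1)              # attention weights
    attended = attn @ sep                         # (T, D) re-weighted sep
    return torch.cat([mix, attended], dim=-1)     # (T, 2D) fusion features

fused = attention_fusion(torch.randn(100, 129), torch.randn(100, 129))
print(fused.shape)  # torch.Size([100, 258])
```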
Recently, language identity information has been utilized to improve the performance of end-to-end code-switching (CS) speech recognition.
Specifically, we apply the deep clustering network to extract deep embedding features.
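For context, a deep clustering network assigns each time-frequency bin an embedding, and clustering those embeddings yields source masks; a compact sketch with scikit-learn's k-means standing in for the assignment step (shapes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def dc_masks(embeddings: np.ndarray, n_sources: int = 2) -> np.ndarray:
    """Turn per-bin deep clustering embeddings into binary source masks.

    embeddings: (T, F, D), one D-dim vector per time-frequency bin.
    K-means assigns each bin to a source.
    """
    T, F, D = embeddings.shape
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(
        embeddings.reshape(-1, D))
    # One binary mask per source, back in (T, F) layout.
    return np.stack([(labels == k).reshape(T, F) for k in range(n_sources)])

masks = dc_masks(np.random.randn(50, 129, 20))
print(masks.shape)  # (2, 50, 129)
```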
Once a fixed-length chunk of the input sequence is processed by the encoder, the decoder begins to predict symbols immediately.
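A schematic of that chunk-wise loop, with encode_chunk and decode_step as hypothetical placeholder interfaces rather than any real API:

```python
from typing import Callable, Iterable

def streaming_decode(frames: Iterable,
                     encode_chunk: Callable,
                     decode_step: Callable,
                     chunk_size: int = 16) -> list:
    """Emit symbols as soon as each fixed-length chunk is encoded.

    frames: a stream of acoustic frames. encode_chunk and decode_step
    stand in for the encoder and decoder; the decoder runs immediately
    after each chunk instead of waiting for the full utterance.
    """
    output: list = []
    chunk: list = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:              # a full chunk arrived
            states = encode_chunk(chunk)          # encode only this chunk
            output = decode_step(states, output)  # predict right away
            chunk = []
    return output

# Toy usage with trivial stand-ins: one "symbol" per 16-frame chunk.
out = streaming_decode(range(64),
                       encode_chunk=lambda c: c,
                       decode_step=lambda s, o: o + [sum(s)])
print(out)
```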
To alleviate the above two issues, we propose a unified method called LST (Learn Spelling from Teachers) to integrate knowledge from external text-only data into an AED model and leverage the whole context of a sentence.
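One common realization of this kind of knowledge transfer, offered here as a hedged sketch rather than the paper's exact formulation, is a distillation loss that pulls the AED output distribution toward a teacher language model's soft labels alongside the usual cross-entropy:

```python
import torch
import torch.nn.functional as F

def lst_style_loss(student_logits, teacher_logits, targets,
                   alpha=0.5, tau=1.0):
    """Cross-entropy on hard targets plus KL toward a teacher LM.

    student_logits, teacher_logits: (N, V); targets: (N,) token ids.
    alpha balances hard labels against the teacher's soft labels; the
    hyperparameter values here are illustrative.
    """
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    return (1 - alpha) * ce + alpha * kl

loss = lst_style_loss(torch.randn(32, 1000), torch.randn(32, 1000),
                      torch.randint(0, 1000, (32,)))
print(loss.item())
```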
Furthermore, a path-aware regularization is proposed to assist the SA-T in learning alignments and improve performance.
First, a DC network is trained to extract deep embedding features, which contain each source's information and are advantageous for discriminating the target speakers.
Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial.
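A standard technique for this integration is shallow fusion, where beam-search candidates are scored by interpolating the ASR and LM log-probabilities; the sketch below shows only that scoring step, and the lm_weight value is illustrative:

```python
import torch
import torch.nn.functional as F

def fused_scores(asr_logits: torch.Tensor, lm_logits: torch.Tensor,
                 lm_weight: float = 0.3) -> torch.Tensor:
    """Shallow fusion: combine ASR and LM log-probabilities per token.

    asr_logits, lm_logits: (V,) next-token logits from each model.
    lm_weight is the interpolation weight, a typical tuning knob.
    """
    return (F.log_softmax(asr_logits, dim=-1)
            + lm_weight * F.log_softmax(lm_logits, dim=-1))

scores = fused_scores(torch.randn(1000), torch.randn(1000))
print(scores.topk(5).indices)  # top beam-expansion candidates
```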