Non-autoregressive (NAR) models generate multiple outputs in a sequence simultaneously, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive baselines.
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a Transformer-based sequence-to-sequence model for automatic speech recognition (ASR) capable of simultaneously transcribing and annotating audio with linguistic information such as phonemic transcripts or part-of-speech (POS) tags.
We propose a system that combines audio segmentation with non-autoregressive ASR to achieve high accuracy and a low real-time factor (RTF).
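As a reminder of what a low RTF means, the real-time factor is conventionally the processing time divided by the duration of the audio processed, so values below 1 indicate faster-than-real-time decoding. The function name and the numbers below are illustrative assumptions, not measurements from the work described.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent decoding / duration of the decoded audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical example: 2 s of compute for a 10 s utterance.
rtf = real_time_factor(2.0, 10.0)
print(f"RTF = {rtf:.2f}")
```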
One NAT model, mask-predict, has been applied to ASR, but the model needs heuristics or an additional component to estimate the length of the output token sequence.
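The dependence on a known output length can be seen in a minimal sketch of mask-predict style decoding: the token buffer is allocated with a fixed `length` before any prediction happens, and each iteration only re-masks and re-predicts the least confident positions. The `toy_predict` function is a hypothetical random stand-in for a real decoder, and the linear re-masking schedule is an assumption based on the commonly described form of the algorithm.

```python
import random

MASK = None  # placeholder for a masked position


def toy_predict(length, vocab_size, rng):
    """Hypothetical decoder stand-in: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in range(length)]


def mask_predict(length, iterations, vocab_size=10, seed=0):
    """Iteratively fill a fixed-length buffer, re-masking low-confidence tokens."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    rng = random.Random(seed)
    tokens = [MASK] * length       # the length must be known up front
    confidence = [0.0] * length
    for t in range(iterations):
        predictions = toy_predict(length, vocab_size, rng)
        # Only masked positions are overwritten with new predictions.
        for i, (token, prob) in enumerate(predictions):
            if tokens[i] is MASK:
                tokens[i] = token
                confidence[i] = prob
        # Linearly shrinking number of positions to re-mask; zero on the last pass.
        n_mask = length * (iterations - t - 1) // iterations
        by_confidence = sorted(range(length), key=confidence.__getitem__)
        for i in by_confidence[:n_mask]:
            tokens[i] = MASK
    return tokens


print(mask_predict(length=8, iterations=4))
```

Because `length` is an input rather than something the loop discovers, a separate length predictor or a heuristic is needed before decoding starts.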
This report investigates the ability of E2E ASR from standard close-talk to far-field applications by encompassing entire multichannel speech enhancement and ASR components within the S2S model.
The proposed method first uses a DNN-based mask estimator to separate the mixture signal into the keyword signal uttered by the target speaker and the remaining background speech.