Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces the inference speed at the cost of accuracy drop compared to autoregressive baselines.
The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform.
This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training.
This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density.
In this work, Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC.
Audio and Speech Processing Sound
We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs.
Then, we show the effect of emotion on speaker recognition.
Spoken language identification (LID) technologies have improved in recent years from discriminating largely distinct languages to discriminating highly similar languages or even dialects of the same language.
In this paper, we study two different non-autoregressive transformer structure for automatic speech recognition (ASR): A-CMLM and A-FMLM.
On BabyTrain corpus, we observe relative gains of 10. 38% and 12. 40% in minDCF and EER respectively.
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers.
Audio and Speech Processing
1 code implementation • 13 Sep 2019 • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).
Ranked #4 on Speech Recognition on AISHELL-1
We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT).
no code implementations • 30 Mar 2018 • Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai
This paper introduces a new open source platform for end-to-end speech processing named ESPnet.