no code implementations • 1 Jun 2023 • Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shou-Yiin Chang, Rohit Prabhavalkar, Hung-Yi Lee, Tara N. Sainath
In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks.
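The excerpt does not spell out the score itself. As a rough illustration of how a score-based assessment can rank PSMs without running full fine-tuning, the sketch below fits a cheap ridge-regression probe on frozen PSM features and uses its fit quality as a proxy; the function name and the probe choice are assumptions, not the paper's definition.

```python
import numpy as np

def transferability_score(features: np.ndarray, labels: np.ndarray,
                          reg: float = 1e-3) -> float:
    """Proxy transferability score: negative loss of a linear probe fit on
    frozen pre-trained features (higher is better). Hypothetical stand-in
    for the paper's actual score."""
    y = np.eye(labels.max() + 1)[labels]          # one-hot targets
    f = features - features.mean(axis=0)          # center the frozen features
    w = np.linalg.solve(f.T @ f + reg * np.eye(f.shape[1]), f.T @ y)
    return -float(np.mean((f @ w - y) ** 2))

# Rank candidate PSMs on the same target-task data, e.g.:
# best = max(psm_feats, key=lambda m: transferability_score(psm_feats[m], labels))
```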
no code implementations • 2 Mar 2023 • Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
Automatic Speech Recognition (ASR)
no code implementations • 8 Feb 2023 • Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
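As a sketch of the generative core, the loop below runs standard ancestral DDPM sampling conditioned on a text embedding. The `model(x_t, t, text_emb)` interface, linear noise schedule, and sample rate are assumptions; Noise2Music additionally cascades models that refine a low-fidelity first pass into high-quality audio.

```python
import torch

@torch.no_grad()
def sample_clip(model, text_emb, num_steps=1000, shape=(1, 30 * 16000)):
    """Ancestral DDPM sampling of a 30 s waveform (16 kHz assumed) from text.
    `model` is assumed to predict the added noise (epsilon)."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = model(x, torch.tensor([t]), text_emb)         # predicted noise
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # posterior noise
    return x
```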
no code implementations • 19 Jan 2023 • Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman
In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages (a reprogramming sketch follows below).
Automatic Speech Recognition (ASR)
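A minimal sketch of the reprogramming idea, under assumed shapes and module names: the English ASR model stays frozen, and only a small additive input transform plus an output label map are trained for the new language.

```python
import torch
import torch.nn as nn

class ReprogrammedASR(nn.Module):
    """Input-reprogramming sketch: re-purpose a frozen English ASR model by
    learning (i) an additive perturbation on the input features and (ii) a
    linear map from English token posteriors to target-language tokens."""
    def __init__(self, frozen_asr: nn.Module, feat_dim: int,
                 n_source_tokens: int, n_target_tokens: int):
        super().__init__()
        for p in frozen_asr.parameters():
            p.requires_grad = False               # the backbone is never updated
        self.asr = frozen_asr
        self.delta = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.label_map = nn.Linear(n_source_tokens, n_target_tokens)

    def forward(self, feats):                     # feats: (batch, time, feat_dim)
        logits = self.asr(feats + self.delta)     # frozen model, reprogrammed input
        return self.label_map(logits)
```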
no code implementations • 2 Nov 2022 • Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee
We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scale acoustic models in low-resource scenarios.
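A toy, classically simulated quantum kernel makes the idea concrete: encode each feature into one qubit's rotation angle, take the squared state overlap as the kernel, and plug it into any kernel machine. For this entanglement-free product-state map the fidelity has a closed form; real QKL circuits typically add entangling gates, and the acoustic front end is omitted here.

```python
import numpy as np
from sklearn.svm import SVC

def product_state_quantum_kernel(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """k(x, y) = |<phi(x)|phi(y)>|^2 for a one-qubit-per-feature angle
    encoding; the overlap factorizes as prod_i cos^2((x_i - y_i) / 2)."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.prod(np.cos(diff / 2.0) ** 2, axis=-1)

# A kernel machine on top, e.g. for a small low-resource task:
# clf = SVC(kernel=product_state_quantum_kernel).fit(X_train, y_train)
```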
no code implementations • 28 Oct 2022 • Nobuyuki Morioka, Heiga Zen, Nanxin Chen, Yu Zhang, Yifan Ding
Adapting a neural text-to-speech (TTS) model to a target speaker typically involves fine-tuning most, if not all, of the parameters of a pretrained multi-speaker backbone model.
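One common parameter-efficient alternative is to freeze the backbone and train tiny residual adapters for the target speaker; the sketch below shows such an adapter (sizes and placement are assumptions, and the paper's exact scheme is not given in this excerpt).

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck adapter inserted into a frozen multi-speaker TTS backbone;
    only these few parameters are trained during speaker adaptation."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)            # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

# Adaptation loop sketch: freeze the backbone, optimize adapters only.
# for p in tts_backbone.parameters():
#     p.requires_grad = False
# opt = torch.optim.Adam([p for a in adapters for p in a.parameters()], lr=1e-4)
```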
no code implementations • 18 Oct 2022 • Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen
First, we show that by combining speech representations with byte-level text representations and language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%.
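A sketch of the text side of that combination, with assumed dimensions: text is tokenized as raw UTF-8 bytes (a 256-entry vocabulary shared by every language) and a learned language embedding is added, so languages without supervised speech can reuse the shared byte space.

```python
import torch
import torch.nn as nn

class ByteTextEncoder(nn.Module):
    """Byte-level text representation plus a language embedding (sketch)."""
    def __init__(self, d_model: int = 512, n_languages: int = 128):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)   # one entry per byte value
        self.lang_emb = nn.Embedding(n_languages, d_model)

    def forward(self, text: str, lang_id: int) -> torch.Tensor:
        ids = torch.tensor(list(text.encode("utf-8")))       # byte tokens
        return self.byte_emb(ids) + self.lang_emb(torch.tensor(lang_id))
```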
no code implementations • 31 Mar 2022 • Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani
Neural vocoders based on the denoising diffusion probabilistic model (DDPM) have been improved by adapting the diffusion noise distribution to the given acoustic features.
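A PriorGrad-style simplification of that adaptation, with assumed constants: instead of starting the reverse diffusion from N(0, I), draw noise whose per-sample variance tracks the frame energy of the conditioning mel-spectrogram (the actual paper shapes the noise spectrally, which this sketch does not attempt).

```python
import torch

def feature_adapted_noise(mel: torch.Tensor, hop: int = 256,
                          floor: float = 0.1) -> torch.Tensor:
    """Diffusion noise ~ N(0, diag(std^2)) with std following the mel frame
    energy, upsampled to the waveform rate. mel: (frames, n_mels), log scale."""
    energy = mel.exp().mean(dim=-1)               # (frames,) per-frame energy
    std = energy.clamp_min(floor).sqrt()
    std = std.repeat_interleave(hop)              # frame rate -> sample rate
    return torch.randn_like(std) * std
```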
no code implementations • 11 Oct 2021 • Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe
Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly speeds up inference at the cost of an accuracy drop compared to autoregressive baselines (one such decoding scheme, mask-predict, is sketched below).
Automatic Speech Recognition (ASR)
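The mask-predict scheme is sketched below under an assumed `decoder(tokens, enc_out) -> logits` interface: all positions are emitted in parallel, then the least-confident ones are re-masked and re-predicted for a fixed number of passes.

```python
import torch

@torch.no_grad()
def mask_predict_decode(decoder, enc_out, length: int, mask_id: int,
                        iterations: int = 4) -> torch.Tensor:
    """Constant number of decoding passes regardless of sequence length."""
    ys = torch.full((1, length), mask_id)                 # start fully masked
    for it in range(iterations):
        probs = decoder(ys, enc_out).softmax(dim=-1)      # parallel prediction
        conf, pred = probs.max(dim=-1)
        ys = pred
        n_mask = int(length * (1 - (it + 1) / iterations))
        if n_mask > 0:                                    # re-mask low confidence
            idx = conf.topk(n_mask, largest=False).indices
            ys[0, idx[0]] = mask_id
    return ys
```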
2 code implementations • 17 Jun 2021 • Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan
The model takes an input phoneme sequence and, through an iterative refinement process, generates an audio waveform.
no code implementations • 2 Nov 2020 • Nanxin Chen, Piotr Żelasko, Jesús Villalba, Najim Dehak
This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training.
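One simple diagnostic of this kind (the paper's exact measure may differ) scores how diagonal, i.e. monotonic, a source-target attention map is; degenerate non-diagonal maps often flag broken alignments.

```python
import torch

def attention_diagonality(att: torch.Tensor) -> float:
    """att: (decoder_steps, encoder_frames) attention weights, rows sum to 1.
    Returns the attention-weighted distance from the ideal diagonal
    (0 = perfectly diagonal; larger = more degenerate)."""
    T, S = att.shape
    rows = torch.arange(T, dtype=torch.float).unsqueeze(1) / max(T - 1, 1)
    cols = torch.arange(S, dtype=torch.float).unsqueeze(0) / max(S - 1, 1)
    distance = (rows - cols).abs()               # 0 on the diagonal
    return float((att * distance).sum() / att.sum())
```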
6 code implementations • ICLR 2021 • Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan
This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density.
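WaveGrad's released sampler uses a DDPM-style update, but the gradient view in the abstract can be illustrated with annealed Langevin refinement: start from noise and repeatedly step along the model's estimate of the gradient of the conditional data log-density. The `score_model(y, mel)` signature and step schedule are assumptions.

```python
import torch

@torch.no_grad()
def langevin_refine(score_model, mel, n_samples: int, step_sizes) -> torch.Tensor:
    """Iteratively refine noise into a waveform conditioned on a mel input."""
    y = torch.randn(n_samples)
    for eps in step_sizes:                       # decaying step sizes
        grad = score_model(y, mel)               # estimated grad log p(y | mel)
        y = y + 0.5 * eps * grad + (eps ** 0.5) * torch.randn_like(y)
    return y

# e.g. langevin_refine(model, mel, n_samples=16000,
#                      step_sizes=[1e-3 * 0.9 ** i for i in range(50)])
```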
no code implementations • 18 May 2020 • Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi
In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC (a joint-loss sketch follows below).
Audio and Speech Processing
Sound
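A joint-loss sketch under an assumed `encode`/`decode` interface: a CTC loss on the encoder output is interpolated with a masked-prediction loss in which the decoder fills in randomly masked ground-truth tokens (the paper's masking policy and weighting may differ).

```python
import torch
import torch.nn.functional as F

def mask_ctc_loss(model, feats, ys, mask_id: int, blank_id: int = 0,
                  alpha: float = 0.3, mask_prob: float = 0.3):
    """feats: (batch, time, dim); ys: (batch, tgt_len) token ids."""
    enc_out, ctc_logits = model.encode(feats)              # assumed API
    # CTC branch on the encoder output
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)  # (T, B, V)
    in_lens = torch.full((feats.size(0),), log_probs.size(0), dtype=torch.long)
    tgt_lens = torch.full((feats.size(0),), ys.size(1), dtype=torch.long)
    ctc = F.ctc_loss(log_probs, ys, in_lens, tgt_lens, blank=blank_id)
    # Mask-prediction branch on the decoder
    mask = torch.rand(ys.shape) < mask_prob
    dec_logits = model.decode(ys.masked_fill(mask, mask_id), enc_out)
    mlm = F.cross_entropy(dec_logits[mask], ys[mask])
    return alpha * ctc + (1 - alpha) * mlm
```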
1 code implementation • 18 May 2020 • Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J. G. A. Dolfing, Sameer Khurana, Tanel Alumäe, Antoine Laurent
We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs.
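A standard remedy for poor initialization, sketched below, is to seed the codebook with k-means over a batch of encoder outputs rather than with random weights (the paper's proposed fixes may differ in detail).

```python
import torch

def kmeans_init_codebook(enc_out: torch.Tensor, n_codes: int,
                         iters: int = 10) -> torch.Tensor:
    """enc_out: (n_vectors, dim) encoder outputs; returns (n_codes, dim)."""
    seeds = torch.randperm(enc_out.size(0))[:n_codes]
    codes = enc_out[seeds].clone()                        # data points as seeds
    for _ in range(iters):
        assign = torch.cdist(enc_out, codes).argmin(dim=-1)
        for k in range(n_codes):                          # recompute centroids
            members = enc_out[assign == k]
            if members.size(0) > 0:
                codes[k] = members.mean(dim=0)
    return codes
```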
no code implementations • 12 Feb 2020 • Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, Nanxin Chen, Najim Dehak
Then, we show the effect of emotion on speaker recognition.
no code implementations • 29 Jan 2020 • Andrew Titus, Jan Silovsky, Nanxin Chen, Roger Hsiao, Mary Young, Arnab Ghoshal
Spoken language identification (LID) technologies have improved in recent years from discriminating largely distinct languages to discriminating highly similar languages or even dialects of the same language.
no code implementations • 10 Nov 2019 • Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak
In this paper, we study two different non-autoregressive Transformer structures for automatic speech recognition (ASR): A-CMLM and A-FMLM.
Automatic Speech Recognition (ASR)
1 code implementation • 25 Oct 2019 • Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Nanxin Chen, Paola García, Najim Dehak
On the BabyTrain corpus, we observe relative gains of 10.38% and 12.40% in minDCF and EER, respectively.
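For clarity, "relative gain" here is the percentage drop against the baseline metric; the numbers in the comment below are invented purely to illustrate the arithmetic.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Percentage improvement relative to the baseline value."""
    return 100.0 * (baseline - improved) / baseline

# Hypothetical example: an EER drop from 5.00% to 4.38% gives
# relative_gain(5.00, 4.38) -> 12.4, i.e. a 12.4% relative gain.
```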
3 code implementations • 23 Oct 2019 • Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers (a conditioning sketch follows below).
Audio and Speech Processing
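A conditioning sketch of the zero-shot setup, with placeholder module names: a separately trained speaker encoder turns a short reference clip into an embedding (e.g. a d-vector or LDE embedding), and the synthesizer is conditioned on it, so unseen speakers need no fine-tuning.

```python
import torch
import torch.nn as nn

class SpeakerConditionedTTS(nn.Module):
    """Zero-shot speaker adaptation via embedding conditioning (sketch)."""
    def __init__(self, synthesizer: nn.Module, speaker_encoder: nn.Module):
        super().__init__()
        self.synthesizer = synthesizer
        self.speaker_encoder = speaker_encoder

    def forward(self, text_ids: torch.Tensor, reference_wav: torch.Tensor):
        spk_emb = self.speaker_encoder(reference_wav)   # one short clip suffices
        return self.synthesizer(text_ids, spk_emb)      # condition on speaker
```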
1 code implementation • 13 Sep 2019 • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).
Ranked #7 on Speech Recognition on AISHELL-1
Automatic Speech Recognition (ASR)
1 code implementation • 1 Apr 2019 • Cheng-I Lai, Nanxin Chen, Jesús Villalba, Najim Dehak
We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT).
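The squeeze-excitation block at the heart of ASSERT-style networks is standard and easy to state: global-average-pool each channel ("squeeze"), then gate the channels with a small bottleneck MLP ("excitation"). Channel counts below are assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel-wise gating block as used in squeeze-excitation CNNs."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, freq, time)
        s = x.mean(dim=(2, 3))                 # squeeze: per-channel statistics
        g = self.fc(s)[:, :, None, None]       # excitation: gates in [0, 1]
        return x * g
```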
no code implementations • 30 Mar 2018 • Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai
This paper introduces a new open source platform for end-to-end speech processing named ESPnet.
Automatic Speech Recognition (ASR)