no code implementations • 9 Apr 2024 • Yuchen Zhu, Tianrong Chen, Evangelos A. Theodorou, Xie Chen, Molei Tao
This article considers generative modeling of the states of quantum systems and proposes an approach based on denoising diffusion models.
no code implementations • 9 Apr 2024 • Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu
Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS).
Automatic Speech Recognition (ASR) +2
no code implementations • 13 Feb 2024 • Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
We found that delicate designs are not necessary; an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a single trainable linear projector is competent for the ASR task.
Automatic Speech Recognition (ASR) +1
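The composition described in the entry above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frozen speech encoder and LLM are stubbed with random arrays, and all dimensions (512-d encoder features, 4096-d LLM embeddings) are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 512-d speech-encoder features, 4096-d LLM embeddings.
D_SPEECH, D_LLM = 512, 4096

# The only trainable component: a linear projector from the speech
# encoder's output space into the LLM's embedding space.
W = rng.normal(scale=0.02, size=(D_SPEECH, D_LLM))
b = np.zeros(D_LLM)

def project(speech_feats: np.ndarray) -> np.ndarray:
    """Map frozen speech-encoder frames (T, D_SPEECH) to LLM embeddings (T, D_LLM)."""
    return speech_feats @ W + b

# Frozen pieces are stubbed out: 100 frames of encoder output and a
# 5-token text prompt already embedded by the LLM's embedding table.
speech_feats = rng.normal(size=(100, D_SPEECH))
prompt_embeds = rng.normal(size=(5, D_LLM))

# The LLM would consume the concatenation of prompt and projected speech frames.
llm_input = np.concatenate([prompt_embeds, project(speech_feats)], axis=0)
print(llm_input.shape)  # (105, 4096)
```

Only `W` and `b` would receive gradients during training; the encoder and LLM stay frozen, which is what makes the recipe "embarrassingly simple".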
no code implementations • 2 Feb 2024 • Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
By integrating Spatial-AST with LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment.
no code implementations • 25 Jan 2024 • Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt.
no code implementations • 14 Jan 2024 • Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation.
1 code implementation • 7 Jan 2024 • Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen
Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress.
2 code implementations • 23 Dec 2023 • Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen
To the best of our knowledge, emotion2vec is the first universal representation model in various emotion-related tasks, filling a gap in the field.
no code implementations • 14 Dec 2023 • Junjie Li, Yiwei Guo, Xie Chen, Kai Yu
Zero-shot voice conversion (VC) aims to convert the timbre of a source speaker to that of an arbitrary unseen target speaker while keeping the linguistic content unchanged.
no code implementations • 2 Nov 2023 • Hanglei Zhang, Yiwei Guo, Sen Liu, Xie Chen, Kai Yu
The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions.
1 code implementation • 25 Sep 2023 • Guanrou Yang, Ziyang Ma, Zhisheng Zheng, Yakun Song, Zhikang Niu, Xie Chen
Recent years have witnessed significant advancements in self-supervised learning (SSL) methods for speech-processing tasks.
no code implementations • 19 Sep 2023 • Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen
In this paper, we explored how to boost speech emotion recognition (SER) with a state-of-the-art pre-trained speech model (PTM), data2vec; a text generation technique, GPT-4; and a speech synthesis technique, Azure TTS.
no code implementations • 18 Sep 2023 • Junzhe Liu, Jianwei Yu, Xie Chen
End-to-end models, such as the neural Transducer, have been successful in integrating acoustic and linguistic information jointly to achieve excellent recognition performance.
1 code implementation • 14 Sep 2023 • Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen
The proficiency of self-supervised learning (SSL) in speech-related tasks has driven research into utilizing discrete tokens for speech tasks such as recognition and translation, which offer lower storage requirements and great potential for employing natural language processing techniques.
no code implementations • 14 Sep 2023 • Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen
Despite the excellent strides made by end-to-end (E2E) models in speech recognition in recent years, named entity recognition remains challenging yet critical for semantic understanding.
no code implementations • 10 Sep 2023 • Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu
Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency.
no code implementations • 28 Aug 2023 • Zhisheng Zheng, Ziyang Ma, Yu Wang, Xie Chen
In recent years, speech-based self-supervised learning (SSL) has made significant progress in various tasks, including automatic speech recognition (ASR).
no code implementations • 25 Jun 2023 • Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu
Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker timbres (i.e., speaker similarity) and eliminate the accents from their first language (i.e., nativeness).
no code implementations • 23 Jun 2023 • Mingyu Cui, Jiawen Kang, Jiajun Deng, Xi Yin, Yutao Xie, Xie Chen, Xunying Liu
Current ASR systems are mainly trained and evaluated at the utterance level.
1 code implementation • 15 Jun 2023 • Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang, Chao Zhang, Xie Chen
Our models outperform other SSL models significantly on the LibriSpeech benchmark without the need for iterative re-clustering and re-training.
no code implementations • 14 Jun 2023 • Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen
Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and exhibit excellent performance in general speech recognition.
Automatic Speech Recognition (ASR) +5
1 code implementation • 19 May 2023 • Yifan Yang, Xiaoyu Yang, Liyong Guo, Zengwei Yao, Wei Kang, Fangjun Kuang, Long Lin, Xie Chen, Daniel Povey
Neural Transducer and connectionist temporal classification (CTC) are popular end-to-end automatic speech recognition systems.
no code implementations • 30 Mar 2023 • Chenpeng Du, Qi Chen, Xie Chen, Kai Yu
Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly.
no code implementations • 18 Feb 2023 • Xie Chen, Ziyang Ma, Changli Tang, Yujin Wang, Zhisheng Zheng
However, the training of SSL models is computationally expensive and a common practice is to fine-tune a released SSL model on the specific task.
no code implementations • 17 Nov 2022 • Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu
Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the values of the specified emotion and Neutral are set to α and 1-α, respectively.
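The soft-label guidance above can be illustrated with a small sketch. This is an assumption-laden toy, not EmoDiff's actual sampler: the classifier gradients are stubbed with fixed vectors, and `soft_label_guidance` is a hypothetical name for the α-weighted mixture of per-class log-probability gradients.

```python
import numpy as np

def soft_label_guidance(grad_logp_emotion, grad_logp_neutral, alpha):
    """Classifier-guidance gradient under a soft label: weight alpha on the
    target emotion and (1 - alpha) on Neutral, interpolating intensity."""
    return alpha * grad_logp_emotion + (1.0 - alpha) * grad_logp_neutral

# Stub gradients of the classifier's log-probabilities w.r.t. the noisy sample.
g_sad = np.array([1.0, -2.0, 0.5])
g_neutral = np.array([0.0, 1.0, 0.0])

# alpha = 1.0 recovers one-hot guidance toward the emotion;
# alpha = 0.0 guides toward Neutral; values in between blend the two.
print(soft_label_guidance(g_sad, g_neutral, 0.8))
```

Varying α at sampling time thus gives a continuous knob on emotion intensity without retraining.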
no code implementations • 17 Nov 2022 • Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian
This motivates us to leverage the factorized neural transducer structure, which contains a real language model: the vocabulary predictor.
Automatic Speech Recognition (ASR) +3
1 code implementation • 14 Nov 2022 • Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen
In this paper, we provide a new perspective on self-supervised speech models from how the training targets are obtained.
Ranked #40 on Speech Recognition on LibriSpeech test-other
no code implementations • 11 Nov 2022 • Tianrui Wang, Xie Chen, Zhuo Chen, Shu Yu, Weibin Zhu
In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data.
no code implementations • 27 Oct 2022 • Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen, Wei-Qiang Zhang
Recent years have witnessed great strides in self-supervised learning (SSL) for speech processing.
Automatic Speech Recognition (ASR) +2
no code implementations • 2 Apr 2022 • Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu
The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform according to the given acoustic features.
no code implementations • 29 Nov 2021 • Junhao Xu, Xie Chen, Shoukang Hu, Jianwei Yu, Xunying Liu, Helen Meng
Index Terms: Language models, Recurrent neural networks, Quantization, Alternating direction methods of multipliers.
no code implementations • 6 Oct 2021 • Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
Automatic Speech Recognition (ASR) +2
1 code implementation • 27 Sep 2021 • Xie Chen, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li
In recent years, end-to-end (E2E) based automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance.
Automatic Speech Recognition (ASR) +2
no code implementations • 4 Jun 2021 • Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong
In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference.
no code implementations • 2 Feb 2021 • Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong
The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method.
Automatic Speech Recognition (ASR) +2
no code implementations • 3 Nov 2020 • Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong
External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models.
Automatic Speech Recognition (ASR) +2
no code implementations • 22 Oct 2020 • Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li
Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition.
no code implementations • 21 Oct 2020 • Xie Chen, Sarangarajan Parthasarathy, William Gale, Shuangyu Chang, Michael Zeng
The context information is captured by the hidden states of LSTM-LMs across utterance and can be used to guide the first-pass search effectively.
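The state carry-over described above can be sketched minimally. This is an illustrative toy under stated assumptions: a vanilla RNN cell stands in for the LSTM, parameters are random stubs, and `run_utterance` is a hypothetical helper; only the pattern of passing the final hidden state into the next utterance reflects the entry.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, HID = 50, 16

# Toy parameters for a single recurrent cell (a vanilla RNN stands in
# for the LSTM; the cross-utterance state-passing logic is the same).
E = rng.normal(scale=0.1, size=(VOCAB, HID))   # word embeddings
Wh = rng.normal(scale=0.1, size=(HID, HID))    # recurrent weights

def run_utterance(tokens, h):
    """Process one utterance, starting from the hidden state h carried
    over from the previous utterance instead of a zero state."""
    for tok in tokens:
        h = np.tanh(E[tok] + h @ Wh)
    return h

# The final hidden state of each utterance initializes the next one,
# so context flows across utterance boundaries during first-pass search.
h = np.zeros(HID)
for utt in [[3, 7, 12], [5, 1], [9, 9, 2, 4]]:
    h = run_utterance(utt, h)
print(h.shape)  # (16,)
```

Resetting `h` to zeros at each utterance would recover the conventional per-utterance LM; carrying it forward is what injects cross-utterance context.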
1 code implementation • 16 Jun 2020 • Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia
Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models.
no code implementations • 11 Nov 2019 • Sarangarajan Parthasarathy, William Gale, Xie Chen, George Polovets, Shuangyu Chang
We conduct language modeling and speech recognition experiments on the publicly available LibriSpeech corpus.
no code implementations • WS 2018 • Meng Zhang, Xie Chen, Ronan Cummins, Øistein E. Andersen, Ted Briscoe
Some language exams have multiple writing tasks.
no code implementations • ICASSP 2018 • Hainan Xu, Ke Li, Yiming Wang, Jian Wang, Shiyin Kang, Xie Chen, Daniel Povey, Sanjeev Khudanpur
In this paper we describe an extension of the Kaldi software toolkit to support neural-based language modeling, intended for use in automatic speech recognition (ASR) and related tasks.
Ranked #36 on Speech Recognition on LibriSpeech test-other (using extra training data)
Automatic Speech Recognition (ASR) +2
no code implementations • 1 Feb 2018 • Yu Wang, Xie Chen, Mark Gales, Anton Ragni, Jeremy Wong
As the combination approaches become more complicated, the difference between the phonetic and graphemic systems further decreases.
Automatic Speech Recognition (ASR) +1
no code implementations • 18 Aug 2017 • Xie Chen, Xunying Liu, Anton Ragni, Yu Wang, Mark Gales
Instead of using a recurrent unit to capture the complete future word contexts, a feedforward unit is used to model a finite number of succeeding future words.
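The feedforward future unit can be sketched as follows. This is a toy under assumptions, not the paper's model: parameters are random stubs, `context_at` is a hypothetical helper, and the softmax output layer is omitted; it only illustrates combining a recurrent history state with a fixed window of K succeeding words.

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB, DIM, K = 50, 8, 2  # K succeeding words feed the feedforward unit

E = rng.normal(scale=0.1, size=(VOCAB, DIM))        # shared word embeddings
W_fut = rng.normal(scale=0.1, size=(K * DIM, DIM))  # feedforward future unit
W_hist = rng.normal(scale=0.1, size=(DIM, DIM))     # recurrent history unit

def context_at(tokens, t, h_hist):
    """Combine the recurrent history state with a feedforward encoding
    of the K words following position t (zero-padded near the end)."""
    fut = np.zeros(K * DIM)
    for i, tok in enumerate(tokens[t + 1 : t + 1 + K]):
        fut[i * DIM : (i + 1) * DIM] = E[tok]
    return np.tanh(h_hist @ W_hist + fut @ W_fut)

tokens = [4, 17, 3, 25, 9]
h = np.zeros(DIM)
ctx = context_at(tokens, 1, h)  # uses tokens 3 and 25 as future context
print(ctx.shape)  # (8,)
```

Because the future window is fixed at K words rather than an unbounded recurrence, lattice rescoring does not need to expand the full future context.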