no code implementations • IWSLT (ACL) 2022 • Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Dan Liu, Junhua Liu, LiRong Dai
This paper describes USTC-NELSLIP’s submissions to the IWSLT 2022 Offline Speech Translation task, including speech translation of talks from English to German, English to Chinese and English to Japanese.
no code implementations • 16 Oct 2024 • Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang, Liping Chen, LiRong Dai
This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism that directly translates lyrical and melodic cues into expressive and high-fidelity human-like singing.
no code implementations • 26 Sep 2024 • Shifu Xiong, Mengzhi Wang, Genshun Wan, Hang Chen, Jianqing Gao, LiRong Dai
In this work, we propose deep CLAS to make better use of contextual information.
Automatic Speech Recognition (ASR) +4
no code implementations • 22 Aug 2024 • Shihao Chen, Yu Gu, Jianwei Cui, Jie Zhang, Rilin Chen, LiRong Dai
We achieve one-step or few-step inference while maintaining high performance by distilling a pre-trained LDM-based SVC model, which has the advantages of timbre decoupling and high sound quality.
no code implementations • 8 Jun 2024 • Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, LiRong Dai
We pretrain a variational autoencoder using the open-source So-VITS-SVC project, which is based on the VITS framework; the pretrained model is then used for the LDM training.
no code implementations • 22 Jan 2024 • Shihao Chen, Liping Chen, Jie Zhang, KongAik Lee, ZhenHua Ling, LiRong Dai
For validation, we employ the open-source pre-trained YourTTS model for speech generation and protect the target speaker's speech in the white-box scenario.
1 code implementation • 7 Jan 2024 • Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, LiRong Dai
Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs.
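The fusion of multichannel audio and video inputs described above can be illustrated with a minimal sketch (an assumed simplification for illustration, not the AV-wav2vec2 code): each modality is projected into a shared dimension and concatenated along time-aligned frames before entering a shared encoder.

```python
import numpy as np

# Hypothetical early-fusion step: names and shapes are illustrative only.
def fuse_av(audio, video, w_a, w_v):
    """audio: (T, C*F) stacked multichannel audio features
       video: (T, V) per-frame visual (e.g. lip-region) features
       w_a, w_v: learned projection matrices into a shared model dimension."""
    a = audio @ w_a                          # project multichannel audio
    v = video @ w_v                          # project video features
    return np.concatenate([a, v], axis=-1)   # joint input to the encoder
```

In practice the projections would be learned jointly with the self-supervised objective; this sketch only shows how the two streams meet.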
Audio-Visual Speech Recognition • Automatic Speech Recognition +7
no code implementations • 28 Aug 2023 • Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, LiRong Dai, Jie Zhang
Noise-robust TTS models are often trained on enhanced speech, which suffers from speech distortion and residual background noise that degrade the quality of the synthesized speech.
no code implementations • ICCV 2023 • Shan He, Haonan He, Shuo Yang, Xiaoyan Wu, Pengcheng Xia, Bing Yin, Cong Liu, LiRong Dai, Chang Xu
Moreover, we verify that the proposed framework can explicitly control the emotion of the animated talking face.
no code implementations • 21 Nov 2022 • Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, LiRong Dai, Daxin Jiang, Jinyu Li, Furu Wei
Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision and text.
2 code implementations • 7 Oct 2022 • Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, LiRong Dai, Jinyu Li, Furu Wei
The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods.
Automatic Speech Recognition (ASR) +2
1 code implementation • 30 Sep 2022 • Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, LiRong Dai, Jinyu Li, Furu Wei
In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation.
no code implementations • 2 Sep 2022 • Jinshui Hu, Chenyu Liu, Qiandong Yan, Xuyang Zhu, Jiajia Wu, Jun Du, LiRong Dai
However, in real-world scenarios, out-of-vocabulary (OOV) words are of great importance and SOTA recognition models usually perform poorly on OOV settings.
1 code implementation • 20 May 2022 • Dianhai Yu, Liang Shen, Hongxiang Hao, Weibao Gong, HuaChao Wu, Jiang Bian, LiRong Dai, Haoyi Xiong
For scalable inference in a single node, especially when the model size is larger than GPU memory, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
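The round-robin execution over a ring of memory sections can be sketched as follows (a hypothetical simplification, not the MoESys implementation; `load` and `compute` stand in for staging a parameter shard into GPU memory and applying it):

```python
from collections import deque

# Illustrative sketch: cycle through model sections too large for GPU memory,
# holding only one section on the device at a time.
def ring_inference(sections, load, compute, x):
    """sections: ordered list of parameter shards (e.g. expert groups)
       load:     moves one section into device memory
       compute:  applies a loaded section to the activation x"""
    ring = deque(sections)
    for _ in range(len(sections)):
        section = ring[0]
        device_section = load(section)   # stage this section onto the device
        x = compute(device_section, x)   # run this slice of the model
        ring.rotate(-1)                  # advance the ring to the next section
    return x
```

A real system would overlap `load` for the next section with `compute` on the current one; the ring order above only captures the scheduling idea.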
1 code implementation • 31 Mar 2022 • Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, LiRong Dai, Jinyu Li, Yao Qian, Furu Wei
In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text.
Automatic Speech Recognition (ASR) +5
no code implementations • 29 Sep 2021 • Chao Xing, Dong Wang, LiRong Dai, Qun Liu, Anderson Avila
Overparameterized transformer-based architectures have shown remarkable performance in recent years, achieving state-of-the-art results in speech processing tasks such as speech recognition, speech synthesis, keyword spotting, and speech enhancement.
no code implementations • ACL (IWSLT) 2021 • Dan Liu, Mengge Du, Xiaoxi Li, Yuchen Hu, LiRong Dai
This paper describes USTC-NELSLIP's submissions to the IWSLT 2021 Simultaneous Speech Translation task.
no code implementations • 28 Mar 2019 • Lanhua You, Wu Guo, LiRong Dai, Jun Du
X-vector-based deep neural network (DNN) embedding systems have demonstrated their effectiveness for text-independent speaker verification.
1 code implementation • Pattern Recognition 2017 • Jianshu Zhang, Jun Du, Shiliang Zhang, Dan Liu, Yulong Hu, Jinshui Hu, Si Wei, LiRong Dai
We employ a convolutional neural network encoder that takes handwritten mathematical expression (HME) images as input (the watcher), and a recurrent neural network decoder equipped with an attention mechanism (the parser) to generate LaTeX sequences.
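The core of the parser is its attention over the watcher's feature grid; a minimal sketch of one attention step (an assumed dot-product simplification, not the paper's coverage-based implementation) is:

```python
import numpy as np

# Hypothetical attention step: the decoder state queries the CNN feature grid.
def attend(features, query):
    """features: (L, D) flattened CNN feature vectors, one per image location
       query:    (D,) current decoder hidden state"""
    scores = features @ query                 # alignment score per location
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    context = weights @ features              # (D,) context vector
    return context, weights
```

At each step the context vector is fed to the recurrent decoder, which emits the next LaTeX token; the attention weights indicate which image region the parser is reading.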