Search Results for author: LiRong Dai

Found 19 papers, 6 papers with code

The USTC-NELSLIP Offline Speech Translation Systems for IWSLT 2022

no code implementations IWSLT (ACL) 2022 Weitai Zhang, Zhongyi Ye, Haitao Tang, Xiaoxi Li, Xinyuan Zhou, Jing Yang, Jianwei Cui, Dan Liu, Junhua Liu, LiRong Dai

This paper describes USTC-NELSLIP’s submissions to the IWSLT 2022 Offline Speech Translation task, including speech translation of talks from English to German, English to Chinese and English to Japanese.

Translation

SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model

no code implementations16 Oct 2024 Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang, Liping Chen, LiRong Dai

This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism that directly translates lyrical and melodic cues into expressive and high-fidelity human-like singing.

Decoder · Singing Voice Synthesis

LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation

no code implementations22 Aug 2024 Shihao Chen, Yu Gu, Jianwei Cui, Jie Zhang, Rilin Chen, LiRong Dai

We achieve one-step or few-step inference while maintaining high performance by distilling a pre-trained LDM-based SVC model, which retains its advantages in timbre decoupling and sound quality.

Voice Conversion

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

no code implementations8 Jun 2024 Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, LiRong Dai

We pretrain a variational autoencoder structure using the well-known open-source So-VITS-SVC project based on the VITS framework, which is then used for the LDM training.

Voice Conversion

Adversarial speech for voice privacy protection from Personalized Speech generation

no code implementations22 Jan 2024 Shihao Chen, Liping Chen, Jie Zhang, KongAik Lee, ZhenHua Ling, LiRong Dai

For validation, we employ the open-source pre-trained YourTTS model for speech generation and protect the target speaker's speech in the white-box scenario.

Speaker Verification · Text to Speech +1

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

1 code implementation7 Jan 2024 Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, LiRong Dai

Considering that visual information helps improve speech recognition performance in noisy scenes, in this work we propose AV-wav2vec2, a multichannel multi-modal speech self-supervised learning framework that utilizes video and multichannel audio data as inputs.

Audio-Visual Speech Recognition · Automatic Speech Recognition +7

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

no code implementations28 Aug 2023 Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, LiRong Dai, Jie Zhang

Noise-robust TTS models are often trained on enhanced speech, which suffers from speech distortion and residual background noise that affect the quality of the synthesized speech.

Speech Enhancement · Text to Speech

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

no code implementations21 Nov 2022 Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, LiRong Dai, Daxin Jiang, Jinyu Li, Furu Wei

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision and text.

Audio-Visual Speech Recognition · Language Modelling +4

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

1 code implementation30 Sep 2022 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, LiRong Dai, Jinyu Li, Furu Wei

In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation.

Language Modelling · Speech Recognition +1

Vision-Language Adaptive Mutual Decoder for OOV-STR

no code implementations2 Sep 2022 Jinshui Hu, Chenyu Liu, Qiandong Yan, Xuyang Zhu, Jiajia Wu, Jun Du, LiRong Dai

However, in real-world scenarios, out-of-vocabulary (OOV) words are of great importance and SOTA recognition models usually perform poorly on OOV settings.

Decoder · Language Modelling +2

MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

1 code implementation20 May 2022 Dianhai Yu, Liang Shen, Hongxiang Hao, Weibao Gong, HuaChao Wu, Jiang Bian, LiRong Dai, Haoyi Xiong

For scalable inference in a single node, especially when the model size is larger than GPU memory, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
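The round-robin traversal of memory sections described above can be illustrated with a minimal sketch (the section names, the ring data structure, and the load/compute step are hypothetical simplifications, not MoESys's actual API):

```python
from collections import deque

def round_robin_inference(sections, num_steps):
    """Cycle through model sections arranged in a ring, simulating the
    pattern of executing computation on whichever section is resident
    while the ring advances to the next one."""
    ring = deque(sections)
    order = []
    for _ in range(num_steps):
        section = ring[0]      # section currently loaded for computation
        order.append(section)  # run this section's computation task
        ring.rotate(-1)        # advance the ring to the next section
    return order

# Example: three memory sections visited in round-robin order
print(round_robin_inference(["s0", "s1", "s2"], 5))
# → ['s0', 's1', 's2', 's0', 's1']
```

The ring structure is what lets a model larger than GPU memory be served: only one section needs to be resident at a time while the others wait in CPU memory.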

Distributed Computing

Speech-MLP: a simple MLP architecture for speech processing

no code implementations29 Sep 2021 Chao Xing, Dong Wang, LiRong Dai, Qun Liu, Anderson Avila

Overparameterized transformer-based architectures have shown remarkable performance in recent years, achieving state-of-the-art results in speech processing tasks such as speech recognition, speech synthesis, keyword spotting, and speech enhancement.

Keyword Spotting · Speech Enhancement +3

Multi-Task Learning with High-Order Statistics for X-vector based Text-Independent Speaker Verification

no code implementations28 Mar 2019 Lanhua You, Wu Guo, LiRong Dai, Jun Du

The x-vector based deep neural network (DNN) embedding systems have demonstrated effectiveness for text-independent speaker verification.

Multi-Task Learning · Text-Independent Speaker Verification

Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition

1 code implementation Pattern Recognition 2017 Jianshu Zhang, Jun Du, Shiliang Zhang, Dan Liu, Yulong Hu, Jinshui Hu, Si Wei, LiRong Dai

We employ a convolutional neural network encoder (the watcher) that takes HME images as input, and a recurrent neural network decoder equipped with an attention mechanism (the parser) to generate LaTeX sequences.
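The "attend" step of this watcher-parser pipeline can be sketched with a toy additive-attention computation (a minimal numpy sketch; the weight names and dimensions here are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def additive_attention(decoder_state, encoder_features, Wd, We, v):
    """One attention step: score each encoder position against the
    current decoder state, then form a context vector as a weighted
    sum of encoder features (Bahdanau-style additive attention)."""
    scores = np.tanh(encoder_features @ We + decoder_state @ Wd) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over encoder positions
    context = weights @ encoder_features  # weighted sum = context vector
    return context, weights

# Toy sizes: 4 encoder positions, feature dim 3, attention dim 5
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 3))     # encoder features from the "watcher"
dec = rng.normal(size=(3,))       # current decoder ("parser") state
We, Wd, v = rng.normal(size=(3, 5)), rng.normal(size=(3, 5)), rng.normal(size=(5,))
ctx, w = additive_attention(dec, enc, Wd, We, v)
print(w.sum())  # attention weights sum to 1
```

At each decoding step the parser would feed the context vector into its recurrent cell to emit the next LaTeX token; the attention weights indicate which image regions the model is "watching".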

Decoder · Handwritten Mathematical Expression Recognition
