Search Results for author: Sheng Zhao

Found 48 papers, 17 papers with code

Large-Scale Automatic Audiobook Creation

no code implementations7 Sep 2023 Brendan Walsh, Mark Hamilton, Greg Newby, Xi Wang, Serena Ruan, Sheng Zhao, Lei He, Shaofei Zhang, Eric Dettinger, William T. Freeman, Markus Weimer

In this work, we present a system that can automatically generate high-quality audiobooks from online e-books.

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023

no code implementations6 Sep 2023 Zhihang Xu, Shaofei Zhang, Xi Wang, Jiajun Zhang, Wenning Wei, Lei He, Sheng Zhao

In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS) system designed for the Blizzard Challenge 2023.

Speech Synthesis

PromptTTS 2: Describing and Generating Voices with Text Prompt

no code implementations5 Sep 2023 Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

TTS approaches based on the text prompt face two challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompt for speech.

Language Modelling Large Language Model

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer

no code implementations9 Aug 2023 Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, Sheng Zhao

With our essential designs on facial style learning, our model is able to flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner.

Style Transfer Talking Face Generation

The detection and rectification for identity-switch based on unfalsified control

no code implementations27 Jul 2023 Junchao Huang, Xiaoqi He, Sheng Zhao

The purpose of multi-object tracking (MOT) is to continuously track and identify objects detected in videos.

Multi-Object Tracking

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

no code implementations3 Jul 2023 Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee

Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency.

An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

no code implementations3 Jul 2023 Sheng Zhao, Qilong Yuan, Yibo Duan, Zhuoyue Chen

The task of synthetic speech generation is to generate language content from a given text, then simulating fake human voice. The key factors that determine the effect of synthetic speech generation mainly include speed of generation, accuracy of word segmentation, naturalness of synthesized speech, etc.

Face Swapping

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

no code implementations18 Apr 2023 Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian

To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor.

Speech Synthesis

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models

no code implementations3 Apr 2023 Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao

Recently, some diffusion-based methods achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio.

Ranked #3 on Audio Generation on AudioCaps (FD metric, using extra training data)

Audio Generation Denoising +1

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

no code implementations30 Mar 2023 Chenpng Du, Qi Chen, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, Jiang Bian

Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly.

Talking Face Generation

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model

no code implementations6 Mar 2023 Ruiqing Xue, Yanqing Liu, Lei He, Xu Tan, Linquan Liu, Edward Lin, Sheng Zhao

Neural text-to-speech (TTS) generally consists of cascaded architecture with separately optimized acoustic model and vocoder, or end-to-end architecture with continuous mel-spectrograms or self-extracted speech frames as the intermediate representations to bridge acoustic model and vocoder, which suffers from two limitations: 1) the continuous acoustic frames are hard to predict with phoneme only, and acoustic information like duration or pitch is also needed to solve the one-to-many problem, which is not easy to scale on large scale and noise datasets; 2) to achieve diverse speech output based on continuous speech features, complex VAE or flow-based models are usually required.

Language Modelling Large Language Model +1

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation

no code implementations22 Feb 2023 Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Sheng Zhao

To solve above limitations, in this paper we propose an improved non-autoregressive (NAR) spelling correction model for contextual biasing in E2E neural transducer-based ASR systems to improve the previous CSC model from two perspectives: Firstly, we incorporate acoustics information with an external attention as well as text hypotheses into CSC to better distinguish target phrase from dissimilar or irrelevant phrases.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

no code implementations30 Dec 2022 Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.


Memories are One-to-Many Mapping Alleviators in Talking Face Generation

no code implementations9 Dec 2022 Anni Tang, Tianyu He, Xu Tan, Jun Ling, Runnan Li, Sheng Zhao, Li Song, Jiang Bian

More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details.

Neural Rendering Talking Face Generation

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

1 code implementation30 Nov 2022 Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei He, Sheng Zhao, Arul Menezes, Jiang Bian

In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech.

Machine Translation speech-recognition +3

PromptTTS: Controllable Text-to-Speech with Text Descriptions

no code implementations22 Nov 2022 Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, Xu Tan

Thus, we develop a text-to-speech (TTS) system (dubbed as PromptTTS) that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech.

Speech Synthesis

MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks

1 code implementation30 Aug 2022 Peiling Lu, Xu Tan, Botao Yu, Tao Qin, Sheng Zhao, Tie-Yan Liu

Specifically, 1) we design an expert system to generate a melody by developing musical elements from motifs to phrases then to sections with repetitions and variations according to pre-given musical form; 2) considering the generated melody is lack of musical richness, we design a Transformer based refinement model to improve the melody without changing its musical form.

Music Generation

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

no code implementations29 Aug 2022 Jun Ling, Xu Tan, Liyang Chen, Runnan Li, Yuchao Zhang, Sheng Zhao, Li Song

In this paper, we conduct systematic analyses on the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and improve the motion stability with a series of effective designs.

Talking Face Generation Video Generation

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion

no code implementations28 Jun 2022 Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo

In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody.

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

1 code implementation30 May 2022 Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu

Combining this novel perspective of two-stage synthesis with advanced generative models (i. e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples.

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

3 code implementations9 May 2022 Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, YuanHao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.

 Ranked #1 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Speech Synthesis Text-To-Speech Synthesis

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

no code implementations1 Apr 2022 Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu

We model the speaker characteristics systematically to improve the generalization on new speakers.

Speech Synthesis

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

no code implementations31 Mar 2022 Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao

However, the works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input.

Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems

1 code implementation2 Mar 2022 Xiaoqiang Wang, Yanqing Liu, Jinyu Li, Veljko Miljanic, Sheng Zhao, Hosam Khalil

In this work, we introduce a novel approach to do contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training

no code implementations8 Feb 2022 Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo Mandic, Lei He, Sheng Zhao

In this paper, we propose InferGrad, a diffusion model for vocoder that incorporates inference process into training, to reduce the inference iterations while maintaining high generation quality.


DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021

1 code implementation25 Oct 2021 Yanqing Liu, Zhihang Xu, Gang Wang, Kuan Chen, Bohan Li, Xu Tan, Jinzhu Li, Lei He, Sheng Zhao

The goal of this challenge is to synthesize natural and high-quality speech from text, and we approach this goal in two perspectives: The first is to directly model and generate waveform in 48 kHz sampling rate, which brings higher perception quality than previous systems with 16 kHz or 24 kHz sampling rate; The second is to model the variation information in speech through a systematic design, which improves the prosody and naturalness.

Speech Synthesis

A study on the efficacy of model pre-training in developing neural text-to-speech system

no code implementations8 Oct 2021 Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan, Sheng Zhao, Tan Lee

However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data.

A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems

no code implementations17 Aug 2021 Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li

We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

no code implementations6 Jul 2021 Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu

While recent text to speech (TTS) models perform very well in synthesizing reading-style (e. g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e. g., podcast or conversation), mainly because of two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech.

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

1 code implementation20 Apr 2021 Yuzi Yan, Xu Tan, Bohan Li, Tao Qin, Sheng Zhao, Yuan Shen, Tie-Yan Liu

In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.

AdaSpeech: Adaptive Text to Speech for Custom Voice

2 code implementations ICLR 2021 Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu

2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation.

Denoising Text to Speech with Frame-Level Noise Modeling

no code implementations17 Dec 2020 Chen Zhang, Yi Ren, Xu Tan, Jinglin Liu, Kejun Zhang, Tao Qin, Sheng Zhao, Tie-Yan Liu

In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model.


LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

no code implementations9 Aug 2020 Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, Tie-Yan Liu

However, there are more than 6, 000 languages in the world and most languages are lack of speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

no code implementations30 Jul 2020 Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong

Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

MultiSpeech: Multi-Speaker Text to Speech with Transformer

1 code implementation8 Jun 2020 Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

Transformer-based text to speech (TTS) model (e. g., Transformer TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) has shown the advantages of training and inference efficiency over RNN-based model (e. g., Tacotron~\cite{shen2018natural}) due to its parallel computation in training and/or inference.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

32 code implementations ICLR 2021 Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e. g., pitch, energy and more accurate duration) as conditional inputs.

Ranked #6 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Knowledge Distillation Speech Synthesis +1

A Study of Non-autoregressive Model for Sequence Generation

no code implementations ACL 2020 Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, Tie-Yan Liu

In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why NAR models can catch up with AR models in some tasks but not all?

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Semantic Mask for Transformer based End-to-End Speech Recognition

1 code implementation6 Dec 2019 Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, Ming Zhou

Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

FastSpeech: Fast, Robust and Controllable Text to Speech

21 code implementations NeurIPS 2019 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.

Ranked #10 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Speech Synthesis Text-To-Speech Synthesis

FastSpeech: Fast,Robustand Controllable Text-to-Speech

11 code implementations22 May 2019 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i. e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control).

Text-To-Speech Synthesis

Almost Unsupervised Text to Speech and Automatic Speech Recognition

no code implementations13 May 2019 Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing and both achieve impressive performance thanks to the recent advance in deep learning and large amount of aligned speech and text data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Neural Speech Synthesis with Transformer Network

5 code implementations19 Sep 2018 Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs).

Ranked #9 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Machine Translation NMT +2

Cannot find the paper you are looking for? You can Submit a new open access paper.