Search Results for author: Xu Tan

Found 113 papers, 48 papers with code

Non-Autoregressive Sequence Generation

no code implementations ACL 2022 Jiatao Gu, Xu Tan

Non-autoregressive sequence generation (NAR) attempts to generate the entire or partial output sequences in parallel to speed up the generation process and avoid potential issues (e. g., label bias, exposure bias) in autoregressive generation.

Machine Translation With Weakly Paired Bilingual Documents

no code implementations ICLR 2019 Lijun Wu, Jinhua Zhu, Di He, Fei Gao, Xu Tan, Tao Qin, Tie-Yan Liu

Neural machine translation, which achieves near human-level performance in some languages, strongly relies on the availability of large amounts of parallel sentences, which hinders its applicability to low-resource language pairs.

Translation Unsupervised Machine Translation

ProphetChat: Enhancing Dialogue Generation with Simulation of Future Conversation

no code implementations ACL 2022 Chang Liu, Xu Tan, Chongyang Tao, Zhenxin Fu, Dongyan Zhao, Tie-Yan Liu, Rui Yan

To enable the chatbot to foresee the dialogue future, we design a beam-search-like roll-out strategy for dialogue future simulation using a typical dialogue generation model and a dialogue selector.

Dialogue Generation Response Generation

TranSFormer: Slow-Fast Transformer for Machine Translation

no code implementations26 May 2023 Bei Li, Yi Jing, Xu Tan, Zhen Xing, Tong Xiao, Jingbo Zhu

Learning multiscale Transformer models has been evidenced as a viable approach to augmenting machine translation systems.

Machine Translation Translation

NAS-FM: Neural Architecture Search for Tunable and Interpretable Sound Synthesis based on Frequency Modulation

no code implementations22 May 2023 Zhen Ye, Wei Xue, Xu Tan, Qifeng Liu, Yike Guo

Since expert knowledge is hard to acquire, it hinders the flexibility to quickly design and tune digital synthesizers for diverse sounds.

Neural Architecture Search

DiffusionNER: Boundary Diffusion for Named Entity Recognition

1 code implementation22 May 2023 Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang

In this paper, we propose DiffusionNER, which formulates the named entity recognition task as a boundary-denoising diffusion process and thus generates named entities from noisy spans.

Denoising named-entity-recognition +2

GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework

1 code implementation18 May 2023 Ang Lv, Xu Tan, Peiling Lu, Wei Ye, Shikun Zhang, Jiang Bian, Rui Yan

With separate tracks in GETScore and the non-autoregressive behavior of the model, GETMusic can explicitly control the generation of any target tracks from scratch or conditioning on source tracks.

Denoising Music Generation

CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

no code implementations11 May 2023 Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo

In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step while achieving high audio quality.

Denoising Singing Voice Synthesis +1

ResiDual: Transformer with Dual Residual Connections

1 code implementation28 Apr 2023 Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan

In this paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections in Post-LN and Pre-LN together and inherits their advantages while avoids their limitations.

Machine Translation

CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval

2 code implementations21 Apr 2023 Shangda Wu, Dingyao Yu, Xu Tan, Maosong Sun

We introduce CLaMP: Contrastive Language-Music Pre-training, which learns cross-modal representations between natural language and symbolic music using a music encoder and a text encoder trained jointly with a contrastive loss.

Data Augmentation Information Retrieval +4

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

no code implementations18 Apr 2023 Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian

To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor.

Speech Synthesis

AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models

no code implementations3 Apr 2023 Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao

Recently, some diffusion-based methods achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio.

 Ranked #1 on Audio Generation on AudioCaps (FD metric, using extra training data)

Audio Generation Denoising +1

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

3 code implementations30 Mar 2023 Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang

Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence.

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

no code implementations30 Mar 2023 Chenpng Du, Qi Chen, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, Jiang Bian

Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly.

Talking Face Generation

HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details

no code implementations20 Mar 2023 Zenghao Chai, Tianke Zhang, Tianyu He, Xu Tan, Tadas Baltrusaitis, HsiangTao Wu, Runnan Li, Sheng Zhao, Chun Yuan, Jiang Bian

3D Morphable Models (3DMMs) demonstrate great potential for reconstructing faithful and animatable 3D facial surfaces from a single image.

3D Face Reconstruction

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model

no code implementations6 Mar 2023 Ruiqing Xue, Yanqing Liu, Lei He, Xu Tan, Linquan Liu, Edward Lin, Sheng Zhao

Neural text-to-speech (TTS) generally consists of cascaded architecture with separately optimized acoustic model and vocoder, or end-to-end architecture with continuous mel-spectrograms or self-extracted speech frames as the intermediate representations to bridge acoustic model and vocoder, which suffers from two limitations: 1) the continuous acoustic frames are hard to predict with phoneme only, and acoustic information like duration or pitch is also needed to solve the one-to-many problem, which is not easy to scale on large scale and noise datasets; 2) to achieve diverse speech output based on continuous speech features, complex VAE or flow-based models are usually required.

Language Modelling Speech Synthesis

A Study on ReLU and Softmax in Transformer

no code implementations13 Feb 2023 Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian

This paper sheds light on the following points: 1) Softmax and ReLU use different normalization methods over elements which lead to different variances of results, and ReLU is good at dealing with a large number of key-value slots; 2) FFN and key-value memory are equivalent, and thus the Transformer can be viewed as a memory network where FFNs and self-attention networks are both key-value memories.

Document Translation

N-Gram Nearest Neighbor Machine Translation

no code implementations30 Jan 2023 Rui Lv, Junliang Guo, Rui Wang, Xu Tan, Qi Liu, Tao Qin

Nearest neighbor machine translation augments the Autoregressive Translation~(AT) with $k$-nearest-neighbor retrieval, by comparing the similarity between the token-level context representations of the target tokens in the query and the datastore.

Domain Adaptation Machine Translation +2

ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models

no code implementations30 Jan 2023 Shengmeng Li, Luping Liu, Zenghao Chai, Runnan Li, Xu Tan

Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower error in the estimated noise.

Denoising Image Generation

Regeneration Learning: A Learning Paradigm for Data Generation

no code implementations21 Jan 2023 Xu Tan, Tao Qin, Jiang Bian, Tie-Yan Liu, Yoshua Bengio

Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y') of the target data Y for data generation while traditional representation learning handles the abstraction (X') of source data X for data understanding; 2) both the processes of Y'-->Y in regeneration learning and X-->X' in representation learning can be learned in a self-supervised way (e. g., pre-training); 3) both the mappings from X to Y' in regeneration learning and from X' to Y in representation learning are simpler than the direct mapping from X to Y.

Image Generation Representation Learning +6

WuYun: Exploring hierarchical skeleton-guided melody generation using knowledge-enhanced deep learning

1 code implementation11 Jan 2023 Kejun Zhang, Xinda Wu, Tieyao Zhang, Zhijie Huang, Xu Tan, Qihao Liang, Songruoyao Wu, Lingyun Sun

Although deep learning has revolutionized music generation, existing methods for structured melody generation follow an end-to-end left-to-right note-by-note generative paradigm and treat each note equally.

Music Generation

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

no code implementations30 Dec 2022 Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.


Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation

no code implementations19 Dec 2022 Zhujin Gao, Junliang Guo, Xu Tan, Yongxin Zhu, Fang Zhang, Jiang Bian, Linli Xu

Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space.

Denoising Machine Translation +2

Memories are One-to-Many Mapping Alleviators in Talking Face Generation

no code implementations9 Dec 2022 Anni Tang, Tianyu He, Xu Tan, Jun Ling, Runnan Li, Sheng Zhao, Li Song, Jiang Bian

More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details.

Neural Rendering Talking Face Generation

SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition

1 code implementation2 Dec 2022 Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu

In this paper, we propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

1 code implementation30 Nov 2022 Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei He, Sheng Zhao, Arul Menezes, Jiang Bian

In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech.

Machine Translation speech-recognition +3

Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction

1 code implementation23 Nov 2022 Kai Shen, Yichong Leng, Xu Tan, Siliang Tang, Yuan Zhang, Wenjie Liu, Edward Lin

Since the error rate of the incorrect sentence is usually low (e. g., 10\%), the correction model can only learn to correct on limited error tokens but trivially copy on most tokens (correct tokens), which harms the effective training of error correction.

speech-recognition Speech Recognition

PromptTTS: Controllable Text-to-Speech with Text Descriptions

no code implementations22 Nov 2022 Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, Xu Tan

Thus, we develop a text-to-speech (TTS) system (dubbed as PromptTTS) that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech.

Speech Synthesis

Towards Understanding Omission in Dialogue Summarization

1 code implementation14 Nov 2022 Yicheng Zou, Kaitao Song, Xu Tan, Zhongkai Fu, Qi Zhang, Dongsheng Li, Tao Gui

By analyzing this dataset, we find that a large improvement in summarization quality can be achieved by providing ground-truth omission labels for the summarization model to recover omission information, which demonstrates the importance of omission detection for omission mitigation in dialogue summarization.

Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation

1 code implementation19 Oct 2022 Botao Yu, Peiling Lu, Rui Wang, Wei Hu, Xu Tan, Wei Ye, Shikun Zhang, Tao Qin, Tie-Yan Liu

A recent trend is to use Transformer or its variants in music generation, which is, however, suboptimal, because the full attention cannot efficiently model the typically long music sequences (e. g., over 10, 000 tokens), and the existing models have shortcomings in generating musical repetition structures.

Music Generation

MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks

1 code implementation30 Aug 2022 Peiling Lu, Xu Tan, Botao Yu, Tao Qin, Sheng Zhao, Tie-Yan Liu

Specifically, 1) we design an expert system to generate a melody by developing musical elements from motifs to phrases then to sections with repetitions and variations according to pre-given musical form; 2) considering the generated melody is lack of musical richness, we design a Transformer based refinement model to improve the melody without changing its musical form.

Music Generation

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

no code implementations29 Aug 2022 Jun Ling, Xu Tan, Liyang Chen, Runnan Li, Yuchao Zhang, Sheng Zhao, Li Song

In this paper, we conduct systematic analyses on the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and improve the motion stability with a series of effective designs.

Talking Face Generation Video Generation

Re-creation of Creations: A New Paradigm for Lyric-to-Melody Generation

1 code implementation11 Aug 2022 Ang Lv, Xu Tan, Tao Qin, Tie-Yan Liu, Rui Yan

These characteristics cannot be well handled by neural generation models that learn lyric-to-melody mapping in an end-to-end way, due to several issues: (1) lack of aligned lyric-melody training data to sufficiently learn lyric-melody feature alignment; (2) lack of controllability in generation to better and explicitly align the lyric-melody features.

Language Modelling Retrieval

A Study of Syntactic Multi-Modality in Non-Autoregressive Machine Translation

no code implementations NAACL 2022 Kexun Zhang, Rui Wang, Xu Tan, Junliang Guo, Yi Ren, Tao Qin, Tie-Yan Liu

Furthermore, we take the best of both and design a new loss function to better handle the complicated syntactic multi-modality in real-world datasets.

Machine Translation Translation

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

1 code implementation30 May 2022 Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu

Combining this novel perspective of two-stage synthesis with advanced generative models (i. e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples.

Transcormer: Transformer for Sentence Scoring with Sliding Language Modeling

no code implementations25 May 2022 Kaitao Song, Yichong Leng, Xu Tan, Yicheng Zou, Tao Qin, Dongsheng Li

Previous works on sentence scoring mainly adopted either causal language modeling (CLM) like GPT or masked language modeling (MLM) like BERT, which have some limitations: 1) CLM only utilizes unidirectional information for the probability estimation of a sentence without considering bidirectional context, which affects the scoring quality; 2) MLM can only estimate the probability of partial tokens at a time and thus requires multiple forward passes to estimate the probability of the whole sentence, which incurs large computation and time cost.

Causal Language Modeling Language Modelling +1

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

2 code implementations9 May 2022 Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, YuanHao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.

Speech Synthesis Text-To-Speech Synthesis

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

no code implementations1 Apr 2022 Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu

We model the speaker characteristics systematically to improve the generalization on new speakers.

Speech Synthesis

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

no code implementations31 Mar 2022 Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao

However, the works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input.

Revisiting Over-Smoothness in Text to Speech

no code implementations ACL 2022 Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods.

InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training

no code implementations8 Feb 2022 Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo Mandic, Lei He, Sheng Zhao

In this paper, we propose InferGrad, a diffusion model for vocoder that incorporates inference process into training, to reduce the inference iterations while maintaining high generation quality.


Speech-T: Transducer for Text to Speech and Beyond

no code implementations NeurIPS 2021 Jiawei Chen, Xu Tan, Yichong Leng, Jin Xu, Guihua Wen, Tao Qin, Tie-Yan Liu

Experiments on LJSpeech datasets demonstrate that Speech-T 1) is more robust than the attention based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of joint modeling TTS and ASR in a single network.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021

1 code implementation25 Oct 2021 Yanqing Liu, Zhihang Xu, Gang Wang, Kuan Chen, Bohan Li, Xu Tan, Jinzhu Li, Lei He, Sheng Zhao

The goal of this challenge is to synthesize natural and high-quality speech from text, and we approach this goal in two perspectives: The first is to directly model and generate waveform in 48 kHz sampling rate, which brings higher perception quality than previous systems with 16 kHz or 24 kHz sampling rate; The second is to model the variation information in speech through a systematic design, which improves the prosody and naturalness.

Speech Synthesis

A study on the efficacy of model pre-training in developing neural text-to-speech system

no code implementations8 Oct 2021 Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan, Sheng Zhao, Tan Lee

However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data.

FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition

1 code implementation Findings (EMNLP) 2021 Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Tao Qin, Xiang-Yang Li, Edward Lin, Tie-Yan Liu

Although multiple candidates are generated by an ASR system through beam search, current error correction approaches can only correct one sentence at a time, failing to leverage the voting effect from multiple candidates to better detect and correct error tokens.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method

1 code implementation20 Sep 2021 Zeqian Ju, Peiling Lu, Xu Tan, Rui Wang, Chen Zhang, Songruoyao Wu, Kejun Zhang, Xiangyang Li, Tao Qin, Tie-Yan Liu

In this paper, we develop TeleMelody, a two-stage lyric-to-melody generation system with music template (e. g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies (i. e., the system consists of a lyric-to-template module and a template-to-melody module).

Analyzing and Mitigating Interference in Neural Architecture Search

no code implementations29 Aug 2021 Jin Xu, Xu Tan, Kaitao Song, Renqian Luo, Yichong Leng, Tao Qin, Tie-Yan Liu, Jian Li

In this paper, we investigate the interference issue by sampling different child models and calculating the gradient similarity of shared operators, and observe: 1) the interference on a shared operator between two child models is positively correlated with the number of different operators; 2) the interference is smaller when the inputs and outputs of the shared operator are more similar.

Neural Architecture Search Reading Comprehension

A Survey on Low-Resource Neural Machine Translation

no code implementations9 Jul 2021 Rui Wang, Xu Tan, Renqian Luo, Tao Qin, Tie-Yan Liu

Neural approaches have achieved state-of-the-art accuracy on machine translation but suffer from the high cost of collecting large scale parallel data.

Low-Resource Neural Machine Translation NMT +1

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

no code implementations6 Jul 2021 Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu

While recent text to speech (TTS) models perform very well in synthesizing reading-style (e. g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e. g., podcast or conversation), mainly because of two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech.

DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling

1 code implementation ACL 2021 Lanqing Xue, Kaitao Song, Duocai Wu, Xu Tan, Nevin L. Zhang, Tao Qin, Wei-Qiang Zhang, Tie-Yan Liu

In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms.

Language Modelling

A Survey on Neural Speech Synthesis

1 code implementation29 Jun 2021 Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry.

Speech Synthesis

MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training

2 code implementations Findings (ACL) 2021 Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, Tie-Yan Liu

Inspired by the success of pre-training models in natural language processing, in this paper, we develop MusicBERT, a large-scale pre-trained model for music understanding.

Classification Emotion Classification +2

FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition

1 code implementation NeurIPS 2021 Yichong Leng, Xu Tan, Linchen Zhu, Jin Xu, Renqian Luo, Linquan Liu, Tao Qin, Xiang-Yang Li, Ed Lin, Tie-Yan Liu

A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of significantly increased ASR error rate.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

A Rational Inattention Theory of Echo Chamber

no code implementations21 Apr 2021 Lin Hu, Anqi Li, Xu Tan

Finite players allocate limited attention capacities between biased primary sources and the other players in order to gather information about an uncertain state.

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

1 code implementation20 Apr 2021 Yuzi Yan, Xu Tan, Bohan Li, Tao Qin, Sheng Zhao, Yuan Shen, Tie-Yan Liu

In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.

Adaptive Logit Adjustment Loss for Long-Tailed Visual Recognition

no code implementations13 Apr 2021 Yan Zhao, Weicong Chen, Xu Tan, Kai Huang, Jihong Zhu

The adaptive adjusting term is composed of two complementary factors: 1) quantity factor, which pays more attention to tail classes, and 2) difficulty factor, which adaptively pays more attention to hard instances in the training process.

General Classification Semantic Similarity +1

AdaSpeech: Adaptive Text to Speech for Custom Voice

2 code implementations ICLR 2021 Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu

2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation.

MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition

no code implementations25 Feb 2021 Linghui Meng, Jin Xu, Xu Tan, Jindong Wang, Tao Qin, Bo Xu

In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

2 code implementations ICLR 2021 Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, Shi Gu

To further employ the power of quantization, the mixed precision technique is incorporated in our framework by approximating the inter-layer and intra-layer sensitivity.

Image Classification object-detection +2

Task-Agnostic and Adaptive-Size BERT Compression

no code implementations1 Jan 2021 Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Li Jian, Tao Qin, Tie-Yan Liu

NAS-BERT trains a big supernet on a carefully designed search space containing various architectures and outputs multiple compressed models with adaptive sizes and latency.

Language Modelling Model Compression +1

Denoising Text to Speech with Frame-Level Noise Modeling

no code implementations17 Dec 2020 Chen Zhang, Yi Ren, Xu Tan, Jinglin Liu, Kejun Zhang, Tao Qin, Sheng Zhao, Tie-Yan Liu

In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model.


SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint

1 code implementation9 Dec 2020 Zhonghao Sheng, Kaitao Song, Xu Tan, Yi Ren, Wei Ye, Shikun Zhang, Tao Qin

Automatic song writing aims to compose a song (lyric and/or melody) by machine, which is an interesting topic in both academia and industry.

Speech enhancement aided end-to-end multi-task learning for voice activity detection

no code implementations23 Oct 2020 Xu Tan, Xiao-Lei Zhang

Recent studies show that speech enhancement is helpful to VAD, but the performance improvement is limited.

Action Detection Activity Detection +3

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

1 code implementation3 Sep 2020 Jiawei Chen, Xu Tan, Jian Luan, Tao Qin, Tie-Yan Liu

To tackle the difficulty of singing modeling caused by high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and vocoder to improve singing modeling.

Singing Voice Synthesis Vocal Bursts Intensity Prediction

PopMAG: Pop Music Accompaniment Generation

1 code implementation18 Aug 2020 Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks.

Music Modeling

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

no code implementations9 Aug 2020 Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, Tie-Yan Liu

However, there are more than 6, 000 languages in the world and most languages are lack of speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Neural Machine Translation with Error Correction

1 code implementation21 Jul 2020 Kaitao Song, Xu Tan, Jianfeng Lu

Neural machine translation (NMT) generates the next target token given as input the previous ground truth target tokens during training while the previous generated target tokens during inference, which causes discrepancy between training and inference as well as error propagation, and affects the translation accuracy.

Machine Translation NMT +1

Accuracy Prediction with Non-neural Model for Neural Architecture Search

1 code implementation9 Jul 2020 Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, Tie-Yan Liu

Considering that most architectures are represented as sequences of discrete symbols which are more like tabular data and preferred by non-neural predictors, in this paper, we study an alternative approach which uses non-neural model for accuracy prediction.

Neural Architecture Search

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

no code implementations9 Jul 2020 Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu

DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human efforts for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and multiple singers.

Singing Voice Synthesis

SimulSpeech: End-to-End Simultaneous Speech to Text Translation

no code implementations ACL 2020 Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu

In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in source language to text in target language concurrently.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +6

UWSpeech: Speech to Speech Translation for Unwritten Languages

no code implementations14 Jun 2020 Chen Zhang, Xu Tan, Yi Ren, Tao Qin, Ke-jun Zhang, Tie-Yan Liu

Existing speech to speech translation systems heavily rely on the text of target language: they usually translate source language either to target text and then synthesize target speech from text, or directly to target speech with target text for auxiliary training.

speech-recognition Speech Recognition +2

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

no code implementations11 Jun 2020 Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou

This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0 and duration modeling.

Singing Voice Synthesis Vocal Bursts Intensity Prediction

MultiSpeech: Multi-Speaker Text to Speech with Transformer

1 code implementation8 Jun 2020 Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

Transformer-based text to speech (TTS) model (e. g., Transformer TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) has shown the advantages of training and inference efficiency over RNN-based model (e. g., Tacotron~\cite{shen2018natural}) due to its parallel computation in training and/or inference.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

31 code implementations ICLR 2021 Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e. g., pitch, energy and more accurate duration) as conditional inputs.

Ranked #6 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Knowledge Distillation Speech Synthesis +1

LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning

no code implementations27 Apr 2020 Kaitao Song, Hao Sun, Xu Tan, Tao Qin, Jianfeng Lu, Hongzhi Liu, Tie-Yan Liu

While pre-training and fine-tuning, e. g., BERT~\citep{devlin2018bert}, GPT-2~\citep{radford2019language}, have achieved great success in language understanding and generation tasks, the pre-trained models are usually too big for online deployment in terms of both memory cost and inference speed, which hinders them from practical online usage.

Knowledge Distillation Language Modelling

A Study of Non-autoregressive Model for Sequence Generation

no code implementations ACL 2020 Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, Tie-Yan Liu

In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why NAR models can catch up with AR models in some tasks but not all?

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

MPNet: Masked and Permuted Pre-training for Language Understanding

6 code implementations NeurIPS 2020 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem.

Language Modelling Masked Language Modeling

A Study of Multilingual Neural Machine Translation

no code implementations25 Dec 2019 Xu Tan, Yichong Leng, Jiale Chen, Yi Ren, Tao Qin, Tie-Yan Liu

Multilingual neural machine translation (NMT) has recently been investigated from different aspects (e. g., pivot translation, zero-shot translation, fine-tuning, or training from scratch) and in different settings (e. g., rich resource and low resource, one-to-many, and many-to-one translation).

Machine Translation NMT +1

Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation

2 code implementations20 Nov 2019 Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, Tie-Yan Liu

Non-autoregressive translation (NAT) models remove the dependence on previous target tokens and generate all target tokens in parallel, resulting in significant inference speedup but at the cost of inferior translation accuracy compared to autoregressive translation (AT) models.

Machine Translation Translation

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

3 code implementations24 Oct 2019 Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan

Furthermore, the unified design enables the integration of ASR functions with TTS, e. g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Efficient Bidirectional Neural Machine Translation

no code implementations25 Aug 2019 Xu Tan, Yingce Xia, Lijun Wu, Tao Qin

In this paper, we propose an efficient method to generate a sequence in both left-to-right and right-to-left manners using a single encoder and decoder, combining the advantages of both generation directions.

Machine Translation Translation

Multilingual Neural Machine Translation with Language Clustering

no code implementations IJCNLP 2019 Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, Tie-Yan Liu

We study two methods for language clustering: (1) using prior knowledge, where we cluster languages according to language family, and (2) using language embedding, in which we represent each language by an embedding vector and cluster them in the embedding space.

Machine Translation NMT +1

Language Graph Distillation for Low-Resource Machine Translation

no code implementations17 Aug 2019 Tianyu He, Jiale Chen, Xu Tan, Tao Qin

Neural machine translation on low-resource language is challenging due to the lack of bilingual sentence pairs.

Knowledge Distillation Machine Translation +2

Representation Degeneration Problem in Training Natural Language Generation Models

no code implementations ICLR 2019 Jun Gao, Di He, Xu Tan, Tao Qin, Li-Wei Wang, Tie-Yan Liu

We study an interesting problem in training neural network-based models for natural language generation tasks, which we call the \emph{representation degeneration problem}.

Language Modelling Machine Translation +3

Unsupervised Pivot Translation for Distant Languages

no code implementations ACL 2019 Yichong Leng, Xu Tan, Tao Qin, Xiang-Yang Li, Tie-Yan Liu

In this work, we introduce unsupervised pivot translation for distant languages, which translates a language to a distant language through multiple hops, and the unsupervised translation on each hop is relatively easier than the original direct translation.

Machine Translation NMT +1

FastSpeech: Fast, Robust and Controllable Text to Speech

21 code implementations NeurIPS 2019 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.

Ranked #10 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Speech Synthesis Text-To-Speech Synthesis

FastSpeech: Fast,Robustand Controllable Text-to-Speech

10 code implementations22 May 2019 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i. e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control).

Text-To-Speech Synthesis

Almost Unsupervised Text to Speech and Automatic Speech Recognition

no code implementations13 May 2019 Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing and both achieve impressive performance thanks to the recent advance in deep learning and large amount of aligned speech and text data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

MASS: Masked Sequence to Sequence Pre-training for Language Generation

5 code implementations7 May 2019 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

Pre-training and fine-tuning, e. g., BERT, have achieved great success in language understanding by transferring knowledge from rich-resource pre-training task to the low/zero-resource downstream tasks.

Conversational Response Generation Response Generation +4

Multilingual Neural Machine Translation with Knowledge Distillation

1 code implementation ICLR 2019 Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu

Multilingual machine translation, which translates multiple languages with a single model, has attracted much attention due to its efficiency of offline training and online serving.

Knowledge Distillation Machine Translation +1

Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input

no code implementations23 Dec 2018 Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, Tie-Yan Liu

Non-autoregressive translation (NAT) models, which remove the dependence on previous target tokens from the inputs of the decoder, achieve significantly inference speedup but at the cost of inferior accuracy compared to autoregressive translation (AT) models.

Machine Translation Translation +1

Hybrid Self-Attention Network for Machine Translation

no code implementations1 Nov 2018 Kaitao Song, Xu Tan, Furong Peng, Jianfeng Lu

The encoder-decoder is the typical framework for Neural Machine Translation (NMT), and different structures have been developed for improving the translation performance.

Machine Translation NMT +1

FRAGE: Frequency-Agnostic Word Representation

2 code implementations NeurIPS 2018 Chengyue Gong, Di He, Xu Tan, Tao Qin, Li-Wei Wang, Tie-Yan Liu

Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks.

Language Modelling Machine Translation +5

Beyond Error Propagation in Neural Machine Translation: Characteristics of Language Also Matter

no code implementations EMNLP 2018 Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jian-Huang Lai, Tie-Yan Liu

Many previous works have discussed the relationship between error propagation and the \emph{accuracy drop} (i. e., the left part of the translated sentence is often better than its right part in left-to-right decoding models) problem.

Machine Translation Text Summarization +1

Model-Level Dual Learning

no code implementations ICML 2018 Yingce Xia, Xu Tan, Fei Tian, Tao Qin, Nenghai Yu, Tie-Yan Liu

Many artificial intelligence tasks appear in dual forms like English$\leftrightarrow$French translation and speech$\leftrightarrow$text transformation.

Machine Translation Sentiment Analysis +1

Double Path Networks for Sequence to Sequence Learning

1 code implementation COLING 2018 Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin, Tie-Yan Liu

In this work we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which leverage the advantages of both models by using double path information fusion.

Dense Information Flow for Neural Machine Translation

1 code implementation NAACL 2018 Yanyao Shen, Xu Tan, Di He, Tao Qin, Tie-Yan Liu

Recently, neural machine translation has achieved remarkable progress by introducing well-designed deep neural networks into its encoder-decoder framework.

Machine Translation NMT +1

Cannot find the paper you are looking for? You can Submit a new open access paper.