Search Results for author: Lei He

Found 73 papers, 23 papers with code

KLMo: Knowledge Graph Enhanced Pretrained Language Model with Fine-Grained Relationships

1 code implementation Findings (EMNLP) 2021 Lei He, Suncong Zheng, Tao Yang, Feng Zhang

In this work, we propose to incorporate KG (including both entities and relations) into the language learning process to obtain KG-enhanced pretrained Language Model, namely KLMo.

Entity Linking Entity Typing +4

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

no code implementations 5 Mar 2024 Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt.

Quantization Speech Synthesis

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

no code implementations 19 Dec 2023 Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng

Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis especially for the role and out-of-domain scenarios.

Speech Synthesis

Orbital AI-based Autonomous Refuelling Solution

no code implementations 20 Sep 2023 Duarte Rondao, Lei He, Nabil Aouf

Cameras are rapidly becoming the choice for on-board sensors towards space rendezvous due to their small form factor and inexpensive power, mass, and volume costs.

Large-Scale Automatic Audiobook Creation

no code implementations 7 Sep 2023 Brendan Walsh, Mark Hamilton, Greg Newby, Xi Wang, Serena Ruan, Sheng Zhao, Lei He, Shaofei Zhang, Eric Dettinger, William T. Freeman, Markus Weimer

In this work, we present a system that can automatically generate high-quality audiobooks from online e-books.

MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023

no code implementations 6 Sep 2023 Zhihang Xu, Shaofei Zhang, Xi Wang, Jiajun Zhang, Wenning Wei, Lei He, Sheng Zhao

In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS) system designed for the Blizzard Challenge 2023.

Speech Synthesis

PromptTTS 2: Describing and Generating Voices with Text Prompt

no code implementations 5 Sep 2023 Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech.

Language Modelling Large Language Model

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

no code implementations 27 Jul 2023 Chengrui Wei, Meng Yang, Lei He, Nanning Zheng

It has long been an ill-posed problem to predict absolute depth maps from single images in real (unseen) indoor scenes.

3D Reconstruction Data Augmentation +1

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

no code implementations 3 Jul 2023 Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee

Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency.

Sentence

Progressive Energy-Based Cooperative Learning for Multi-Domain Image-to-Image Translation

no code implementations 26 Jun 2023 Weinan Song, Yaxuan Zhu, Lei He, YingNian Wu, Jianwen Xie

The components of translator, style encoder, and style generator constitute a diversified image generator.

Image-to-Image Translation

4D Millimeter-Wave Radar in Autonomous Driving: A Survey

no code implementations 7 Jun 2023 Zeyu Han, Jiahao Wang, Zikun Xu, Shuocheng Yang, Lei He, Shaobing Xu, Jianqiang Wang, Keqiang Li

In an effort to bridge this gap and stimulate future research, this paper presents an exhaustive survey on the utilization of 4D mmWave radar in autonomous driving.

Autonomous Driving Point Cloud Generation

NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

1 code implementation 18 Apr 2023 Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian

To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor.

In-Context Learning Speech Synthesis

Oral-3Dv2: 3D Oral Reconstruction from Panoramic X-Ray Imaging with Implicit Neural Representation

no code implementations 21 Mar 2023 Weinan Song, Haoxin Zheng, Dezhan Tu, Chengwen Liang, Lei He

Extensive experiments in simulated and real data show that our model significantly outperforms existing state-of-the-art models without learning from paired images or prior individual knowledge.

3D Reconstruction

FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model

no code implementations 6 Mar 2023 Ruiqing Xue, Yanqing Liu, Lei He, Xu Tan, Linquan Liu, Edward Lin, Sheng Zhao

Neural text-to-speech (TTS) generally consists of cascaded architecture with separately optimized acoustic model and vocoder, or end-to-end architecture with continuous mel-spectrograms or self-extracted speech frames as the intermediate representations to bridge acoustic model and vocoder, which suffers from two limitations: 1) the continuous acoustic frames are hard to predict with phoneme only, and acoustic information like duration or pitch is also needed to solve the one-to-many problem, which is not easy to scale on large scale and noise datasets; 2) to achieve diverse speech output based on continuous speech features, complex VAE or flow-based models are usually required.

Language Modelling Large Language Model +1

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

1 code implementation 30 Dec 2022 Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.

Denoising

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

1 code implementation 30 Nov 2022 Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei He, Sheng Zhao, Arul Menezes, Jiang Bian

In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech.

Machine Translation Sentence +4

AlignVE: Visual Entailment Recognition Based on Alignment Relations

no code implementations 16 Nov 2022 Biwei Cao, Jiuxin Cao, Jie Gui, Jiayun Shen, Bo Liu, Lei He, Yuan Yan Tang, James Tin-Yau Kwok

Such approaches, however, ignore the VE's unique nature of relation inference between the premise and hypothesis.

Question Answering Relation +2

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

1 code implementation 31 Oct 2022 Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei

However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.

Speech-to-Speech Translation Translation

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

no code implementations 25 Jun 2022 Yihan Wu, Xi Wang, Shaofei Zhang, Lei He, Ruihua Song, Jian-Yun Nie

In this paper, we propose a novel framework for learning style representation from abundant plain text in a self-supervised manner.

Contrastive Learning Deep Clustering +2

BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

1 code implementation 30 May 2022 Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu

Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples.

Audio Synthesis

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

3 code implementations 9 May 2022 Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, YuanHao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.

 Ranked #1 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Sentence Speech Synthesis +1

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

no code implementations 1 Apr 2022 Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu

We model the speaker characteristics systematically to improve the generalization on new speakers.

Speech Synthesis

MDT-Net: Multi-domain Transfer by Perceptual Supervision for Unpaired Images in OCT Scan

no code implementations 12 Mar 2022 Weinan Song, Gaurav Fotedar, Nima Tajbakhsh, Ziheng Zhou, Lei He, Xiaowei Ding

Furthermore, we take the transfer results as additional training data for fluid segmentation to prove the advantage of our model indirectly, i.e., in the task of data adaptation and augmentation.

Anatomy Data Augmentation +2

InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training

no code implementations 8 Feb 2022 Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo Mandic, Lei He, Sheng Zhao

In this paper, we propose InferGrad, a diffusion model for vocoder that incorporates inference process into training, to reduce the inference iterations while maintaining high generation quality.

Denoising

Exploring Forensic Dental Identification with Deep Learning

1 code implementation NeurIPS 2021 Yuan Liang, Weikun Han, Liang Qiu, Chen Wu, Yiting Shao, Kun Wang, Lei He

In this work, we pioneer to study deep learning for dental forensic identification based on panoramic radiographs.

LW-GCN: A Lightweight FPGA-based Graph Convolutional Network Accelerator

no code implementations 4 Nov 2021 Zhuofu Tao, Chen Wu, Yuan Liang, Lei He

In this work, we propose LW-GCN, a lightweight FPGA-based accelerator with a software-hardware co-designed process to tackle irregularity in computation and memory access in GCN inference.

Quantization

LF-YOLO: A Lighter and Faster YOLO for Weld Defect Detection of X-ray Image

1 code implementation 28 Oct 2021 Moyun Liu, Youping Chen, Lei He, Yang Zhang, Jingming Xie

To further prove the ability of our method, we test it on the public dataset MS COCO, and the results show that our LF-YOLO has outstanding versatility in detection performance.

Defect Detection

DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021

1 code implementation 25 Oct 2021 Yanqing Liu, Zhihang Xu, Gang Wang, Kuan Chen, Bohan Li, Xu Tan, Jinzhu Li, Lei He, Sheng Zhao

The goal of this challenge is to synthesize natural and high-quality speech from text, and we approach this goal in two perspectives: The first is to directly model and generate waveform in 48 kHz sampling rate, which brings higher perception quality than previous systems with 16 kHz or 24 kHz sampling rate; The second is to model the variation information in speech through a systematic design, which improves the prosody and naturalness.

Speech Synthesis

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

no code implementations 19 Oct 2021 Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

End-to-end TTS requires a large amount of speech/text paired data to cover all necessary knowledge, particularly how to pronounce different words in diverse contexts, so that a neural model may learn such knowledge accordingly.

X2Teeth: 3D Teeth Reconstruction from a Single Panoramic Radiograph

no code implementations 30 Aug 2021 Yuan Liang, Weinan Song, Jiawei Yang, Liang Qiu, Kun Wang, Lei He

Different from single object reconstruction from photos, this task has the unique challenge of constructing multiple objects at high resolutions.

3D Reconstruction Anatomy +2

Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

no code implementations 27 Jul 2021 Shifeng Pan, Lei He

Secondly, in these models the content/text, prosody, and speaker timbre are usually highly entangled; it is therefore not realistic to expect a satisfactory result when freely combining these components, such as when transferring speaking style between speakers.

Expressive Speech Synthesis Style Transfer

TumorCP: A Simple but Effective Object-Level Data Augmentation for Tumor Segmentation

1 code implementation 21 Jul 2021 Jiawei Yang, Yao Zhang, Yuan Liang, Yang Zhang, Lei He, Zhiqiang He

Experiments on kidney tumor segmentation task demonstrate that TumorCP surpasses the strong baseline by a remarkable margin of 7.12% on tumor Dice.

Data Augmentation Tumor Segmentation

Diff-Net: Image Feature Difference based High-Definition Map Change Detection for Autonomous Driving

no code implementations 14 Jul 2021 Lei He, Shengjie Jiang, Xiaoqing Liang, Ning Wang, Shiyu Song

Compared to traditional methods based on object detectors, the essential design in our work is a parallel feature difference calculation structure that infers map changes by comparing features extracted from the camera and rasterized images.

Autonomous Driving Change Detection +3

Speech BERT Embedding For Improving Prosody in Neural TTS

no code implementations 8 Jun 2021 Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He

Experimental results obtained by the Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody to improve the final prosody of the TTS speech.

On Addressing Practical Challenges for RNN-Transducer

no code implementations 27 Apr 2021 Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong

The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data.

Speech Recognition

Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation

no code implementations 8 Apr 2021 Fengpeng Yue, Yan Deng, Lei He, Tom Ko

Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data.

Automatic Speech Recognition (ASR) +3

Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis

2 code implementations 5 Mar 2021 Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts.

Speech Synthesis

Atlas-aware ConvNet for Accurate yet Robust Anatomical Segmentation

no code implementations 2 Feb 2021 Yuan Liang, Weinan Song, Jiawei Yang, Liang Qiu, Kun Wang, Lei He

Second, we can largely boost the robustness of existing ConvNets, proved by: (i) testing on scans with synthetic pathologies, and (ii) training and evaluation on scans of different scanning setups across datasets.

SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular images

no code implementations 19 Jan 2021 Lei He, Jiwen Lu, Guanghui Wang, Shiyu Song, Jie Zhou

In this paper, we first introduce the concept of semantic objectness to exploit the geometric relationship of these two tasks through an analysis of the imaging process, then propose a Semantic Object Segmentation and Depth Estimation Network (SOSD-Net) based on the objectness assumption.

Monocular Depth Estimation Multi-Task Learning +3

Exploring Instance-Level Uncertainty for Medical Detection

no code implementations 23 Dec 2020 Jiawei Yang, Yuan Liang, Yao Zhang, Weinan Song, Kun Wang, Lei He

The ability of deep learning to predict with uncertainty is recognized as key for its adoption in clinical routines.

Lung Nodule Detection

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

no code implementations 30 Jul 2020 Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong

Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.

Automatic Speech Recognition (ASR) +1

Accurate Anchor Free Tracking

no code implementations 13 Jun 2020 Shengyun Peng, Yunxuan Yu, Kun Wang, Lei He

Specifically, a target object is defined by a bounding box center, tracking offset, and object size.

Object Visual Object Tracking

Oral-3D: Reconstructing the 3D Bone Structure of Oral Cavity from 2D Panoramic X-ray

no code implementations 18 Mar 2020 Weinan Song, Yuan Liang, Jiawei Yang, Kun Wang, Lei He

In this paper, we propose a framework, named Oral-3D, to reconstruct the 3D oral cavity from a single PX image and prior information of the dental arch.

3D Reconstruction

T-Net: Learning Feature Representation with Task-specific Supervision for Biomedical Image Analysis

no code implementations 19 Feb 2020 Weinan Song, Yuan Liang, Jiawei Yang, Kun Wang, Lei He

The encoder-decoder network is widely used to learn deep feature representations from pixel-wise annotations in biomedical image analysis.

Region Proposal Representation Learning

Effective Scaling of Blockchain Beyond Consensus Innovations and Moore's Law

no code implementations 7 Jan 2020 Yinqiu Liu, Kai Qian, Jianli Chen, Kun Wang, Lei He

As an emerging technology, blockchain has achieved great success in numerous application scenarios, from intelligent healthcare to smart cities.

Cryptography and Security Distributed, Parallel, and Cluster Computing 68M14 C.2.2

EnGN: A High-Throughput and Energy-Efficient Accelerator for Large Graph Neural Networks

no code implementations 31 Aug 2019 Lei He

Inspired by the great success of convolutional neural networks on structural data like videos and images, graph neural network (GNN) emerges as a powerful approach to process non-euclidean data structures and has been proved powerful in various application domains such as social network, e-commerce, and knowledge graph.

Distributed, Parallel, and Cluster Computing

Forward-Backward Decoding for Regularizing End-to-End TTS

1 code implementation 18 Jul 2019 Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, Jian-Hua Tao

Experimental results show that our proposed methods, especially the second one (bidirectional decoder regularization), lead to a significant improvement in both robustness and overall naturalness, outperforming the baseline (the revised version of Tacotron2) with a MOS gap of 0.14 in a challenging test and achieving close to human quality (4.42 vs. 4.49 in MOS) on a general test.

Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS

1 code implementation 3 Jun 2019 Mutian He, Yan Deng, Lei He

In this paper, we propose a novel stepwise monotonic attention method in sequence-to-sequence acoustic modeling to improve the robustness on out-of-domain inputs.

Hard Attention

A New GAN-based End-to-End TTS Training Algorithm

no code implementations 9 Apr 2019 Haohan Guo, Frank K. Soong, Lei He, Lei Xie

However, the autoregressive module training is affected by the exposure bias, or the mismatch between the different distributions of real and predicted data.

Generative Adversarial Network Sentence +1

Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS

no code implementations 9 Apr 2019 Haohan Guo, Frank K. Soong, Lei He, Lei Xie

The end-to-end TTS, which can predict speech directly from a given sequence of graphemes or phonemes, has shown improved performance over the conventional TTS.

Sentence

Feature reinforcement with word embedding and parsing information in neural TTS

no code implementations 3 Jan 2019 Huaiping Ming, Lei He, Haohan Guo, Frank K. Soong

In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework.

Sentence

Recurrent Neural Networks with Pre-trained Language Model Embedding for Slot Filling Task

1 code implementation 12 Dec 2018 Liang Qiu, Yuanyi Ding, Lei He

In recent years, Recurrent Neural Networks (RNNs) based models have been applied to the Slot Filling problem of Spoken Language Understanding and achieved the state-of-the-art performances.

Language Modelling slot-filling +2

Learning latent representations for style control and transfer in end-to-end speech synthesis

2 code implementations 11 Dec 2018 Ya-Jie Zhang, Shifeng Pan, Lei He, Zhen-Hua Ling

In this paper, we introduce the Variational Autoencoder (VAE) to an end-to-end speech synthesis model, to learn the latent representation of speaking styles in an unsupervised manner.

Speech Synthesis Style Transfer

Learning Depth from Single Images with Deep Neural Network Embedding Focal Length

no code implementations 27 Mar 2018 Lei He, Guanghui Wang, Zhanyi Hu

In order to learn monocular depth by embedding the focal length, we propose a method to generate synthetic varying-focal-length dataset from fixed-focal-length datasets, and a simple and effective method is implemented to fill the holes in the newly generated images.

Depth Estimation Network Embedding +1

Abstractive News Summarization based on Event Semantic Link Network

no code implementations COLING 2016 Wei Li, Lei He, Hai Zhuge

This paper studies the abstractive multi-document summarization for event-oriented news texts through event information extraction and abstract representation.

Abstractive Text Summarization Document Summarization +2

A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding

no code implementations 1 Nov 2015 Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao

Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e.g. speech utterances or handwritten documents.

Chunking Feature Engineering +4

Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network

4 code implementations 21 Oct 2015 Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao

Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents.

Part-Of-Speech Tagging POS +1

Fast Iteratively Reweighted Least Squares Algorithms for Analysis-Based Sparsity Reconstruction

no code implementations 18 Nov 2014 Chen Chen, Junzhou Huang, Lei He, Hongsheng Li

The convergence rate of the proposed algorithm is almost the same as that of the traditional IRLS algorithms, that is, exponentially fast.

Compressive Sensing
