In this work, we propose to incorporate KG (including both entities and relations) into the language learning process to obtain KG-enhanced pretrained Language Model, namely KLMo.
no code implementations • 8 Oct 2023 • Tianyang Zhong, Wei Zhao, Yutong Zhang, Yi Pan, Peixin Dong, Zuowei Jiang, Xiaoyan Kui, Youlan Shang, Li Yang, Yaonai Wei, Longtao Yang, Hao Chen, Huan Zhao, Yuxiao Liu, Ning Zhu, Yiwei Li, Yisong Wang, Jiaqi Yao, Jiaqi Wang, Ying Zeng, Lei He, Chao Zheng, Zhixue Zhang, Ming Li, Zhengliang Liu, Haixing Dai, Zihao Wu, Lu Zhang, Shu Zhang, Xiaoyan Cai, Xintao Hu, Shijie Zhao, Xi Jiang, Xin Zhang, Xiang Li, Dajiang Zhu, Lei Guo, Dinggang Shen, Junwei Han, Tianming Liu, Jun Liu, Tuo Zhang
Radiology report generation, as a key step in medical image analysis, is critical to the quantitative analysis of clinically informed decision-making levels.
Cameras are rapidly becoming the choice for on-board sensors towards space rendezvous due to their small form factor and inexpensive power, mass, and volume costs.
In this work, we present a system that can automatically generate high-quality audiobooks from online e-books.
In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS) system designed for the Blizzard Challenge 2023.
no code implementations • 5 Sep 2023 • Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian
TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech.
It has long been an ill-posed problem to predict absolute depth maps from single images in real (unseen) indoor scenes.
Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency.
The components of translator, style encoder, and style generator constitute a diversified image generator.
On the one hand, the dehazing task is an illposedness problem, which means that no unique solution exists.
To address this gap and foster future research in this area, this paper presents a comprehensive survey on the use of 4D mmWave radar in autonomous driving.
To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor.
The RL-GP adopts the ensemble population strategies.
Extensive experiments in simulated and real data show that our model significantly outperforms existing state-of-the-art models without learning from paired images or prior individual knowledge.
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.
Neural text-to-speech (TTS) generally consists of cascaded architecture with separately optimized acoustic model and vocoder, or end-to-end architecture with continuous mel-spectrograms or self-extracted speech frames as the intermediate representations to bridge acoustic model and vocoder, which suffers from two limitations: 1) the continuous acoustic frames are hard to predict with phoneme only, and acoustic information like duration or pitch is also needed to solve the one-to-many problem, which is not easy to scale on large scale and noise datasets; 2) to achieve diverse speech output based on continuous speech features, complex VAE or flow-based models are usually required.
In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis.
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples.
In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech.
Such approaches, however, ignore the VE's unique nature of relation inference between the premise and hypothesis.
However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.
We first benchmark representative SSL methods for dense prediction tasks in pathology images.
We evaluate ReMix on two public datasets with two state-of-the-art MIL methods.
In this paper, we propose a novel framework for learning style representation from abundant plain text in a self-supervised manner.
Combining this novel perspective of two-stage synthesis with advanced generative models (i. e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples.
In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Ranked #1 on Text-To-Speech Synthesis on LJSpeech (using extra training data)
We model the speaker characteristics systematically to improve the generalization on new speakers.
Furthermore, we take the transfer results as additional training data for fluid segmentation to prove the advantage of our model indirectly, i. e., in the task of data adaptation and augmentation.
In this paper, we propose InferGrad, a diffusion model for vocoder that incorporates inference process into training, to reduce the inference iterations while maintaining high generation quality.
In this work, we propose LW-GCN, a lightweight FPGA-based accelerator with a software-hardware co-designed process to tackle irregularity in computation and memory access in GCN inference.
To further prove the ability of our method, we test it on public dataset MS COCO, and the results show that our LF-YOLO has a outstanding versatility detection performance.
The goal of this challenge is to synthesize natural and high-quality speech from text, and we approach this goal in two perspectives: The first is to directly model and generate waveform in 48 kHz sampling rate, which brings higher perception quality than previous systems with 16 kHz or 24 kHz sampling rate; The second is to model the variation information in speech through a systematic design, which improves the prosody and naturalness.
End-to-end TTS requires a large amount of speech/text paired data to cover all necessary knowledge, particularly how to pronounce different words in diverse contexts, so that a neural model may learn such knowledge accordingly.
Different from single object reconstruction from photos, this task has the unique challenge of constructing multiple objects at high resolutions.
Secondly, in these models the content/text, prosody, and speaker timbre are usually highly entangled, it's therefore not realistic to expect a satisfied result when freely combining these components, such as to transfer speaking style between speakers.
Experiments on kidney tumor segmentation task demonstrate that TumorCP surpasses the strong baseline by a remarkable margin of 7. 12% on tumor Dice.
Compared to traditional methods based on object detectors, the essential design in our work is a parallel feature difference calculation structure that infers map changes by comparing features extracted from the camera and rasterized images.
Experimental results obtained by the Transformer TTS show that the proposed BERT can extract fine-grained, segment-level prosody, which is complementary to utterance-level prosody to improve the final prosody of the TTS speech.
The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data.
Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data.
To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts.
Second, we can largely boost the robustness of existing ConvNets, proved by: (i) testing on scans with synthetic pathologies, and (ii) training and evaluation on scans of different scanning setups across datasets.
In this paper, we first introduce the concept of semantic objectness to exploit the geometric relationship of these two tasks through an analysis of the imaging process, then propose a Semantic Object Segmentation and Depth Estimation Network (SOSD-Net) based on the objectness assumption.
Ranked #77 on Semantic Segmentation on NYU Depth v2
The ability of deep learning to predict with uncertainty is recognized as key for its adoption in clinical routines.
Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition.
In this paper, we propose a framework, named Oral-3D, to reconstruct the 3D oral cavity from a single PX image and prior information of the dental arch.
The encoder-decoder network is widely used to learn deep feature representations from pixel-wise annotations in biomedical image analysis.
As an emerging technology, blockchain has achieved great success in numerous application scenarios, from intelligent healthcare to smart cities.
Cryptography and Security Distributed, Parallel, and Cluster Computing 68M14 C.2.2
Sparrow integrates the exploration ability of BRKGA and the exploitation ability of ALNS.
Inspired by the great success of convolutional neural networks on structural data like videos and images, graph neural network (GNN) emerges as a powerful approach to process non-euclidean data structures and has been proved powerful in various application domains such as social network, e-commerce, and knowledge graph.
Distributed, Parallel, and Cluster Computing
Experimental results show our proposed methods especially the second one (bidirectional decoder regularization), leads a significantly improvement on both robustness and overall naturalness, as outperforming baseline (the revised version of Tacotron2) with a MOS gap of 0. 14 in a challenging test, and achieving close to human quality (4. 42 vs. 4. 49 in MOS) on general test.
In this paper, we propose a novel stepwise monotonic attention method in sequence-to-sequence acoustic modeling to improve the robustness on out-of-domain inputs.
The end-to-end TTS, which can predict speech directly from a given sequence of graphemes or phonemes, has shown improved performance over the conventional TTS.
However, the autoregressive module training is affected by the exposure bias, or the mismatch between the different distributions of real and predicted data.
In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework.
In recent years, Recurrent Neural Networks (RNNs) based models have been applied to the Slot Filling problem of Spoken Language Understanding and achieved the state-of-the-art performances.
In this paper, we introduce the Variational Autoencoder (VAE) to an end-to-end speech synthesis model, to learn the latent representation of speaking styles in an unsupervised manner.
In order to learn monocular depth by embedding the focal length, we propose a method to generate synthetic varying-focal-length dataset from fixed-focal-length datasets, and a simple and effective method is implemented to fill the holes in the newly generated images.
This paper studies the abstractive multi-document summarization for event-oriented news texts through event information extraction and abstract representation.
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e. g. speech utterances or handwritten documents.
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e. g. speech utterances or handwritten documents.
The convergence rate of the proposed algorithm is almost the same as that of the traditional IRLS algorithms, that is, exponentially fast.
In this paper, we propose a novel algorithm for structured sparsity reconstruction.