The I2V model is designed to produce videos that strictly adhere to the provided reference image, preserving its content, structure, and style.
This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis.
However, it is still challenging to train a universal vocoder which can generalize well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalization, singing, and musical pieces.
Noise-robust TTS models are often trained on enhanced speech, which still contains speech distortion and residual background noise that degrade the quality of the synthesized speech.
Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity.
While the vanilla transducer has no prior preference for any of the valid paths, this work aims to enforce preferred paths and achieve controllable alignment prediction.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
Various applications of voice synthesis have been developed independently, despite the fact that they all generate "voice" as their output.
Recently, many efforts have been made to explore how the brain processes speech using electroencephalographic (EEG) signals, and deep learning-based approaches have been shown to be applicable in this field.
Despite the rapid progress in speech enhancement (SE) research, enhancing the quality of desired speech in environments with strong noise and interfering speakers remains challenging.
Besides predicting the target sequence, CTC also predicts an alignment as a side product, i.e., the most probable input-length sequence that specifies a hard alignment between the input frames and the target units.
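To make this notion concrete, the minimal sketch below reads such an alignment off CTC frame posteriors; the tensor shapes, vocabulary size, and blank index are illustrative assumptions, not details from the paper.

```python
import torch

# Hypothetical CTC posteriors: (T, V) frame-level distribution over the
# vocabulary, where index 0 is the blank symbol (an assumption here).
T, V = 8, 5
log_probs = torch.randn(T, V).log_softmax(dim=-1)

# Under CTC's conditional-independence assumption across frames, the most
# probable input-length path is simply the frame-wise argmax.
alignment = log_probs.argmax(dim=-1)  # shape (T,), one label per frame

# Collapsing repeats and removing blanks recovers the target sequence,
# while `alignment` itself gives the hard frame-to-unit alignment.
collapsed = [int(p) for i, p in enumerate(alignment)
             if p != 0 and (i == 0 or p != alignment[i - 1])]
print(alignment.tolist(), collapsed)
```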
In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
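As a rough illustration of how such a pipeline fits together, the toy skeleton below wires placeholder modules in the order described (text encoder, decoder over discrete tokens, VQ-VAE decoding to mel features, vocoder); every class name, dimension, and layer choice is a stand-in assumption, not the proposed architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the described pipeline; all modules are toy stand-ins.
class ToyTextToSound(nn.Module):
    def __init__(self, vocab=100, dim=32, codebook=64, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab, dim)      # text prompt -> features
        self.decoder = nn.Linear(dim, codebook)           # features -> VQ token logits
        self.codebook = nn.Embedding(codebook, dim)       # VQ-VAE codebook lookup
        self.vqvae_decoder = nn.Linear(dim, n_mels)       # tokens -> mel frames
        self.vocoder = nn.Linear(n_mels, 256)             # mel frame -> waveform chunk

    def forward(self, text_ids):
        h = self.text_encoder(text_ids)                   # (B, L, dim)
        tokens = self.decoder(h).argmax(-1)               # discrete sound tokens
        mel = self.vqvae_decoder(self.codebook(tokens))   # (B, L, n_mels)
        return self.vocoder(mel).flatten(1)               # (B, L * 256) waveform

wav = ToyTextToSound()(torch.randint(0, 100, (1, 10)))
print(wav.shape)
```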
In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation (AISR) learning method.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages at the frame level and shows superior performance on both monolingual and multilingual ASR tasks.
However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is used only in system training, which creates a mismatch between training and decoding, and the on-the-fly decoding process in MBR-based methods requires pre-trained models and slows down training.
This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks.
Then, the LM score of the hypothesis is obtained by intersecting the generated lattice with an external word N-gram LM.
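For intuition, the snippet below shows a simpler n-best form of the same idea: a hypothesis word sequence is scored against a toy word bigram LM and combined with its acoustic score. The toy LM table, backoff value, and LM weight are assumptions, and the actual system performs the intersection over the full lattice rather than scoring individual hypotheses.

```python
import math

# Toy word bigram log-probabilities standing in for the external word N-gram LM.
bigram_logprob = {("<s>", "the"): -0.5, ("the", "cat"): -1.2, ("cat", "</s>"): -0.3}

def lm_score(words, table, backoff=-5.0):
    # Sum bigram log-probabilities over the padded word sequence.
    seq = ["<s>"] + words + ["</s>"]
    return sum(table.get((a, b), backoff) for a, b in zip(seq, seq[1:]))

acoustic_score = -12.7  # placeholder score from the ASR model for one hypothesis
total = acoustic_score + 0.5 * lm_score(["the", "cat"], bigram_logprob)  # LM weight 0.5
print(total)
```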
Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks.
Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type.
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training.
However, state-of-the-art context modeling methods in conversational TTS only model the textual information in context with a recurrent neural network (RNN).
End-to-end speech recognition generally uses hand-engineered acoustic features as input and excludes the feature extraction module from its joint optimization.
In this paper, we explore effective ways to leverage contextual information to improve speech dereverberation performance in real-world reverberant environments.
This paper proposes the target speaker enhancement based speaker verification network (TASE-SVNet), an all neural model that couples target speaker enhancement and speaker embedding extraction for robust speaker verification (SV).
In addition to using the prediction error as a metric for evaluating our localization model, we also demonstrate its effectiveness as a frontend, with automatic speech recognition (ASR) as the downstream task.
This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the textual-to-acoustic alignment layer by layer.
First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples.
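The role of the queue can be summarised with the minimal sketch below; the queue size, embedding dimension, and temperature are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

# MoCo-style negative queue (sizes are assumptions): new embeddings from the
# momentum encoder are enqueued each step and the oldest are dropped, so the
# contrastive loss always sees a large pool of negatives.
queue_size, dim, batch = 1024, 256, 32
queue = F.normalize(torch.randn(queue_size, dim), dim=1)

def dequeue_and_enqueue(queue, keys):
    # keys: (batch, dim) momentum-encoder embeddings for the current batch
    return torch.cat([keys, queue], dim=0)[:queue.size(0)]

keys = F.normalize(torch.randn(batch, dim), dim=1)
queue = dequeue_and_enqueue(queue, keys)

# InfoNCE logits: one positive similarity per anchor plus queue negatives.
anchors = F.normalize(torch.randn(batch, dim), dim=1)
pos = (anchors * keys).sum(dim=1, keepdim=True)   # (batch, 1)
neg = anchors @ queue.t()                         # (batch, queue_size)
logits = torch.cat([pos, neg], dim=1) / 0.07      # temperature 0.07 (assumption)
loss = F.cross_entropy(logits, torch.zeros(batch, dtype=torch.long))
print(loss.item())
```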
Target-speaker speech recognition aims to recognize target-speaker speech from noisy environments with background noise and interfering speakers.
The advantages of D-ASR over existing methods are threefold: (1) it provides explicit speaker locations, (2) it improves the explainability factor, and (3) it achieves better ASR performance as the process is more streamlined.
This multiple scaling mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks.
Non-autoregressive (NAR) transformer models have achieved significant inference speedup, but at the cost of inferior accuracy compared to autoregressive (AR) models, in automatic speech recognition (ASR).
In this work, we propose to deal with this issue and synthesize expressive Peking Opera singing from the music score based on the Duration Informed Attention Network (DurIAN) framework.
Purely neural network (NN) based speech separation and enhancement methods, although they can achieve good objective scores, inevitably cause nonlinear speech distortions that are harmful to automatic speech recognition (ASR).
This paper presents a method that generates expressive singing voices for Peking opera.
The proposed algorithm first integrates speech and singing synthesis into a unified framework, and then learns universal speaker embeddings that are shareable between speech and singing synthesis tasks.
However, the converted singing voice can be easily out of key, showing that the existing approach cannot model the pitch information precisely.
In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for end-to-end speech recognition.
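For reference, a common N-best approximation of an MBR objective looks like the sketch below; the N-best formulation, the risk definition, and the toy numbers are assumptions for illustration and may differ from the paper's exact recipe.

```python
import torch

def mbr_loss(nbest_logprobs, nbest_risks):
    # nbest_logprobs: (N,) total model log-probabilities of N hypotheses
    # nbest_risks:    (N,) risk of each hypothesis, e.g. word errors vs. reference
    posts = torch.softmax(nbest_logprobs, dim=0)  # renormalise over the N-best list
    return (posts * nbest_risks).sum()            # expected risk to be minimised

logps = torch.tensor([-3.2, -4.0, -5.5], requires_grad=True)
risks = torch.tensor([0.0, 2.0, 1.0])             # e.g. number of word errors
loss = mbr_loss(logps, risks)
loss.backward()                                   # gradients flow to the model scores
print(loss.item(), logps.grad)
```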
Self-attention networks (SAN) have been introduced into automatic speech recognition (ASR) and have achieved state-of-the-art performance owing to their superior ability to capture long-term dependencies.
In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously.
In this work, three lattice-free (LF) discriminative training criteria for purely sequence-trained neural network acoustic models are compared on LVCSR tasks, namely maximum mutual information (MMI), boosted maximum mutual information (bMMI) and state-level minimum Bayes risk (sMBR).