To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are built on diffusion models.
However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks.
However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods.
Recently, stunning improvements on multi-channel speech separation have been achieved by neural beamformers when direction information is available.
Visual speech (i.e., lip motion) is highly related to auditory speech due to their co-occurrence and synchronization in speech production.
Recently, many deep learning based beamformers have been proposed for multi-channel speech separation.
no code implementations • 2 Nov 2022 • Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Deldago, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, Luis Buera
This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge.
It has two stages: the speech awareness (SA) stage and the language fusion (LF) stage.
The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
These algorithms usually map the multi-channel audio input to a single output (i.e., the overall spatial pseudo-spectrum (SPS) of all sources), an approach called multiple-input single-output (MISO).
Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition.
Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio.
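As a concrete illustration of estimating direction of arrival from multi-channel audio, the sketch below uses classical GCC-PHAT time-delay estimation between two microphones, a standard non-neural baseline (an assumption here; the papers above use learned beamformers/localizers, not this method):

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the delay of signal y relative to x (seconds) via GCC-PHAT."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)
    R = R / (np.abs(R) + 1e-12)      # PHAT weighting: keep phase information only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic two-microphone example: mic 2 receives the source 10 samples later.
fs = 8000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
delay = 10
x = s
y = np.concatenate((np.zeros(delay), s[:-delay]))
tau = gcc_phat(x, y, fs)             # ~ delay / fs = 1.25 ms
```

Given the microphone spacing d and sound speed c, the delay maps to a DOA estimate via arcsin(c * tau / d); a grid of such pairwise delays is what a spatial pseudo-spectrum summarizes.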
We utilize a pre-trained action unit (AU) classifier to ensure that the generated images contain correct AU information.
Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance.
Quantitative and qualitative experiments demonstrate that our method outperforms existing methods in both image quality and lip-sync accuracy.
The end-to-end speech synthesis model can directly take an utterance as reference audio, and generate speech from the text with prosody and speaker characteristics similar to the reference audio.
Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication.
To eliminate such mismatch, we propose a complete time-domain speaker extraction solution, called SpEx+.
Ranked #1 on Speech Extraction on WSJ0-2mix-extr
They are also believed to play an essential role in the low power consumption of biological systems, whose efficiency is attracting increasing attention to the field of neuromorphic computing.
However, the information sampled from the latent space usually becomes useless due to the KL divergence vanishing issue, and the highly abstractive global variables easily dilute the personal features of the replier, leading to responses that are not replier-specific.
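One common mitigation for the KL-vanishing issue is KL annealing, where the KL term's weight is ramped up from zero during training so the decoder cannot simply ignore the latent variable. The sketch below shows this for a diagonal-Gaussian VAE posterior against a standard-normal prior (an illustrative assumption; the paper above may use a different remedy):

```python
import numpy as np

def kl_weight(step, warmup_steps=10000):
    """Linear KL annealing: ramp the KL term's weight from 0 to 1 over warmup."""
    return min(1.0, step / warmup_steps)

def vae_loss(recon_nll, mu, logvar, step, warmup_steps=10000):
    """Annealed ELBO: reconstruction NLL plus weighted KL(q(z|x) || N(0, I))."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return recon_nll + kl_weight(step, warmup_steps) * kl
```

Early in training (step 0) the loss reduces to the reconstruction term alone; by the end of warmup the full ELBO is optimized.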
Most existing AU detection works that consider AU relationships rely on probabilistic graphical models with manually extracted features.
Our framework is a unifying system that consistently integrates three major functional parts: sparse encoding, efficient learning, and robust readout.
They ignore the fact that a person discusses diverse topics when dynamically interacting with different people.
In this paper, we propose a novel neural Tensor network framework with Interactive Attention and Sparse Learning (TIASL) for implicit discourse relation recognition.
Recently, increasing attention has been directed to the study of speech emotion recognition, in which global acoustic features of an utterance are mostly used to eliminate content differences.