Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging.
We show that training causes a network's temporal integration windows to shrink at early layers and expand at higher layers, creating a hierarchy of integration windows across the network.
We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years.
A context codec module, containing a context encoder and a context decoder, is designed as a learnable downsampling and upsampling module to decrease the length of the feature sequence processed by the separation module.
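Concretely, such a codec pairs a learnable downsampler with a matching upsampler. Below is a minimal PyTorch sketch; the strided-convolution design and all channel/compression settings are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical context codec sketch: a strided Conv1d shortens the feature
# sequence before separation, and a matching ConvTranspose1d restores the
# original length afterwards. All sizes are illustrative assumptions.
class ContextCodec(nn.Module):
    def __init__(self, channels=128, compression=4):
        super().__init__()
        self.context_encoder = nn.Conv1d(channels, channels,
                                         kernel_size=compression, stride=compression)
        self.context_decoder = nn.ConvTranspose1d(channels, channels,
                                                  kernel_size=compression, stride=compression)

    def compress(self, x):               # x: (batch, channels, time)
        return self.context_encoder(x)   # time shrinks by `compression`

    def decompress(self, z):
        return self.context_decoder(z)   # time restored to the original length

codec = ContextCodec()
x = torch.randn(1, 128, 400)
z = codec.compress(x)    # (1, 128, 100): shorter sequence for the separator
y = codec.decompress(z)  # (1, 128, 400)
```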
An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones.
Beamforming has been extensively investigated for multi-channel audio processing tasks.
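For orientation, the simplest such technique is the delay-and-sum beamformer, which time-aligns the microphone signals toward a steering direction before averaging. A minimal NumPy sketch, assuming a uniform linear array; the geometry, sample rate, and steering angle are illustrative:

```python
import numpy as np

# Delay-and-sum beamformer sketch: each channel is delayed (in the frequency
# domain) to align the look direction, then the channels are averaged.
def delay_and_sum(signals, mic_positions, angle_deg, fs, c=343.0):
    """signals: (num_mics, num_samples); mic_positions: metres along the array axis."""
    delays = mic_positions * np.sin(np.deg2rad(angle_deg)) / c  # seconds per mic
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for sig, tau in zip(signals, delays):
        # A linear phase shift implements a (fractional) time delay.
        out += np.fft.irfft(np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * tau), n)
    return out / len(signals)
```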
Most previous methods formulate the separation problem in the time-frequency representation of the mixed signal, which has several drawbacks: the phase and magnitude of the signal are decoupled, the time-frequency representation is suboptimal for speech separation, and calculating spectrograms incurs long latency.
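The first drawback is easy to demonstrate: mask-based time-frequency methods typically estimate magnitudes and reuse the mixture phase for reconstruction. A toy SciPy illustration on a synthetic signal; the mask threshold is arbitrary:

```python
import numpy as np
from scipy.signal import stft, istft

# Toy illustration of phase/magnitude decoupling: the mask acts only on the
# magnitude, while the (noisy) mixture phase is reused at reconstruction.
fs = 8000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.random.randn(fs)

_, _, Z = stft(mixture, fs=fs, nperseg=256)
mask = (np.abs(Z) > 0.1).astype(float)                   # crude magnitude-domain mask
enhanced = np.abs(Z) * mask * np.exp(1j * np.angle(Z))   # mixture phase reused as-is
_, y = istft(enhanced, fs=fs)
```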
We investigate the recently proposed Time-domain Audio Separation Network (TasNet) in the task of real-time single-channel speech dereverberation.
We directly model the signal in the time-domain using an encoder-decoder framework and perform the source separation on nonnegative encoder outputs.
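A hedged PyTorch sketch of this pattern, with a placeholder 1x1 convolution standing in for the actual separation module (TasNet uses recurrent or convolutional separators) and illustrative hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Time-domain encoder/separator/decoder sketch: a 1-D conv encoder produces
# nonnegative features (ReLU), per-source masks select each speaker's
# features, and a transposed conv maps them back to waveforms.
class TimeDomainSeparator(nn.Module):
    def __init__(self, num_sources=2, basis=256, kernel=20, stride=10):
        super().__init__()
        self.num_sources = num_sources
        self.encoder = nn.Conv1d(1, basis, kernel, stride=stride, bias=False)
        self.mask_net = nn.Conv1d(basis, basis * num_sources, 1)  # placeholder separator
        self.decoder = nn.ConvTranspose1d(basis, 1, kernel, stride=stride, bias=False)

    def forward(self, wav):                  # wav: (batch, 1, samples)
        feats = F.relu(self.encoder(wav))    # nonnegative encoder outputs
        masks = torch.sigmoid(self.mask_net(feats))
        masks = masks.view(wav.size(0), self.num_sources, -1, feats.size(-1))
        sources = [self.decoder(feats * m).squeeze(1) for m in masks.unbind(1)]
        return torch.stack(sources, dim=1)   # (batch, num_sources, samples)
```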
In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos.
Despite the recent success of deep learning, the nature of the transformations these models apply to the input features remains poorly understood.
A reference point (attractor) is created in the embedding space to represent each speaker, defined as the centroid of that speaker's embeddings.
We propose a novel deep learning framework for single-channel speech separation that creates attractor points in a high-dimensional embedding space of the acoustic signals, which pull together the time-frequency bins corresponding to each source.
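At training time, the attractor of each source reduces to the centroid of the embeddings assigned to it (the reference-point definition above). A minimal PyTorch sketch, where the embedding matrix V and the binary assignment Y are assumed inputs:

```python
import torch

# Attractor-style mask estimation sketch. Assumed inputs: V holds one
# embedding per time-frequency bin; Y is a binary source assignment
# (available from the training targets in attractor-based training).
def attractor_masks(V, Y):
    """V: (num_bins, emb_dim); Y: (num_bins, num_sources) one-hot."""
    # Attractor for each source = centroid of the embeddings assigned to it.
    counts = Y.sum(dim=0, keepdim=True).t().clamp(min=1)   # (num_sources, 1)
    attractors = (Y.t() @ V) / counts                      # (num_sources, emb_dim)
    # Similarity of every bin to every attractor yields soft separation masks.
    return torch.softmax(V @ attractors.t(), dim=1)        # (num_bins, num_sources)
```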
Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks.
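At inference time, deep clustering typically runs k-means over the learned time-frequency embeddings, so the number of sources enters only as the cluster count. A minimal scikit-learn sketch, where the embedding matrix is an assumed output of a trained network:

```python
import numpy as np
from sklearn.cluster import KMeans

# Deep-clustering inference sketch: cluster per-bin embeddings with k-means
# and turn the cluster labels into binary time-frequency masks.
def cluster_masks(embeddings, num_sources):
    """embeddings: (num_bins, emb_dim) from a trained embedding network."""
    labels = KMeans(n_clusters=num_sources, n_init=10).fit_predict(embeddings)
    return np.stack([(labels == s).astype(float) for s in range(num_sources)])
```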