Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.
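A minimal sketch of this pretrain-then-probe pipeline, assuming a toy frozen encoder as a stand-in for a real pretrained SSL model (e.g., wav2vec 2.0) and a hypothetical 10-class downstream task:

```python
import torch
import torch.nn as nn

class SSLEncoder(nn.Module):
    """Toy stand-in for a pretrained speech SSL model."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=320)  # ~20 ms hop at 16 kHz
    def forward(self, wav):                              # wav: (batch, samples)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)  # (batch, frames, dim)

encoder = SSLEncoder()
encoder.requires_grad_(False)            # freeze the "pretrained" representation network
probe = nn.Linear(256, 10)               # downstream head, e.g. 10 keyword classes

wav = torch.randn(4, 16000)              # one second of fake 16 kHz audio per item
labels = torch.randint(0, 10, (4,))
feats = encoder(wav).mean(dim=1)         # utterance-level representation
loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()                          # only the probe receives gradients
```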
We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline.
An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech.
SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders in an unsupervised manner.
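As a rough illustration of the multi-encoder idea (all module types and sizes here are invented for the sketch, not SpeechSplit's actual architecture):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Narrow-bottleneck encoder; the small hidden size limits what it can carry."""
    def __init__(self, in_dim, bottleneck):
        super().__init__()
        self.rnn = nn.GRU(in_dim, bottleneck, batch_first=True)
    def forward(self, x):
        out, _ = self.rnn(x)
        return out

content_enc = TinyEncoder(80, 16)   # mel spectrogram in, content code out
rhythm_enc  = TinyEncoder(80, 2)    # very narrow: roughly pacing only
pitch_enc   = TinyEncoder(1, 8)     # takes a normalized F0 contour
decoder     = nn.GRU(16 + 2 + 8 + 32, 80, batch_first=True)

mel    = torch.randn(4, 100, 80)    # (batch, frames, mels)
f0     = torch.randn(4, 100, 1)
timbre = torch.randn(4, 100, 32)    # speaker embedding broadcast over time

codes = torch.cat([content_enc(mel), rhythm_enc(mel), pitch_enc(f0), timbre], dim=-1)
recon, _ = decoder(codes)  # swap any one code at inference to convert that aspect
```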
Are end-to-end text-to-speech (TTS) models over-parametrized?
In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.
We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results than the original dense networks.
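One common way to expose such subnetworks is one-shot magnitude pruning; a hedged sketch using PyTorch's pruning utilities, with a toy model and an arbitrary 50% sparsity level:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 50% of weights with the smallest magnitudes.
        prune.l1_unstructured(module, name="weight", amount=0.5)

# The binary masks (module.weight_mask) define the sparse subnetwork; it can
# then be fine-tuned on low-resource ASR data while masked weights stay zero.
```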
Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform from a noisy input.
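That recipe reduces to a reconstruction objective on paired noisy/clean waveforms; an illustrative sketch with a toy network and random stand-in data:

```python
import torch
import torch.nn as nn

enhancer = nn.Sequential(                 # toy stand-in for a real enhancement net
    nn.Conv1d(1, 32, 15, padding=7), nn.ReLU(),
    nn.Conv1d(32, 1, 15, padding=7),
)
noisy = torch.randn(8, 1, 16000)          # noisy input waveforms
clean = torch.randn(8, 1, 16000)          # paired clean targets
estimate = enhancer(noisy)
loss = nn.functional.l1_loss(estimate, clean)  # waveform reconstruction objective
loss.backward()
```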
Speech information can be roughly decomposed into four components: linguistic content, timbre, pitch, and rhythm.
Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity from speech content with information-constraining bottlenecks; it performs zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.
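A simplified sketch of the bottleneck-plus-swap mechanism (dimensions and modules are illustrative, not AutoVC's actual architecture):

```python
import torch
import torch.nn as nn

content_enc = nn.GRU(80, 8, batch_first=True)       # narrow bottleneck squeezes out timbre
decoder     = nn.GRU(8 + 64, 80, batch_first=True)  # conditioned on a speaker embedding

mel_src = torch.randn(1, 120, 80)   # source utterance (batch, frames, mels)
emb_tgt = torch.randn(1, 64)        # target speaker embedding (e.g., from a d-vector net)

content, _ = content_enc(mel_src)
cond = emb_tgt.unsqueeze(1).expand(-1, content.size(1), -1)
converted, _ = decoder(torch.cat([content, cond], dim=-1))  # source content, target voice
```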
The second paradigm, called the zero-confidence attack, finds the smallest perturbation needed to cause misclassification, also known as the margin of an input feature.
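As a crude, hedged illustration of the idea, the margin can be upper-bounded by bisecting the perturbation size along a fixed gradient-sign direction until the smallest label-flipping magnitude is found (a toy linear model stands in for a real classifier; dedicated minimal-perturbation attacks such as DeepFool are more refined):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 5)                  # toy classifier
x = torch.randn(1, 20, requires_grad=True)
y = model(x).argmax(dim=1)                # current predicted label

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
direction = x.grad.sign()                 # FGSM-style loss-ascent direction

lo, hi = 0.0, 10.0                        # assume hi is large enough to flip the label
for _ in range(30):                       # binary search on perturbation size
    mid = (lo + hi) / 2
    flipped = model(x + mid * direction).argmax(dim=1) != y
    lo, hi = (lo, mid) if flipped.item() else (mid, hi)
margin_estimate = hi                      # smallest flipping magnitude found
```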
Convolutional neural networks (CNNs) for time-series data implicitly assume that the data are uniformly sampled, whereas many event-based and multi-modal data are nonuniformly sampled or have heterogeneous sampling rates.
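A common workaround is to resample the series onto a uniform grid before convolving; a minimal sketch using linear interpolation (signal and grid size are arbitrary):

```python
import numpy as np
import torch
import torch.nn as nn

t = np.sort(np.random.uniform(0.0, 1.0, size=200))  # irregular event times
v = np.sin(2 * np.pi * 5 * t)                       # values observed at those times

t_uniform = np.linspace(0.0, 1.0, 256)              # fixed-rate grid
v_uniform = np.interp(t_uniform, t, v)              # linear interpolation

x = torch.tensor(v_uniform, dtype=torch.float32).view(1, 1, -1)
conv = nn.Conv1d(1, 8, kernel_size=9, padding=4)    # a standard CNN now applies cleanly
features = conv(x)
```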
On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.
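For reference, the CVAE objective is just reconstruction plus a KL term, which is easy to optimize but exerts no adversarial distribution-matching pressure; a minimal sketch with invented dimensions:

```python
import torch
import torch.nn as nn

enc = nn.Linear(80 + 10, 2 * 16)   # condition is a 10-dim one-hot; 16-dim latent
dec = nn.Linear(16 + 10, 80)

x = torch.randn(32, 80)                          # data batch
c = torch.eye(10)[torch.randint(0, 10, (32,))]   # condition labels

mu, logvar = enc(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
recon = dec(torch.cat([z, c], dim=-1))

rec_loss = nn.functional.mse_loss(recon, x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = rec_loss + kl
loss.backward()
```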
On the other hand, deep-learning-based enhancement approaches are able to learn complicated speech distributions and perform efficient inference, but they are unable to deal with a variable number of input channels.
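One hedged sketch of a fix: share a per-channel encoder and pool across channels, so the network is indifferent to how many microphones are present (the architecture here is a toy assumption, not any specific paper's method):

```python
import torch
import torch.nn as nn

per_channel = nn.Conv1d(1, 32, 15, padding=7)   # shared weights for every channel
head = nn.Conv1d(32, 1, 15, padding=7)

def enhance(multichannel_wav):                  # (batch, channels, samples)
    b, c, s = multichannel_wav.shape
    feats = per_channel(multichannel_wav.reshape(b * c, 1, s))
    feats = feats.reshape(b, c, 32, s).mean(dim=1)   # pool over channels
    return head(feats)

print(enhance(torch.randn(2, 4, 16000)).shape)  # works for 4 channels
print(enhance(torch.randn(2, 6, 16000)).shape)  # ...and for 6
```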