To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of the training-set speakers.
To this end, we pose personalization as either a zero-shot learning task, in which no additional clean speech of the target speaker is used for training, or a few-shot learning task, in which the goal is to minimize the duration of the clean speech used for transfer learning.
Training personalized speech enhancement models is innately a zero-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user.
Podcast episodes often contain material extraneous to the main content, such as advertisements, interleaved within the audio and the written descriptions.
This work explores how self-supervised learning can be used universally to discover speaker-specific features, enabling personalized speech enhancement models.
In this paper, we investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks.
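The specialist-ensemble idea can be sketched as a gating step that routes each noisy utterance to the single module trained on the most similar speaker partition, so only one specialist runs at inference time. The sketch below is a minimal illustration under stated assumptions: the specialists are stand-in linear gains rather than trained neural networks, and the gating scores are supplied directly rather than produced by a speaker classifier.

```python
import numpy as np

# Toy stand-ins for trained specialist denoisers: each is a simple
# scalar "filter" specialized to one partition of training speakers.
# (Hypothetical sketch; the actual specialists are neural networks.)
specialists = [lambda x, g=g: g * x for g in (0.5, 0.8, 1.0)]

def gate(partition_scores: np.ndarray) -> int:
    """Pick the specialist whose speaker partition the utterance most
    resembles. In practice these scores would come from a classifier
    over speaker features; here they are given as input."""
    return int(np.argmax(partition_scores))

def denoise(noisy: np.ndarray, partition_scores: np.ndarray) -> np.ndarray:
    k = gate(partition_scores)    # select exactly one specialist
    return specialists[k](noisy)  # run only that module, for efficiency
```

Running only the selected specialist, rather than averaging all modules, is what keeps the ensemble's inference cost close to that of a single model.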
This approach differs from commercially used automatic pitch correction systems, where notes in the vocal tracks are shifted to be centered around notes in a user-defined score or mapped to the closest pitch among the twelve equal-tempered scale degrees.
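The snap-to-nearest-scale-degree behavior described above can be written down directly: a frequency is converted to a (fractional) semitone offset from a reference pitch, rounded to the nearest integer, and converted back. The reference of A4 = 440 Hz is an assumption for illustration, not something specified in the text.

```python
import math

A4 = 440.0  # reference pitch in Hz (assumed for this sketch)

def snap_to_equal_temperament(f_hz: float) -> float:
    """Map a frequency to the closest of the twelve equal-tempered
    scale degrees, i.e. round its semitone offset from A4."""
    n = round(12 * math.log2(f_hz / A4))  # nearest whole semitone
    return A4 * 2 ** (n / 12)

# A slightly sharp 450 Hz snaps back down to A4 = 440 Hz:
snap_to_equal_temperament(450.0)  # → 440.0
```

A user-defined score, as in the commercial systems mentioned, would replace the rounding step with a lookup of the nearest note permitted by that score.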