Pitch estimation is an essential step of many speech processing algorithms, including speech coding, synthesis, and enhancement.
Recent approaches in source separation leverage semantic information about their input mixtures and constituent sources; when used in conditional separation models, this information can yield impressive performance.
In this paper, we study unsupervised approaches to improving the learning of such representations using unpaired text and audio.
In this study, we present an approach to train a single speech enhancement network that can perform both personalized and non-personalized speech enhancement.
GAN vocoders are currently among the state-of-the-art methods for building high-quality neural waveform generative models.
During inference, we can dynamically adjust how many processing blocks and iterations of a specific block an input signal needs using a gating module.
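A minimal sketch of such inference-time gating, assuming a hypothetical `GatedBlock` with a learned halting gate (not the paper's exact module):

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Repeats a processing block until a learned gate says to stop."""
    def __init__(self, dim, max_iters=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.gate = nn.Linear(dim, 1)  # scores whether another pass is needed
        self.max_iters = max_iters

    def forward(self, x):
        for _ in range(self.max_iters):
            x = x + self.body(x)                         # one refinement pass
            keep_going = torch.sigmoid(self.gate(x).mean())
            if keep_going < 0.5:                         # gate says "enough"
                break
        return x
```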
Recent research has shown remarkable performance in leveraging multiple extraneous conditional and non-mutually exclusive semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries.
In this work, we propose Exformer, a time-domain architecture for target speaker extraction.
In this paper, we work on a sound recognition system that continually incorporates new sound classes.
As deep speech enhancement algorithms have recently demonstrated capabilities greatly surpassing their traditional counterparts for suppressing noise, reverberation and echo, attention is turning to the problem of packet loss concealment (PLC).
We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts (e.g., loudness, gender, language, spatial location, etc.).
Neural vocoders have recently demonstrated high-quality speech synthesis, but typically require high computational complexity.
Neural speech synthesis models can synthesize high-quality speech, but typically require high computational complexity to do so.
RemixIT is based on a continuous self-training scheme in which a teacher model pre-trained on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures.
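A minimal sketch of one self-training step in this spirit, where `teacher`, `student`, and `loss_fn` are stand-ins rather than the paper's exact components:

```python
import torch

def self_training_step(teacher, student, mixtures, optimizer, loss_fn):
    # Teacher's estimates on unlabeled in-domain mixtures act as labels.
    with torch.no_grad():
        pseudo_targets = teacher(mixtures)
    estimates = student(mixtures)
    loss = loss_fn(estimates, pseudo_targets)  # e.g., negative SI-SDR or L1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```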
We present a data-driven approach to automate audio signal processing by incorporating stateful, third-party audio effects as layers within a deep neural network.
We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients.
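A minimal sketch of one federated averaging round under this setup, with a hypothetical `client.local_update` performing a few epochs on each client's private data:

```python
import copy
import torch

def fed_round(global_model, clients):
    client_states = []
    for client in clients:
        local = copy.deepcopy(global_model)
        client.local_update(local)             # train on private, non-IID audio
        client_states.append(local.state_dict())
    # Average parameters across clients and load into the global model.
    avg = {k: torch.stack([s[k].float() for s in client_states]).mean(0)
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```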
As a consequence, most audio machine learning models are designed to process fixed-size vector inputs, which often prohibits the repurposing of learned models on audio with different sampling rates or alternative representations.
Recent progress in audio source separation led by deep learning has enabled many neural network models to provide robust solutions to this fundamental estimation problem.
In this paper, we propose a simple, unified gradient reweighting scheme, with a lightweight modification to bias the learning process of a model and steer it towards a certain distribution of results.
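A minimal sketch of per-example loss reweighting in this spirit; `example_weights` is a hypothetical stand-in for whatever weighting the scheme derives:

```python
import torch

def reweighted_loss(per_example_losses, example_weights):
    # Scaling each example's loss scales its gradient contribution,
    # biasing learning toward the desired distribution of results.
    w = example_weights / example_weights.sum()
    return (w * per_example_losses).sum()
```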
In this paper, we present an efficient neural network for end-to-end, general-purpose audio source separation.
Supervised learning for single-channel speech enhancement requires carefully labeled training examples where the noisy mixture is input into the network and the network is trained to produce an output close to the ideal target.
In the first step, we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal.
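A minimal sketch of oracle masking in such a learned latent space, with hypothetical `encode`/`decode` standing in for the transform and its inverse:

```python
def oracle_separate(mixture, src1, src2, encode, decode, eps=1e-8):
    # Oracle ratio mask computed from the true sources, in the latent domain.
    zm, z1, z2 = encode(mixture), encode(src1), encode(src2)
    mask1 = z1.abs() / (z1.abs() + z2.abs() + eps)
    return decode(mask1 * zm)
```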
We show that, by incrementally refining a classifier with generative replay, a generator that is 4% of the size of all previous training data matches the performance of refining the classifier while keeping 20% of all previous training data.
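A minimal sketch of generative replay during refinement; `generator.sample` is a hypothetical interface for drawing pseudo-examples of past classes:

```python
import torch

def replay_batch(generator, new_x, new_y, n_replay):
    # Pseudo-examples of past classes stand in for stored training data.
    with torch.no_grad():
        old_x, old_y = generator.sample(n_replay)
    x = torch.cat([new_x, old_x])  # mix replayed and new-class data
    y = torch.cat([new_y, old_y])
    return x, y
```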
We propose a completely unsupervised method to understand audio scenes observed with random microphone arrangements by decomposing the scene into its constituent sources and their relative presence in each microphone.
We present a monophonic source separation system that is trained by only observing mixtures with no ground truth separation information.
The performance of single-channel source separation algorithms has improved greatly in recent years with the development and deployment of neural networks.
Popular generative model learning methods such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) enforce the latent representation to follow simple distributions such as an isotropic Gaussian.
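The isotropic-Gaussian constraint is typically imposed via an analytic KL term, as in a standard VAE:

```python
import torch

def kl_to_isotropic_gaussian(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)
```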
We propose a novel speech enhancement method based on a Bayesian formulation of nonnegative matrix factorization (BNMF).
Nonnegative matrix factorization (NMF) has been actively investigated and used in a wide range of problems in the past decade.
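For reference, the standard multiplicative updates for NMF under a Euclidean (Frobenius) loss (Lee & Seung), shown for context rather than tied to any one paper above:

```python
import numpy as np

def nmf(V, rank, n_iters=200, eps=1e-9):
    """Factor a nonnegative matrix V (n x m) as W @ H with W, H >= 0."""
    n, m = V.shape
    W = np.abs(np.random.rand(n, rank))
    H = np.abs(np.random.rand(rank, m))
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update bases
    return W, H
```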
In this paper, we propose NoiseOut, a fully automated pruning algorithm based on the correlation between activations of neurons in the hidden layers.
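A minimal sketch of the correlation statistic such pruning relies on; the merging of correlated neurons is omitted:

```python
import numpy as np

def most_correlated_pair(activations):
    """Find the most correlated pair of neurons from an (examples x neurons)
    activation matrix; such pairs are candidates for merging/pruning."""
    corr = np.corrcoef(activations, rowvar=False)  # neuron-by-neuron correlations
    np.fill_diagonal(corr, 0.0)                    # ignore self-correlation
    i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
    return i, j, corr[i, j]
```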
Based on the assumption that there exists a neural network that efficiently represents a set of Boolean functions between all binary inputs and outputs, we propose a process for developing and deploying neural networks whose weight parameters, bias terms, inputs, and intermediate hidden-layer outputs are all binary-valued and require only basic bit logic for the feedforward pass.
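A toy illustration of a binary feedforward step using only bit logic (XNOR plus popcount); the packing and threshold here are illustrative, not the paper's deployment pipeline:

```python
def binary_neuron(x_bits, w_bits, n_bits, threshold):
    """One +/-1 binary neuron: XNOR counts matching bits as +1, and the
    popcount-vs-threshold test implements the sign of the dot product."""
    xnor = ~(x_bits ^ w_bits) & ((1 << n_bits) - 1)  # matching bits -> 1
    popcount = bin(xnor).count("1")
    return 1 if popcount >= threshold else 0
```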
We argue that due to the specific structure of the activation matrix $R$ in the shared component factorial mixture model, and an incoherence assumption on the shared component, it is possible to extract the columns of the $O$ matrix without the need for alternating between the estimation of $O$ and $R$.
In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising.
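A generic sketch of the kind of soft time-frequency masking layer that is jointly optimized with the network (not the paper's exact formulation):

```python
import torch

def apply_masks(mix_spec, est1, est2, eps=1e-8):
    # Normalize the two source estimates so the masks sum to one per
    # time-frequency bin, then apply them to the mixture spectrogram.
    denom = est1.abs() + est2.abs() + eps
    m1, m2 = est1.abs() / denom, est2.abs() / denom
    return m1 * mix_spec, m2 * mix_spec
```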