Speech Separation
115 papers with code • 19 benchmarks • 16 datasets
The task of extracting all overlapping speech sources in a given mixed speech signal refers to the Speech Separation. Speech Separation is a special scenario of source separation problem, where the focus is only on the overlapping speech signal sources and other interferences such as music or noise signals are not the main concern of the study.
Source: A Unified Framework for Speech Separation
Image credit: Speech Separation of A Target Speaker Based on Deep Neural Networks
Libraries
Use these libraries to find Speech Separation models and implementationsMost implemented papers
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms.
Deep clustering: Discriminative embeddings for segmentation and separation
The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources.
Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation
Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods.
Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation
By introduces a improved transformer, elements in speech sequences can interact directly, which enables DPTNet can model for the speech sequences with direct context-awareness.
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video.
Voice Separation with an Unknown Number of Multiple Speakers
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
Sudo rm -rf: Efficient Networks for Universal Audio Source Separation
In this paper, we present an efficient neural network for end-to-end general purpose audio source separation.
Attention is All You Need in Speech Separation
Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism.
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on Non-negative Matrix Factorization (NMF) and Computational Auditory Scene Analysis (CASA), and compares favorably with Deep Clustering (DPCL) and the Deep Attractor Network (DANet).
TasNet: time-domain audio separation network for real-time, single-channel speech separation
We directly model the signal in the time-domain using an encoder-decoder framework and perform the source separation on nonnegative encoder outputs.