Speech Separation
96 papers with code • 18 benchmarks • 16 datasets
Speech Separation is the task of extracting all overlapping speech sources from a given mixed speech signal. It is a special case of the general source separation problem: the focus is solely on the overlapping speech sources, while other interference such as music or noise signals is not the main concern.
Source: A Unified Framework for Speech Separation
Image credit: Speech Separation of A Target Speaker Based on Deep Neural Networks
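Separation quality is commonly reported with the scale-invariant signal-to-noise ratio (SI-SNR), which scores an estimated source against its reference independently of overall gain. Below is a minimal NumPy sketch of this metric; the function name `si_snr` and the test signal are illustrative, not from any particular paper on this page.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB, a standard speech-separation metric.
    Both signals are zero-meaned before projection."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the scale-invariant target
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

t = np.linspace(0, 1, 8000)
ref = np.sin(2 * np.pi * 440 * t)
print(si_snr(0.5 * ref, ref))  # rescaling the estimate leaves SI-SNR essentially unchanged
```

Because the metric projects out the gain, a correctly separated source scores high even if its amplitude is wrong, which is why SI-SNR (and the improvement SI-SNRi over the unprocessed mixture) is the headline number on most of the benchmarks listed here.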
Libraries
Use these libraries to find Speech Separation models and implementations
Latest papers
SPMamba: State-space model is all you need in speech separation
Notably, within computer vision, Mamba-based methods have been celebrated for their formidable performance and reduced computational requirements.
Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation
In this work, we replace transformers with Mamba, a selective state space model, for speech separation.
Online speaker diarization of meetings guided by speech separation
The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech).
TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion
TDANet serves as the architectural foundation for the auditory and visual networks within TDFNet, offering an efficient model with fewer parameters.
On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments
Convolution augmented transformers (conformers) have performed well for many speech processing tasks but have been under-researched for speech separation.
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
SPGM: Prioritizing Local Features for enhanced speech separation performance
Dual-path is a popular architecture for speech separation models (e.g. Sepformer). It splits long sequences into overlapping chunks, with intra-blocks modelling local features within each chunk and inter-blocks modelling global relationships across chunks.
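The chunking step that dual-path models share can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not code from SPGM or Sepformer: it pads a (time, feature) sequence and stacks 50%-overlapping chunks, the tensor that intra-blocks (within a chunk) and inter-blocks (across chunks at the same position) then operate on.

```python
import numpy as np

def split_into_chunks(x, chunk_size, hop):
    """Split a (time, feature) sequence into overlapping chunks.

    Returns an array of shape (num_chunks, chunk_size, feature),
    zero-padding the end so every chunk is full length.
    """
    T, F = x.shape
    n_chunks = max(1, int(np.ceil((T - chunk_size) / hop)) + 1)
    padded = np.zeros((hop * (n_chunks - 1) + chunk_size, F))
    padded[:T] = x
    return np.stack([padded[i * hop : i * hop + chunk_size]
                     for i in range(n_chunks)])

x = np.random.randn(100, 8)          # 100 frames, 8 features
chunks = split_into_chunks(x, chunk_size=32, hop=16)  # 50% overlap
print(chunks.shape)                  # (6, 32, 8)
```

With this layout, an intra-block attends over axis 1 (within each chunk of 32 frames) while an inter-block attends over axis 0, which is how a dual-path model covers long sequences without quadratic cost over the full length.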
Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
We propose Audio-Visual Lightweight ITerative model (AVLIT), an effective and lightweight neural network that uses Progressive Learning (PL) to perform audio-visual speech separation in noisy environments.
A Neural State-Space Model Approach to Efficient Speech Separation
In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM).
MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions
To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence.