TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

6 Aug 2024  ·  Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | TF-Locoformer (M) | SI-SDR-WB | 23.3 | # 1 |
| | | | PESQ-WB | 3.72 | # 2 |
| | | | STOI | 98.8 | # 1 |
| | | | Number of parameters (M) | 15 | # 3 |
| | | | FLOPS (G) | 497.24 | # 3 |
| Speech Separation | Libri2Mix | TF-Locoformer (M) | SI-SDRi | 22.1 | # 2 |
| | | | SDRi | 22.2 | # 1 |
| | | | Number of parameters (M) | 15 | # 1 |
| Speech Separation | WHAMR! | TF-Locoformer (S) | SI-SDRi | 17.4 | # 2 |
| | | | SDRi | 15.9 | # 2 |
| | | | Number of parameters (M) | 5 | # 3 |
| Speech Separation | WHAMR! | TF-Locoformer (M) | SI-SDRi | 18.5 | # 1 |
| | | | SDRi | 16.9 | # 1 |
| | | | Number of parameters (M) | 15 | # 4 |
| Speech Separation | WSJ0-2mix | TF-Locoformer (M) | SI-SDRi | 23.6 | # 8 |
| | | | SDRi | 23.8 | # 5 |
| | | | Number of parameters (M) | 15.0 | # 6 |
| Speech Separation | WSJ0-2mix | TF-Locoformer (S) + DM | SI-SDRi | 22.8 | # 9 |
| | | | SDRi | 23 | # 6 |
| | | | Number of parameters (M) | 5.0 | # 3 |
| Speech Separation | WSJ0-2mix | TF-Locoformer (S) | SI-SDRi | 22 | # 17 |
| | | | SDRi | 22.1 | # 9 |
| | | | Number of parameters (M) | 5.0 | # 3 |
| Speech Separation | WSJ0-2mix | TF-Locoformer (L) | SI-SDRi | 24.2 | # 4 |
| | | | SDRi | 24.3 | # 4 |
| | | | Number of parameters (M) | 22.5 | # 8 |
| Speech Separation | WSJ0-2mix | TF-Locoformer (L) + DM | SI-SDRi | 25.1 | # 1 |
| | | | SDRi | 25.2 | # 1 |
| | | | Number of parameters (M) | 22.5 | # 8 |
| Speech Separation | WSJ0-2mix | TF-Locoformer (M) + DM | SI-SDRi | 24.6 | # 3 |
| | | | SDRi | 24.7 | # 3 |
| | | | Number of parameters (M) | 15.0 | # 6 |
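
For readers unfamiliar with the SI-SDRi metric reported above, the following is a minimal sketch of the commonly used definition of scale-invariant signal-to-distortion ratio (SI-SDR) and of its improvement over the unprocessed mixture, conventionally expressed in dB. It is not the evaluation code behind these results. The function names, the eps stabilizer, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

def si_sdr_improvement(estimate, mixture, reference):
    # SI-SDRi: gain of the separated estimate over the raw mixture.
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Synthetic example: a tone as the target, white noise as the interference.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 220.0 * t)
interference = rng.normal(size=t.shape)
mixture = reference + interference
estimate = reference + 0.1 * interference  # interference attenuated 10x in amplitude

improvement = si_sdr_improvement(estimate, mixture, reference)
print(round(float(improvement), 1))
```

Attenuating the interference by a factor of 10 in amplitude gives an improvement of roughly 20 dB in this toy setup. Rescaling the estimate leaves the score essentially unchanged, which is what makes the metric scale-invariant.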

Methods