CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

22 Sep 2022 · Sherif Abdulatif, Ruizhe Cao, Bin Yang

Convolution-augmented transformers (Conformers) have recently been proposed for various speech-domain applications, such as automatic speech recognition (ASR) and speech separation, as they can capture both local and global dependencies. In this paper, we propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement (SE) in the time-frequency (TF) domain. The generator encodes the magnitude and complex spectrogram information using two-stage conformer blocks to model both time and frequency dependencies. The decoder then decouples the estimation into a magnitude mask decoder branch to filter out unwanted distortions and a complex refinement branch to further improve the magnitude estimation and implicitly enhance the phase information. Additionally, we include a metric discriminator to alleviate metric mismatch by optimizing the generator with respect to a corresponding evaluation score. Objective and subjective evaluations show that CMGAN outperforms state-of-the-art methods in three speech enhancement tasks (denoising, dereverberation and super-resolution). For instance, quantitative denoising analysis on the Voice Bank+DEMAND dataset indicates that CMGAN outperforms various previous models by a margin, achieving a PESQ of 3.41 and an SSNR of 11.10 dB.
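The two decoder branches described above can be illustrated with a minimal sketch. It assumes that the magnitude mask is applied to the noisy magnitude while the noisy phase is kept, and that the complex refinement output is then added as a residual. The abstract does not state this combination rule, so the function name `combine_branches` and its arguments are hypothetical and only meant to show why a complex residual can change both magnitude and phase.

```python
import cmath


def combine_branches(noisy_tf, mask, residual):
    """Combine a magnitude mask and a complex residual per TF bin.

    noisy_tf : list of complex noisy spectrogram bins
    mask     : list of non-negative floats (assumed mask decoder output)
    residual : list of complex values (assumed refinement decoder output)
    """
    enhanced = []
    for y, m, r in zip(noisy_tf, mask, residual):
        magnitude, phase = cmath.polar(y)
        # Masking scales the magnitude but keeps the noisy phase.
        masked = cmath.rect(m * magnitude, phase)
        # The complex residual can adjust both magnitude and phase.
        enhanced.append(masked + r)
    return enhanced


noisy = [3 + 4j]
mask_only = combine_branches(noisy, [0.5], [0j])
refined = combine_branches(noisy, [0.5], [0.5 - 1j])
```

With a zero residual, the output keeps the phase of the noisy bin; with a non-zero residual, the phase moves, which is one way to read "implicitly enhance the phase information".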

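Task                    Dataset              Model  Metric Name            Metric Value  Global Rank
Audio Super-Resolution  VCTK Multi-Speaker   CMGAN  Log-Spectral Distance  0.76          # 1
Speech Enhancement      VoiceBank + DEMAND   CMGAN  PESQ                   3.41          # 3
Speech Enhancement      VoiceBank + DEMAND   CMGAN  CSIG                   4.63          # 2
Speech Enhancement      VoiceBank + DEMAND   CMGAN  CBAK                   3.94          # 2
Speech Enhancement      VoiceBank + DEMAND   CMGAN  COVL                   4.12          # 2
Speech Enhancement      VoiceBank + DEMAND   CMGAN  STOI                   96            # 1
Speech Enhancement      VoiceBank + DEMAND   CMGAN  SSNR                   11.1          # 1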
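
For readers unfamiliar with the super-resolution metric listed above, the following is a minimal sketch of the log-spectral distance in its commonly used form: a root-mean-square difference of log10 power spectra over frequency, averaged over frames. This page does not state the exact definition used for the benchmark, so the formula, the `eps` stabilizer and the function name `log_spectral_distance` are assumptions.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Average over frames of the RMS log10 power-spectrum difference.

    ref_power, est_power : lists of frames, each a list of power values
    eps                  : small constant to avoid log10(0) (assumption)
    """
    frame_distances = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        squared_diffs = [
            (math.log10(r + eps) - math.log10(e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # Root-mean-square over frequency bins for this frame.
        frame_distances.append(math.sqrt(sum(squared_diffs) / len(squared_diffs)))
    # Mean over frames; lower values mean closer spectra.
    return sum(frame_distances) / len(frame_distances)


identical = log_spectral_distance([[1.0, 4.0]], [[1.0, 4.0]])
mismatch = log_spectral_distance([[1.0, 100.0]], [[1.0, 1.0]])
```

Under this definition a lower value is better, and identical spectra give a distance of zero.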

