Continuous Speech Separation with Conformer

13 Aug 2020 · Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou

Continuous speech separation plays a vital role in complicated speech-related tasks such as conversation transcription. The separation model extracts a single-speaker signal from overlapped speech. In this paper, we use transformer and conformer architectures in lieu of recurrent neural networks in the separation system, as we believe capturing global information with self-attention is crucial for speech separation. Evaluated on the LibriCSS dataset, the conformer separation model achieves state-of-the-art results, with a relative 23.5% word error rate (WER) reduction from bi-directional LSTM (BLSTM) in the utterance-wise evaluation and a 15.4% WER reduction in the continuous evaluation.
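The conformer block that replaces the BLSTM layers interleaves feed-forward, self-attention, and convolution modules, so every frame can attend to the whole sequence while the convolution captures local structure. Below is a minimal single-head NumPy sketch of that block layout; the toy dimensions, ReLU feed-forward, and plain depthwise convolution are illustrative stand-ins for the actual model's multi-head attention, Swish/GLU activations, and batch-normalized convolution module, and all parameter names are mine.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each frame over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention over the full sequence:
    # this is the "global information" the paper contrasts with RNNs.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def depthwise_conv(x, kernel):
    # Per-channel 1-D convolution along time; kernel: [width, dim].
    width, _ = kernel.shape
    pad = width // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + width] * kernel).sum(0)
    return out

def conformer_block(x, p):
    # Macaron structure: half-step FFN, attention, conv, half-step FFN,
    # each wrapped in a pre-norm residual connection.
    x = x + 0.5 * (np.maximum(layer_norm(x) @ p["ff1"], 0) @ p["ff1b"])
    x = x + self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"])
    x = x + depthwise_conv(layer_norm(x), p["conv"])
    x = x + 0.5 * (np.maximum(layer_norm(x) @ p["ff2"], 0) @ p["ff2b"])
    return layer_norm(x)

rng = np.random.default_rng(0)
T, D = 20, 16  # toy sequence length and feature dimension
p = {name: 0.1 * rng.standard_normal(shape) for name, shape in {
    "ff1": (D, 4 * D), "ff1b": (4 * D, D),
    "ff2": (D, 4 * D), "ff2b": (4 * D, D),
    "wq": (D, D), "wk": (D, D), "wv": (D, D),
    "conv": (15, D)}.items()}
y = conformer_block(rng.standard_normal((T, D)), p)
print(y.shape)  # → (20, 16): same shape in and out, so blocks stack
</imports>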

Results from the Paper


 Ranked #1 on Speech Separation on LibriCSS (using extra training data)

Task: Speech Separation · Dataset: LibriCSS · Metric: WER (%) by overlap condition
(0S/0L: 0% overlap with short/long inter-utterance silence)

Model               0S    0L    10%    20%    30%    40%    Global Rank
Conformer (large)   5.4   5.0   7.5    10.7   13.8   17.1   #1
Conformer (base)    5.6   5.4   8.2    11.8   15.5   18.9   #2
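The abstract's "23.5% relative WER reduction" is a relative, not absolute, difference. The same formula applied to the table above quantifies the gap between the two conformer sizes; a quick sketch (the helper name is mine, values are from the table):

```python
def relative_wer_reduction(baseline, model):
    """Fraction of the baseline's word errors that the model removes."""
    return (baseline - model) / baseline

# Conformer (base) as baseline vs. Conformer (large), LibriCSS.
print(round(relative_wer_reduction(5.6, 5.4) * 100, 1))    # 0S:  → 3.6
print(round(relative_wer_reduction(18.9, 17.1) * 100, 1))  # 40%: → 9.5
```

Note that the large model's advantage grows with the overlap ratio, consistent with global self-attention helping most on heavily overlapped speech.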
