TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Speaker Diarization	CALLHOME	SA-EEND (2-spk, adapted)	DER(%)	10.76	# 2
Speaker Diarization	CALLHOME	SA-EEND (2-spk, adapted)	FA	6.68	# 2
Speaker Diarization	CALLHOME	SA-EEND (2-spk, adapted)	MI	2.40	# 1
Speaker Diarization	CALLHOME	SA-EEND (2-spk, adapted)	CF	1.68	# 2
Speaker Diarization	CALLHOME	SA-EEND (2-spk, no-adapt)	DER(%)	12.66	# 4
Speaker Diarization	CALLHOME	SA-EEND (2-spk, no-adapt)	FA	7.42	# 3
Speaker Diarization	CALLHOME	SA-EEND (2-spk, no-adapt)	MI	3.93	# 2
Speaker Diarization	CALLHOME	SA-EEND (2-spk, no-adapt)	CF	1.31	# 1
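
The component rows in the table above appear to add up to the reported DER for each model. The short check below is a sketch under one assumption: that FA, MI, and CF stand for false alarm, missed speech, and speaker confusion, each in percent. The page itself does not expand these abbreviations. The numbers are copied from the table; the dictionary layout and the `component_sum` helper are illustrative names only.

```python
# Values copied from the results table above.
# Assumption: FA = false alarm, MI = missed speech, CF = speaker confusion (all in %).
results = {
    "SA-EEND (2-spk, adapted)": {"DER": 10.76, "FA": 6.68, "MI": 2.40, "CF": 1.68},
    "SA-EEND (2-spk, no-adapt)": {"DER": 12.66, "FA": 7.42, "MI": 3.93, "CF": 1.31},
}


def component_sum(row):
    """Return FA + MI + CF for one table row, rounded to two decimals."""
    return round(row["FA"] + row["MI"] + row["CF"], 2)


for model, row in results.items():
    total = component_sum(row)
    print(f"{model}: FA + MI + CF = {total:.2f}, reported DER = {row['DER']:.2f}")
```

Read this way, adaptation lowers false alarms and misses while confusion rises slightly, which is consistent with the ranks shown for each metric.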


End-to-End Neural Speaker Diarization with Self-attention

13 Sep 2019 · Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

Speaker diarization has mainly been developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems: (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method even outperformed the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that self-attention can capture global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.
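
To make the abstract's contrast concrete, here is a minimal sketch of how self-attention lets every frame draw on every other frame in a recording. It is not the paper's model: it assumes plain scaled dot-product attention with identity projections, and it leaves out learned weights, multiple heads, and everything else a trained system would use. The function name `self_attention` and the toy frames are hypothetical, chosen only for illustration.

```python
import math


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections (illustrative only).

    frames: list of T feature vectors (lists of floats) of equal length.
    Returns (outputs, weights); weights[t][s] is how much frame t attends to frame s.
    """
    dim = len(frames[0])
    outputs, weights = [], []
    for query in frames:
        # Compare this frame with every frame in the recording, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        # Numerically stable softmax over all T frames.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Each output frame is a weighted average of all input frames.
        outputs.append(
            [sum(w * f[d] for w, f in zip(row, frames)) for d in range(dim)]
        )
    return outputs, weights


# Toy "recording": frames 0 and 3 look alike, as do frames 1 and 2.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
for t, row in enumerate(weights):
    print(t, [round(w, 3) for w in row])
```

In this toy input, frame 0 gives more weight to the distant but similar frame 3 than to its immediate neighbor, frame 1. A recurrent layer would instead have to carry that information through the intermediate hidden states.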


Code


hitachi-speech/EEND official

347

JunzheJosephZhu/EEND-for-LENA

Tasks


Clustering


Speaker Diarization

Datasets

CALLHOME American English Speech

Results from the Paper


Ranked #2 on Speaker Diarization on CALLHOME


