TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context

8 Oct 2021  ·  Nithin Rao Koluguri, Taejin Park, Boris Ginsburg ·

In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context followed by channel attention based statistics pooling layer to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on speaker verification task with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file and also on speaker diarization tasks with diarization error rate (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieve near state-of-the-art results in diarization tasks.

PDF Abstract

Results from the Paper

Task Dataset Model Metric Name Metric Value Global Rank Benchmark
Speaker Diarization AMI Lapel TitaNet-L (NME-SC) DER(%) 2.03 # 3
Speaker Diarization AMI Lapel TitaNet-M (NME-SC) DER(%) 1.99 # 1
Speaker Diarization AMI Lapel TitaNet-S (NME-SC) DER(%) 2.00 # 2
Speaker Diarization AMI Lapel ECAPA (SC) DER(%) 2.36 # 4
Speaker Diarization AMI MixHeadset TitaNet-L (NME-SC) DER(%) 1.73 # 1
Speaker Diarization AMI MixHeadset TitaNet-M (NME-SC) DER(%) 1.79 # 3
Speaker Diarization AMI MixHeadset ECAPA (SC) DER(%) 1.78 # 2
Speaker Diarization AMI MixHeadset TitaNet-S (NME-SC) DER(%) 2.22 # 4
Speaker Diarization CH109 TitaNet-S (NME-SC) DER(%) 1.11 # 1
Speaker Diarization CH109 x-vector (PLDA + AHC) DER(%) 9.72 # 4
Speaker Diarization CH109 TitaNet-L (NME-SC) DER(%) 1.19 # 3
Speaker Diarization CH109 TitaNet-M (NME-SC) DER(%) 1.13 # 2
Speaker Diarization NIST-SRE 2000 TitaNet-M (NME-SC) DER(%) 6.47 # 3
Speaker Diarization NIST-SRE 2000 TitaNet-S (NME-SC) DER(%) 6.37 # 2
Speaker Diarization NIST-SRE 2000 TitaNet-L (NME-SC) DER(%) 6.73 # 4
Speaker Diarization NIST-SRE 2000 x-vector (MCGAN) DER(%) 5.73 # 1
Speaker Diarization NIST-SRE 2000 x-vector (PLDA + AHC) DER(%) 8.39 # 5


No methods listed for this paper. Add relevant methods here