Speaker Diarization Using Stereo Audio Channels: Preliminary Study on Utterance Clustering

Speaker diarization is one of the actively researched topics in audio signal processing and machine learning. Utterance clustering is a critical part of a speaker diarization task... In this study, we aim to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. We generated processed audio signals by combining left- and right-channel audio signals in a few different ways and then extracted embedded features (also called d-vectors) from those processed audio signals. We applied the Gaussian mixture model (GMM) for supervised utterance clustering. In the training phase, we used a parameter sharing GMM to train the model for each speaker. In the testing phase, we selected the speaker with the maximum likelihood as the detected speaker. Results of experiments with real audio recordings of multi-person discussion sessions showed that our proposed method that used multichannel audio signals achieved significantly better performance than a conventional method with mono audio signals. read more

PDF Abstract
No code implementations yet. Submit your code now

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here