Speaker Diarization is the task of segmenting and co-indexing audio recordings by speaker. As the task is commonly defined, the goal is not to identify known speakers but to co-index segments attributed to the same speaker: diarization means finding speaker boundaries, grouping segments that belong to the same speaker, and, as a by-product, determining the number of distinct speakers. In combination with speech recognition, diarization enables speaker-attributed speech-to-text transcription.
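For illustration, a minimal sketch (plain Python, hypothetical segment values) of what a diarization result looks like and how the number of distinct speakers and per-speaker speaking time fall out of it:

```python
from collections import defaultdict

# Hypothetical diarization output: (start_sec, end_sec, speaker_label) tuples.
# Labels are anonymous cluster IDs, not known identities.
segments = [
    (0.0, 3.2, "spk0"),
    (3.2, 7.5, "spk1"),
    (7.5, 9.1, "spk0"),
    (9.1, 12.4, "spk2"),
]

# The number of distinct speakers is a by-product of the clustering.
num_speakers = len({spk for _, _, spk in segments})

# Total speaking time attributed to each speaker cluster.
speaking_time = defaultdict(float)
for start, end, spk in segments:
    speaking_time[spk] += end - start

print(num_speakers)  # 3
print({spk: round(t, 1) for spk, t in speaking_time.items()})
# {'spk0': 4.8, 'spk1': 4.3, 'spk2': 3.3}
```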
One of the associated datasets contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible.
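A minimal sketch of what one annotated face instance might look like; the field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class FaceInstance:
    """One temporally localized face box with its active-speaker labels.

    Field names are hypothetical and do not mirror the dataset's real format.
    """
    video_id: str
    timestamp: float       # seconds into the video
    box: tuple             # (x1, y1, x2, y2), normalized coordinates
    is_speaking: bool      # the face is speaking at this time
    speech_audible: bool   # the speech is audible on the audio track

example = FaceInstance("clip_001", 12.48, (0.31, 0.22, 0.47, 0.58), True, True)
```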
In this study, we propose a new spectral clustering framework that can auto-tune the parameters of the clustering algorithm in the context of speaker diarization.
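The paper's actual auto-tuning operates on the affinity matrix itself; as a generic stand-in, the sketch below shows spectral clustering over per-segment speaker embeddings, with the speaker count picked by a simple eigengap heuristic (NumPy/scikit-learn, illustrative only, not the authors' procedure):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize_embeddings(embeddings, max_speakers=8):
    """Cluster per-segment speaker embeddings (N x D array) by speaker."""
    # Cosine affinity between segment embeddings, clipped to [0, 1].
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, 1.0)

    # Symmetric normalized Laplacian: L = I - D^(-1/2) A D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(affinity)) - affinity * np.outer(d_inv_sqrt, d_inv_sqrt)

    # Eigenvalues in ascending order; the largest gap among the smallest
    # eigenvalues suggests the number of clusters, i.e. speakers.
    eigvals = np.linalg.eigvalsh(laplacian)
    gaps = np.diff(eigvals[: max_speakers + 1])
    n_speakers = int(np.argmax(gaps)) + 1

    labels = SpectralClustering(
        n_clusters=n_speakers, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return labels, n_speakers
```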
Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels.
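When the two models are kept separate, a common way to obtain such speaker-labeled transcripts is to assign each ASR word to the diarization segment it overlaps most. A minimal sketch with hypothetical word and segment timings:

```python
def attribute_words(words, segments):
    """Assign each ASR word to the diarization segment with maximal overlap.

    words:    list of (start, end, text) from an ASR system
    segments: list of (start, end, speaker) from a diarization system
    Returns (speaker, text) pairs; words with no overlapping segment get None.
    """
    attributed = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for s_start, s_end, speaker in segments:
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        attributed.append((best_speaker, text))
    return attributed

words = [(0.2, 0.5, "hello"), (0.6, 1.0, "there"), (3.4, 3.8, "hi")]
segments = [(0.0, 3.2, "spk0"), (3.2, 7.5, "spk1")]
print(attribute_words(words, segments))
# [('spk0', 'hello'), ('spk0', 'there'), ('spk1', 'hi')]
```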
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper.
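The attractor idea can be summarized by its decoding step: the model emits a variable number of candidate attractors, keeps those whose existence probability exceeds a threshold (which fixes the speaker count), and scores every frame against each retained attractor. A rough NumPy sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def attractor_decode(frame_embs, attractors, existence_probs,
                     exist_thresh=0.5, speech_thresh=0.5):
    """Decode speaker activities from candidate attractors.

    frame_embs:      (T, D) frame-level embeddings from the encoder
    attractors:      (S_max, D) candidate speaker attractors from the decoder
    existence_probs: (S_max,) probability that each attractor is a real speaker
    All inputs are placeholders for the model's actual outputs.
    """
    # The number of speakers is decided by thresholding attractor existence.
    active = existence_probs >= exist_thresh
    n_speakers = int(active.sum())

    # Per-frame, per-speaker speech posteriors: sigmoid of dot products
    # between frame embeddings and the retained attractors.
    logits = frame_embs @ attractors[active].T        # (T, n_speakers)
    posteriors = 1.0 / (1.0 + np.exp(-logits))

    # Binary speech activity per frame and speaker; overlap is allowed.
    activity = posteriors >= speech_thresh
    return activity, n_speakers
```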
Automatic Speech Recognition (ASR) systems are increasingly powerful and accurate, and also more numerous, with several currently offered as a service (e.g., by Google, IBM, and Microsoft).