Speaker Diarization
90 papers with code • 14 benchmarks • 11 datasets
Speaker Diarization is the task of segmenting and co-indexing audio recordings by speaker. The way the task is commonly defined, the goal is not to identify known speakers, but to co-index segments that are attributed to the same speaker; in other words, diarization implies finding speaker boundaries and grouping segments that belong to the same speaker, and, as a by-product, determining the number of distinct speakers. In combination with speech recognition, diarization enables speaker-attributed speech-to-text transcription.
Source: Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm
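The description above notes that diarization plus speech recognition yields speaker-attributed transcripts. As a rough illustration (all names, timestamps, and data below are hypothetical, not from any specific system), the sketch assigns ASR word timestamps to diarized speaker turns by overlapping each word's midpoint with a turn:

```python
# Hypothetical sketch: combine diarization output with ASR word timestamps
# to produce a speaker-attributed transcript. All inputs below are made up.
from dataclasses import dataclass

@dataclass
class Turn:
    start: float   # seconds
    end: float
    speaker: str   # anonymous label such as "spk0", "spk1"

@dataclass
class Word:
    start: float
    end: float
    text: str

def attribute_words(turns: list[Turn], words: list[Word]) -> list[tuple[str, str]]:
    """Assign each ASR word to the diarized turn that contains its midpoint."""
    transcript = []
    for w in words:
        mid = (w.start + w.end) / 2
        speaker = next(
            (t.speaker for t in turns if t.start <= mid < t.end),
            "unknown",
        )
        transcript.append((speaker, w.text))
    return transcript

# Toy example with invented timestamps.
turns = [Turn(0.0, 2.5, "spk0"), Turn(2.5, 5.0, "spk1")]
words = [Word(0.2, 0.6, "hello"), Word(2.8, 3.1, "hi"), Word(3.2, 3.6, "there")]
print(attribute_words(turns, words))
# [('spk0', 'hello'), ('spk1', 'hi'), ('spk1', 'there')]
```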
Most implemented papers
AVA-AVD: Audio-Visual Speaker Diarization in the Wild
Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals.
Speaker Diarization with LSTM
For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications.
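As a minimal sketch of the clustering stage that such embedding-based systems rely on, the snippet below groups made-up per-segment speaker embeddings with agglomerative clustering from scikit-learn; a real system would extract i-vectors or neural d-vectors from the audio rather than use random vectors, and might use spectral clustering instead.

```python
# Minimal sketch of the clustering stage of embedding-based diarization.
# The embeddings are random stand-ins for i-vectors / d-vectors.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Pretend we extracted one 64-dim embedding per speech segment (two speakers).
emb_spk_a = rng.normal(loc=0.0, scale=0.1, size=(5, 64))
emb_spk_b = rng.normal(loc=1.0, scale=0.1, size=(5, 64))
segment_embeddings = np.vstack([emb_spk_a, emb_spk_b])

# A distance threshold means the number of speakers need not be known in advance.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=2.0,
    linkage="average",
)
labels = clustering.fit_predict(segment_embeddings)
print(labels)  # e.g. [0 0 0 0 0 1 1 1 1 1]
```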
pyannote.audio: neural building blocks for speaker diarization
We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization.
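If pyannote.audio is installed and a pretrained pipeline is available, typical usage looks roughly like the sketch below; the exact pipeline name ("pyannote/speaker-diarization" is assumed here) and any authentication requirements vary between releases.

```python
# Rough usage sketch for pyannote.audio; pipeline name and auth handling
# depend on the installed version and model access.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")  # may require an access token
diarization = pipeline("audio.wav")  # path to a local audio file (assumed)

# Iterate over diarized speech turns with anonymous speaker labels.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```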
Speech Recognition and Multi-Speaker Diarization of Long Conversations
Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels.
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper.
The Third DIHARD Diarization Challenge
DIHARD III was the third in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variability in recording equipment, noise conditions, and conversational domain.
Speech Emotion Diarization: Which Emotion Appears When?
Speech Emotion Recognition (SER) typically relies on utterance-level solutions.
End-to-End Neural Speaker Diarization with Self-attention
Our method even outperformed the state-of-the-art x-vector clustering-based method.
VoxLingua107: a Dataset for Spoken Language Recognition
Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech.
A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI
Automatic Speech Recognition (ASR) systems are increasingly powerful and more accurate, but also more numerous, with several options currently available as a service (e.g., Google, IBM, and Microsoft).