VoxCeleb2

Introduced by Chung et al. in VoxCeleb2: Deep Speaker Recognition

VoxCeleb2 is a large-scale speaker recognition dataset collected automatically from open-source media. It contains over a million utterances from more than 6,000 speakers. Because the dataset is collected ‘in the wild’, the speech segments are corrupted by real-world noise, including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. Because the dataset is audio-visual, it is also useful for a number of other applications, for example visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa, and training face recognition from video to complement existing face recognition datasets.

Source: VoxCeleb2: Deep Speaker Recognition

Benchmarks

Task	Dataset Variant	Best Model
Talking Head Generation	VoxCeleb2 - 1-shot learning	Fast Bi-layer Avatars
Speech Separation	VoxCeleb2	RTFS-Net-4
Talking Head Generation	VoxCeleb2 - 8-shot learning	CainGAN
Talking Head Generation	VoxCeleb2 - 32-shot learning	Few-shot Adversarial Model
Speaker Verification	VoxCeleb2	ResNet-50