VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
604 PAPERS • 9 BENCHMARKS
VoxCeleb2 is a large scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances from over 6k speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. The dataset is audio-visual, so is also useful for a number of other applications, for example – visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa and training face recognition from video to complement existing face recognition datasets.
481 PAPERS • 5 BENCHMARKS
Consists of more than 210k videos for 310 audio classes.
146 PAPERS • 3 BENCHMARKS
CN-Celeb is a large-scale speaker recognition dataset collected `in the wild'. This dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different genres in real world.
62 PAPERS • 1 BENCHMARK
The CALLHOME English Corpus is a collection of unscripted telephone conversations between native speakers of English. Here are the key details:
11 PAPERS • 9 BENCHMARKS
The EVI dataset is a challenging, multilingual spoken-dialogue dataset with 5,506 dialogues in English, Polish, and French. The dataset can be used to develop and benchmark conversational systems for user authentication tasks, i.e. speaker enrolment (E), speaker verification (V), speaker identification (I).
1 PAPER • 3 BENCHMARKS