VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
604 PAPERS • 9 BENCHMARKS
Consists of more than 210k videos for 310 audio classes.
145 PAPERS • 3 BENCHMARKS
CN-Celeb is a large-scale speaker recognition dataset collected `in the wild'. This dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different genres in real world.
62 PAPERS • 1 BENCHMARK
The football keyword dataset (FKD), as a new keyword spotting dataset in Persian, is collected with crowdsourcing. This dataset contains nearly 31000 samples in 18 classes.
2 PAPERS • 2 BENCHMARKS
A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset, including 180 hours of Mandarin Chinese dialogue, 150, 10 and 20 hours for the training set, development set and test set respectively. It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered around one theme.
1 PAPER • NO BENCHMARKS YET
MAVS is an audio-visual smartphone dataset captured in five different recent smartphones. This new dataset contains 103 subjects captured in three different sessions considering the different real-world scenarios. Three different languages are acquired in this dataset to include the problem of language dependency of the speaker recognition systems.