…The segments vary in length from 3 to 10 seconds, and in each clip the only visible face in the video and the only audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4,700 hours of video segments from approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
35 PAPERS • NO BENCHMARKS YET
…(1) wikiann · Datasets at Hugging Face. https://huggingface.co/datasets/wikiann. (2) wikiann | TensorFlow Datasets. https://tensorflow.google.cn/datasets/catalog/wikiann. (3) wikiann · Datasets at Hugging Face. https://huggingface.co/datasets/wikiann/viewer/en. (4) WikiAnn Dataset | Papers With Code. https://paperswithcode.com/dataset/wikiann-1.
58 PAPERS • 3 BENCHMARKS