3 dataset results for Audio-Visual Active Speaker Detection

VPCD

VPCD contains multi-modal annotations (face, body and voice) for all primary and secondary characters from a diverse range of TV shows and movies. It is used for evaluating multi-modal person clustering. For each annotated character it provides body tracks, face tracks when the face is visible, and voice tracks when the character is speaking, each with associated features.

7 PAPERS • 1 BENCHMARK

AVA (Atomic Visual Actions)

AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the video clips has been exhaustively annotated by human annotators, and together they represent a rich variety of scenes, recording conditions, and expressions of human activity. There are annotations for:

95 PAPERS • 7 BENCHMARKS

AVA-ActiveSpeaker

AVA-ActiveSpeaker contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, along with the corresponding audio.
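As a rough consistency check, the two size figures quoted for AVA-ActiveSpeaker can be related to each other. The sketch below is only illustrative: it assumes both numbers are rounded and that labeled frames are spread evenly over the face-track duration, and the function name `implied_frame_rate` is a hypothetical helper, not part of any dataset tooling.

```python
# Figures quoted in the AVA-ActiveSpeaker description (both approximate).
LABELED_FRAMES = 3_650_000  # about 3.65 million human-labeled frames
TRACK_HOURS = 38.5          # about 38.5 hours of face tracks


def implied_frame_rate(frames: int, hours: float) -> float:
    """Average number of labeled frames per second of face track."""
    seconds = hours * 3600
    return frames / seconds


rate = implied_frame_rate(LABELED_FRAMES, TRACK_HOURS)
print(f"{rate:.1f} labeled frames per second of face track")  # prints 26.3
```

The result, roughly 26 labeled frames per second, suggests that labels are given at close to video frame rate rather than at sparse keyframes, although the description itself does not state the labeling rate.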

19 PAPERS • 1 BENCHMARK
