…Each video is labelled with 3.91 step segments on average, with each segment lasting 14.91 seconds on average. In total, the dataset contains 476 hours of video, with 46,354 annotated segments.
78 PAPERS • 2 BENCHMARKS
EPIC-SOUNDS includes 78.4k categorised and 39.2k non-categorised segments of audible events and actions, distributed across 44 classes.
7 PAPERS • 2 BENCHMARKS
…Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses. We benchmark on three action understanding tasks: recognition, anticipation and temporal segmentation. Additionally, we propose a novel task of detecting mistakes.
38 PAPERS • 4 BENCHMARKS
…The dataset offers high-quality, pixel-level segmentations of hands, the possibility to semantically distinguish between the observer’s hands and someone else’s hands, as well as left and right hands.
30 PAPERS • NO BENCHMARKS YET
…Each video is split into short action segments (mean duration is 3.7s) with specific start and end times and a verb and noun annotation describing the action (e.g. ‘open fridge’).
35 PAPERS • 3 BENCHMARKS
…More specifically, the dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, and 30K video sequences for the fine-grained…
14 PAPERS • 2 BENCHMARKS
…Each case comprises kinematic data, a video, a semantic segmentation of each frame, and a workflow annotation.
3 PAPERS • 6 BENCHMARKS
…Video acquisition: 1920x1080 at 12.00 fps. 11 training videos and 9 validation/test videos. 8,857 video segments temporally annotated, indicating the verbs which describe the actions performed. 64,349 active…
14 PAPERS • 3 BENCHMARKS
…We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking.
1 PAPER • NO BENCHMARKS YET
…EPIC-KITCHENS-55), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments)…
135 PAPERS • 7 BENCHMARKS
…AVA Speech densely annotates audio-based speech activity in AVA v1.0 videos, and explicitly labels 3 background noise conditions, resulting in ~46K labeled segments spanning 45 hours of data.
94 PAPERS • 7 BENCHMARKS