1 dataset result for segmentation AND Dense Video Captioning

The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the remaining 2,329 videos have been annotated two or more times.

11 PAPERS • 2 BENCHMARKS
