ActivityNet Captions

Introduced by Krishna et al. in Dense-Captioning Events in Videos

The ActivityNet Captions dataset is built on ActivityNet v1.3, which includes 20k untrimmed YouTube videos with 100k caption annotations. The videos are 120 seconds long on average. Most videos contain more than 3 annotated events, each with a start/end time and a human-written sentence; the sentences contain 13.5 words on average. The train/validation/test splits contain 10024/4926/5044 videos, respectively.

Source: Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
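
The description above (events with start/end times and one sentence each) can be made concrete with a small sketch. It assumes each video record holds a "duration", a list of [start, end] "timestamps", and a parallel list of "sentences"; this layout, the video ID, and the captions are illustrative assumptions, not taken from this page, so check the released annotation files before relying on it. The helper name summarize is hypothetical. The last lines only add up the split sizes quoted above.

```python
from statistics import mean

# Hypothetical annotation record. The field names ("duration", "timestamps",
# "sentences") and all values are assumptions made for illustration.
sample = {
    "v_example": {
        "duration": 120.0,
        "timestamps": [[0.0, 30.5], [28.0, 75.0], [80.0, 118.0]],
        "sentences": [
            "A man walks onto a stage.",
            "He begins to play the guitar.",
            "The crowd claps as he bows.",
        ],
    }
}


def summarize(annotations):
    """Flatten per-video records into events and compute simple statistics."""
    # Pair every [start, end] segment with the sentence at the same index.
    events = [
        (video_id, start, end, sentence)
        for video_id, record in annotations.items()
        for (start, end), sentence in zip(record["timestamps"], record["sentences"])
    ]
    return {
        "videos": len(annotations),
        "events": len(events),
        "events_per_video": len(events) / len(annotations),
        # Word count by whitespace splitting; punctuation stays attached.
        "mean_words": mean(len(s.split()) for _, _, _, s in events),
        "mean_event_seconds": mean(end - start for _, start, end, _ in events),
    }


stats = summarize(sample)

# Split sizes as quoted in the description above.
split_sizes = {"train": 10024, "validation": 4926, "test": 5044}
total_videos = sum(split_sizes.values())

print(stats)
print(total_videos)
```

The split sizes sum to 19994 videos, which is consistent with the rounded "20k" figure in the description. Events may overlap in time, as in the first two segments of the hypothetical record, so summed event durations are not a reliable measure of how much of a video is covered.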

Benchmarks

Task	Dataset Variant	Best Model
Dense Video Captioning	ActivityNet Captions	Vid2Seq
Natural Language Moment Retrieval	ActivityNet Captions	GVL
Video Captioning	ActivityNet Captions	VideoCoCa
Temporal Action Proposal Generation	ActivityNet Captions	BMT
Partially Relevant Video Retrieval	ActivityNet Captions	ms-sl