The ActivityNet Captions dataset is built on ActivityNet v1.3, which includes 20k untrimmed YouTube videos with 100k caption annotations. The videos are 120 seconds long on average. Most videos contain more than 3 annotated events, each with start/end times and a human-written sentence; the sentences contain 13.5 words on average. The train/validation/test splits contain 10,024/4,926/5,044 videos, respectively.
222 PAPERS • 5 BENCHMARKS
Charades-STA is a dataset built on top of Charades by adding sentence-level temporal annotations.
185 PAPERS • 4 BENCHMARKS