The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of video collected from YouTube. ActivityNet is the largest benchmark for temporal activity detection to date in terms of both the number of activity categories and the number of videos, making the task particularly challenging. Version 1.3 of the dataset contains 19,994 untrimmed videos in total, divided into three disjoint subsets (training, validation, and testing) in a 2:1:1 ratio. On average, each activity category has 137 untrimmed videos, and each video has 1.41 activity instances annotated with temporal boundaries. The ground-truth annotations of the test videos are not public.
693 PAPERS • 19 BENCHMARKS
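A minimal sketch of reproducing the per-video statistics above from an ActivityNet-style annotation file. The JSON layout ("database" mapping each video id to its "subset" and a list of "annotations", each with a "label" and a [start, end] "segment" in seconds) follows the public v1.3 release, but the file name is a placeholder; verify both against your copy.

```python
import json
from collections import Counter

# Load the ActivityNet-v1.3-style annotation file (name assumed).
with open("activity_net.v1-3.min.json") as f:
    database = json.load(f)["database"]

# Instances per video; test videos carry an empty "annotations" list,
# since their ground truth is not public.
n_videos = len(database)
n_instances = sum(len(v["annotations"]) for v in database.values())
print(f"{n_videos} videos, {n_instances / n_videos:.2f} instances per video")

# Subset sizes should roughly reflect the 2:1:1 train/val/test ratio.
print(Counter(v["subset"] for v in database.values()))
```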
The THUMOS14 (THUMOS 2014) dataset is a large-scale video dataset comprising 1,010 validation videos and 1,574 test videos from 20 classes. Of these, 220 validation videos and 212 test videos carry temporal annotations.
290 PAPERS • 20 BENCHMARKS
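A minimal sketch of loading THUMOS14's temporal annotations, assuming the commonly distributed layout: one text file per class (e.g. "BaseballPitch_val.txt"), each line holding a video id followed by the start and end time of one instance in seconds. The directory name, file-name pattern, and column order are assumptions to verify against your download.

```python
from pathlib import Path

def load_annotations(ann_dir: str) -> dict[str, list[tuple[str, float, float]]]:
    """Map each video id to its (label, start, end) instances."""
    segments: dict[str, list[tuple[str, float, float]]] = {}
    for path in Path(ann_dir).glob("*_val.txt"):
        label = path.stem.removesuffix("_val")
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            video_id, start, end = line.split()
            segments.setdefault(video_id, []).append((label, float(start), float(end)))
    return segments

# Directory name is a placeholder for the official annotation folder.
anns = load_annotations("TH14_Temporal_annotations_validation/annotation")
print(len(anns), "validation videos have temporal labels")  # expect 220
```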
The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities, such as making a sandwich, tea, or coffee. Each activity is performed by four different people, for a total of 28 videos. Each video lasts approximately one minute and contains about 20 fine-grained action instances, such as take bread or pour ketchup.
105 PAPERS • 2 BENCHMARKS
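A minimal sketch of turning per-frame action labels into (action, start_frame, end_frame) instances, the form in which GTEA's ~20 fine-grained actions per video are typically evaluated. The input is assumed to be one label per frame (e.g. parsed from the dataset's label files); "take_bread" and "pour_ketchup" are illustrative names, not the dataset's exact label strings.

```python
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse consecutive identical frame labels into segments."""
    segments, t = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        segments.append((label, t, t + n - 1))
        t += n
    return segments

labels = ["background"] * 30 + ["take_bread"] * 45 + ["pour_ketchup"] * 60
print(frames_to_segments(labels))
# [('background', 0, 29), ('take_bread', 30, 74), ('pour_ketchup', 75, 134)]
```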
FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges for temporal action localization, thanks to its distinctive combination of fine-grained action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes.
15 PAPERS • 3 BENCHMARKS
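A minimal sketch of measuring the annotation density and class co-occurrence that make FineAction challenging. It assumes an ActivityNet-style JSON layout ("database" mapping video ids to "annotations" with a "label" and a [start, end] "segment"); the file name is a placeholder, so adapt both to the official FineAction release.

```python
import json

with open("fineaction_annotations.json") as f:  # placeholder name
    database = json.load(f)["database"]

def overlaps(a, b):
    """True if two [start, end] segments intersect in time."""
    return a[0] < b[1] and b[0] < a[1]

densities, co_occurring = [], 0
for video in database.values():
    anns = video["annotations"]
    densities.append(len(anns))
    # Count videos where instances of *different* classes overlap in time.
    if any(overlaps(x["segment"], y["segment"]) and x["label"] != y["label"]
           for i, x in enumerate(anns) for y in anns[i + 1:]):
        co_occurring += 1

print(f"{sum(densities) / len(densities):.1f} instances per video on average")
print(f"{co_occurring} videos contain co-occurring actions of different classes")
```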
The BEOID dataset includes object interactions ranging from preparing coffee to operating a weight-lifting machine and opening a door. The dataset was recorded at six locations: a kitchen, a workspace, a laser printer, a corridor with a locked door, a cardiac gym, and a weight-lifting machine. For the first four locations, sequences from five different operators were recorded (two sequences per operator); for the last two, sequences from three operators were recorded (three sequences per operator). The dataset was captured with a wearable gaze tracker (ASL Mobile Eye XG), and synchronized wide-lens video with calibrated 2D gaze fixations is available. In addition, 3D information is released using a pre-built point cloud map and PTAM tracking; three-dimensional information for both the scene and the gaze fixations is included.
5 PAPERS • 1 BENCHMARK