The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.
638 PAPERS • 18 BENCHMARKS
The Charades dataset is composed of 9,848 videos of daily indoors activities with an average length of 30 seconds, involving interactions with 46 objects classes in 15 types of indoor scenes and containing a vocabulary of 30 verbs leading to 157 action classes. Each video in this dataset is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacting objects. 267 different users were presented with a sentence, which includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence. In total, the dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. In the standard split there are7,986 training video and 1,863 validation video.
240 PAPERS • 4 BENCHMARKS
The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 220,847 videos, with 168,913 in the training set, 24,777 in the validation set and 27,157 in the test set. There are 174 labels.
106 PAPERS • 3 BENCHMARKS
The YouTube-8M dataset is a large scale video dataset, which includes more than 7 million videos with 4716 classes labeled by the annotation system. The dataset consists of three parts: training set, validate set, and test set. In the training set, each class contains at least 100 training videos. Features of these videos are extracted by the state-of-the-art popular pre-trained models and released for public use. Each video contains audio and visual modality. Based on the visual information, videos are divided into 24 topics, such as sports, game, arts & entertainment, etc
104 PAPERS • 2 BENCHMARKS
The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.
82 PAPERS • 2 BENCHMARKS
Contains 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, that consist of additional 82.3 hours of third-person video with 66,500 activity instances.
8 PAPERS • NO BENCHMARKS YET
Whereas the action recognition community has focused mostly on detecting simple actions like clapping, walking or jogging, the detection of fights or in general aggressive behaviors has been comparatively less studied. Such capability may be extremely useful in some video surveillance scenarios like in prisons, psychiatric or elderly centers or even in camera phones. After an analysis of previous approaches we test the well-known Bag-of-Words framework used for action recognition in the specific problem of fight detection, along with two of the best action descriptors currently available: STIP and MoSIFT. For the purpose of evaluation and to foster research on violence detection in video we introduce a new video database containing 1000 sequences divided in two groups: fights and non-fights. Experiments on this database and another one with fights from action movies show that fights can be detected with near 90% accuracy.
1 PAPER • 1 BENCHMARK
Home Action Genome is a large-scale multi-view video database of indoor daily activities. Every activity is captured by synchronized multi-view cameras, including an egocentric view. There are 30 hours of vides with 70 classes of daily activities and 453 classes of atomic actions.
1 PAPER • 1 BENCHMARK
MetaVD is a Meta Video Dataset for enhancing human action recognition datasets. It provides human-annotated relationship labels between action classes across human action recognition datasets. MetaVD is proposed in the following paper: Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi. "MetaVD: A Meta Video Dataset for enhancing human action recognition datasets." Computer Vision and Image Understanding 212 (2021): 103276. [link]
1 PAPER • NO BENCHMARKS YET
Dataset for multimodal skills assessment focusing on assessing piano player’s skill level. Annotations include player's skills level, and song difficulty level. Bounding box annotations around pianists' hands are also provided.
1 PAPER • 3 BENCHMARKS
First of its kind paired win-fail action understanding dataset with samples from the following domains: “General Stunts,” “Internet Wins-Fails,” “Trick Shots,” & “Party Games.” The task is to identify successful and failed attempts at various activities. Unlike existing action recognition datasets, intra-class variation is high making the task challenging, yet feasible.
1 PAPER • 2 BENCHMARKS
This dataset defines a total of 11 crowd motion patterns and it is composed of over 6000 video sequences with an average length of 100 frames per sequence. This documentation presents how to download and process the Crowd-11 dataset.
0 PAPER • NO BENCHMARKS YET