The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 220,847 videos, with 168,913 in the training set, 24,777 in the validation set and 27,157 in the test set. There are 174 labels.
242 PAPERS • 8 BENCHMARKS
The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.
115 PAPERS • 3 BENCHMARKS
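As a quick consistency check on the two Something-Something entries above, the split sizes quoted in each description can be summed and compared with the stated totals. This is a minimal sketch; the dictionary keys and the helper name `split_fractions` are illustrative choices, not part of either dataset's tooling.

```python
# Split sizes copied from the two descriptions above.
# The dictionary keys are hypothetical names chosen for this sketch.
SPLITS = {
    "something-something-v1": {"train": 86_017, "validation": 11_522, "test": 10_960},
    "something-something-v2": {"train": 168_913, "validation": 24_777, "test": 27_157},
}
REPORTED_TOTALS = {
    "something-something-v1": 108_499,
    "something-something-v2": 220_847,
}


def split_fractions(splits):
    """Return each split's share of the total number of videos."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


for name, splits in SPLITS.items():
    # The per-split counts should add up to the reported total.
    assert sum(splits.values()) == REPORTED_TOTALS[name]
    shares = ", ".join(f"{k}: {v:.1%}" for k, v in split_fractions(splits).items())
    print(f"{name}: {shares}")
```

Both versions pass the check, so the per-split counts agree with the totals given in the text.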
The Penn Action Dataset contains 2326 video sequences of 15 different actions and human joint annotations for each sequence.
102 PAPERS • 4 BENCHMARKS
Volleyball is a video action recognition dataset. It has 4830 annotated frames that were handpicked from 55 videos with 9 player action labels and 8 team activity labels. It contains group activity annotations as well as individual activity annotations.
74 PAPERS • 3 BENCHMARKS
The UTD-MHAD dataset consists of 27 different actions performed by 8 subjects. Each subject repeated each action 4 times; after three corrupted sequences were removed, this resulted in 861 action sequences in total. RGB, depth, skeleton and inertial sensor signals were recorded.
55 PAPERS • 2 BENCHMARKS
The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with their apparent emotions. The images are annotated with an extended list of 26 emotion categories combined with the three common continuous dimensions Valence, Arousal and Dominance.
30 PAPERS • 6 BENCHMARKS
The EgoGesture dataset contains 2,081 RGB-D videos, 24,161 gesture samples and 2,953,224 frames from 50 distinct subjects.
30 PAPERS • 2 BENCHMARKS
The EgoHands dataset contains 48 Google Glass videos of complex, first-person interactions between two people. The main intention of this dataset is to enable better, data-driven approaches to understanding hands in first-person computer vision.
30 PAPERS • NO BENCHMARKS YET
CholecT50 is a dataset of endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with triplet information in the form of <instrument, verb, target>. The dataset is a collection of 50 videos consisting of 45 videos from the Cholec80 dataset and 5 videos from an in-house dataset of the same surgical procedure.
20 PAPERS • 7 BENCHMARKS
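The `<instrument, verb, target>` triplet format described for CholecT50 above maps naturally onto a small record type. The sketch below is only an illustration of that format: the `Triplet` class, the `parse_triplet` helper, and the example values are assumptions made here and do not reflect the dataset's actual annotation files or class list.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One action label in <instrument, verb, target> form (hypothetical type)."""
    instrument: str
    verb: str
    target: str


def parse_triplet(text: str) -> Triplet:
    """Parse a string such as '<instrument, verb, target>' into a Triplet."""
    inner = text.strip().strip("<>")
    parts = [part.strip() for part in inner.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields, got {len(parts)}: {text!r}")
    return Triplet(*parts)


# Example values are illustrative and not taken from the description above.
label = parse_triplet("<grasper, retract, gallbladder>")
print(label.instrument, label.verb, label.target)
```

Keeping the three components as separate fields makes it straightforward to evaluate instrument, verb, and target recognition individually as well as jointly.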
The Biased Action Recognition (BAR) dataset is a real-world image dataset divided into six action classes, each biased toward a distinct place. The authors settled on these six action classes by inspecting imSitu, which provides still action images from Google Image Search with action and place labels. In detail, the authors chose action classes whose images share common place characteristics. At the same time, the place characteristics of the candidate action classes had to be distinct, so that an action could be classified from place attributes alone. The selected pairs are six typical action-place pairs: (Climbing, RockWall), (Diving, Underwater), (Fishing, WaterSurface), (Racing, APavedTrack), (Throwing, PlayingField), and (Vaulting, Sky).
19 PAPERS • 1 BENCHMARK
This dataset, collected at the University of Florence in 2012, was captured using a Kinect camera. It includes 9 activities: wave, drink from a bottle, answer phone, clap, tie lace, sit down, stand up, read watch, bow. During acquisition, 10 subjects were asked to perform the above actions two or three times. This resulted in a total of 215 activity samples.
18 PAPERS • 1 BENCHMARK
Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footage used in the dataset records different times of the day in an extensive range of environments containing variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, the dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.
14 PAPERS • 2 BENCHMARKS
CholecT45 is a subset of CholecT50 consisting of 45 videos from the Cholec80 dataset. It is the first public release of part of the CholecT50 dataset. CholecT50 is a dataset of 50 endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with 100 triplet classes in the form of <instrument, verb, target>.
13 PAPERS • 2 BENCHMARKS
The Watch-n-Patch dataset was created with a focus on modeling human activities that comprise multiple actions, in a completely unsupervised setting. It was collected with a Microsoft Kinect One sensor, with a total length of about 230 minutes divided into 458 videos. Seven subjects perform daily activities in 8 offices and 5 kitchens with complex backgrounds. Skeleton data are also provided as ground truth annotations.
12 PAPERS • NO BENCHMARKS YET
The TUM Kitchen dataset is an action recognition dataset that contains 20 video sequences captured by 4 cameras with overlapping views. The camera network captures the scene from four viewpoints at 25 fps, and every RGB frame has a resolution of 384×288 pixels. The action labels are frame-wise and are provided separately for the left arm, the right arm and the torso.
7 PAPERS • NO BENCHMARKS YET
Simitate is a hybrid benchmarking suite targeting the evaluation of approaches for imitation learning. It consists of a dataset containing 1938 sequences in which humans perform daily activities in a realistic environment. The dataset is strongly coupled with an integration into a simulator. RGB and depth streams with a resolution of 960×540 at 30 Hz are provided, along with accurate ground truth poses for the demonstrator's hand and the object in 6 DOF at 120 Hz. A 3D model of the environment and labelled object images are also provided with the dataset.
5 PAPERS • NO BENCHMARKS YET
Verse is a new dataset that augments existing multimodal datasets (COCO and TUHOI) with sense labels.
4 PAPERS • NO BENCHMARKS YET
The PETRAW data set is composed of 150 sequences of peg transfer training sessions. The objective of a peg transfer session is to transfer 6 blocks from the left to the right and back. Each block must be extracted from a peg with one hand, transferred to the other hand, and inserted in a peg at the other side of the board. All cases were acquired by a non-medical expert at the LTSI Laboratory of the University of Rennes. The data set is divided into a training set of 90 cases and a test set of 60 cases. Each case is composed of kinematic data, a video, a semantic segmentation of each frame, and workflow annotation.
3 PAPERS • 6 BENCHMARKS
UAV-GESTURE is a dataset for UAV control and gesture recognition. It is an outdoor recorded video dataset for UAV commanding signals with 13 gestures suitable for basic UAV navigation and command from general aircraft handling and helicopter handling signals. It contains 119 high-definition video clips consisting of 37,151 frames.
3 PAPERS • NO BENCHMARKS YET
Website: https://asankagp.github.io/droneaction/
2 PAPERS • 1 BENCHMARK
Largest, first-of-its-kind, in-the-wild, fine-grained workout/exercise posture analysis dataset, covering three different exercises: BackSquat, Barbell Row, and Overhead Press. Seven different types of exercise errors are covered. Unlabeled data is also provided to facilitate self-supervised learning.
2 PAPERS • NO BENCHMARKS YET
RISE is a large-scale video dataset for Recognizing Industrial Smoke Emissions. A citizen science approach was adopted to collaborate with local community members to annotate whether a video clip has smoke emissions. The dataset contains 12,567 clips from 19 distinct views from cameras that monitored three industrial facilities. These daytime clips span 30 days over two years, including all four seasons.
ANUBIS is a large-scale human skeleton dataset containing 80 actions. Compared with previously collected datasets, ANUBIS is advantageous in the following four aspects: (1) employing more recently released sensors; (2) containing a novel back view; (3) encouraging high enthusiasm of subjects; (4) including actions of the COVID pandemic era.
1 PAPER • NO BENCHMARKS YET
Existing image/video datasets for cattle behavior recognition are mostly small, lack well-defined labels, or are collected in unrealistic controlled environments. This limits the utility of machine learning (ML) models learned from them. Therefore, we introduce a new dataset, called Cattle Visual Behaviors (CVB), that consists of 502 video clips, each fifteen seconds long, captured in natural lighting conditions, and annotated with eleven visually perceptible behaviors of grazing cattle. By creating and sharing CVB, our aim is to develop improved models capable of recognizing all important cattle behaviors accurately and to assist other researchers and practitioners in developing and evaluating new ML models for cattle behavior classification using video data. The dataset is presented in the form of the following three sub-directories. 1. raw_frames: contains 450 frames in each sub-folder, representing a 15-second video taken at a frame rate of 30 FPS. 2. annotations: contains the json file
Understanding comprehensive assembly knowledge from videos is critical for futuristic ultra-intelligent industry. To enable technological breakthrough, we present HA-ViD, an assembly video dataset that features representative industrial assembly scenarios, a natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and breaks action annotations down into subject, action verb, manipulated object, target object, and tool. We provide 3222 multi-view and multi-modality videos, 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance and the further reasoning steps for comprehending knowledge in assembly progress, process effici
MPOSE2021 is a dataset for real-time, short-time human action recognition (HAR), suitable for both pose-based and RGB-based methodologies. It includes 15,429 sequences from 100 actors and different scenarios, with a limited number of frames per scene (between 20 and 30). In contrast to other publicly available datasets, the constrained number of time steps encourages the development of real-time methodologies that perform HAR with low latency and high throughput.