🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task

Filter by Language

50 dataset results for Actions

Breakfast (The Breakfast Actions Dataset)

The Breakfast Actions Dataset comprises of 10 actions related to breakfast preparation, performed by 52 different individuals in 18 different kitchens. The dataset is one of the largest fully annotated datasets available. The actions are recorded “in the wild” as opposed to a single controlled lab environment. It consists of over 77 hours of video recordings.

151 PAPERS • 5 BENCHMARKS

NExT-QA

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal and temporal actions and understand the rich object interactions in daily activities. It supports both multi-choice and open-ended QA tasks. The videos are untrimmed and the questions usually invoke local video contents for answers.

74 PAPERS • 3 BENCHMARKS

SBU (SBU-Kinect-Interaction dataset v2.0)

SBU-Kinect-Interaction dataset version 2.0 comprises of RGB-D video sequences of humans performing interaction activities that are recording using the Microsoft Kinect sensor. This dataset was originally recorded for a class project, and it must be used only for the purposes of research. If you use this dataset in your work, please cite the following paper. Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, and Dimitris Samaras, The 2nd International Workshop on Human Activity Understanding from 3D Data at Conference on Computer Vision and Pattern Recognition (HAU3D-CVPRW), CVPR 2012

71 PAPERS • 4 BENCHMARKS

BEAT (Body-Expression-Audio-Text)

BEAT has i) 76 hours, high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, ii) 32 millions frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with \textit{facial expressions}, \textit{emotions}, and \textit{semantics}, in addition to the known correlation with \textit{audio}, \textit{text}, and \textit{speaker identity}. Based on this observation, we propose a baseline model, \textbf{Ca}scaded \textbf{M}otion \textbf{N}etwork \textbf{(CaMN)}, which consists of above six modalities modeled in a cascaded architecture for gesture synthesis. To evaluate the semantic relevancy, we introduce a metric, Semantic Relevance Gesture Recall (\textbf{SRGR}). Qualitative and quantitative experiments demonstrate metrics' validness, ground truth data quality, and baseline's state-of-the-art performance. To the best of our knowledge,

37 PAPERS • 1 BENCHMARK

KIT Motion-Language

The KIT Motion-Language is a dataset linking human motion and natural language.

32 PAPERS • 2 BENCHMARKS

V-D4RL

V-D4RL provides pixel-based analogues of the popular D4RL benchmarking tasks, derived from the dm_control suite, along with natural extensions of two state-of-the-art online pixel-based continuous control algorithms, DrQ-v2 and DreamerV2, to the offline setting.

9 PAPERS • NO BENCHMARKS YET

Atari-HEAD

Atari-HEAD is a dataset of human actions and eye movements recorded while playing Atari videos games. For every game frame, its corresponding image frame, the human keystroke action, the reaction time to make that action, the gaze positions, and immediate reward returned by the environment were recorded. The gaze data was recorded using an EyeLink 1000 eye tracker at 1000Hz. The human subjects are amateur players who are familiar with the games. The human subjects were only allowed to play for 15 minutes and were required to rest for at least 15 minutes before the next trial. Data was collected from 4 subjects, 16 games, 175 15-minute trials, and a total of 2.97 million frames/demonstrations.

7 PAPERS • NO BENCHMARKS YET

UI-PRMD (University of Idaho – Physical Rehabilitation Movement Dataset)

UI-PRMD is a data set of movements related to common exercises performed by patients in physical therapy and rehabilitation programs. The data set consists of 10 rehabilitation exercises. A sample of 10 healthy individuals repeated each exercise 10 times in front of two sensory systems for motion capturing: a Vicon optical tracker, and a Kinect camera. The data is presented as positions and angles of the body joints in the skeletal models provided by the Vicon and Kinect mocap systems.

7 PAPERS • 2 BENCHMARKS

BRACE (The Breakdancing Competition Dataset for Dance Motion Synthesis)

BRACE is a dataset for audio-conditioned dance motion synthesis challenging common assumptions for this task:

5 PAPERS • 2 BENCHMARKS

Sims4Action

The Sims4Action Dataset: a videogame-based dataset for Synthetic→Real domain adaptation for human activity recognition.

5 PAPERS • NO BENCHMARKS YET

DoMSEV (Dataset of Multimodal Semantic Egocentric Video)

The Dataset of Multimodal Semantic Egocentric Video (DoMSEV) contains 80-hours of multimodal (RGB-D, IMU, and GPS) data related to First-Person Videos with annotations for recorder profile, frame scene, activities, interaction, and attention.

4 PAPERS • NO BENCHMARKS YET

HowTo100M Adverbs

HowTo100M Adverbs is a subset from HowTo100M with mined adverbs from 83 tasks in HowTo100M. The annotations were obtained from automatically transcribed narrations of instructional videos. The dataset contains originally 5,824 clips annotated with action-adverb pairs from 72 verbs and 6 adverbs. Source: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs

4 PAPERS • 1 BENCHMARK

eSports Sensors Dataset

The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor data collected includes:

4 PAPERS • 2 BENCHMARKS

ActivityNet Adverbs

ActivityNet Adverbs is a subset from the ActivityNet dataset with extracted verb-adverb annotations. ActivityNet Adverbs contains 20 adverbs appearing across 114 actions, forming 643 unique action-adverb pairs in 3,099 video clips.

3 PAPERS • 2 BENCHMARKS

MSR-VTT Adverbs

MSR-VTT Adverbs is a subset from MSR-VTT with extracted verb-adverb annotations. MSR-VTT Adverbs contains 18 adverbs appearing across 106 actions, forming 464 unique action-adverb pairs in 1,824 video clips.

3 PAPERS • 2 BENCHMARKS

Motion Policy Networks

This dataset contains a large set (~3.2 Million) of high quality expert trajectories generated from a geometrically consist hybrid planner in a wide variety of environment (~575,000 environments). We created this dataset to explore the capabilities of neural networks to learn complex robotic motion, mimicking a traditional planner.

3 PAPERS • NO BENCHMARKS YET

OREBA (Objectively Recognizing Eating Behavior and Associated Intake)

The OREBA dataset aims to provide a comprehensive multi-sensor recording of communal intake occasions for researchers interested in automatic detection of intake gestures. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consists of synchronized frontal video and IMU with accelerometer and gyroscope for both hands.

3 PAPERS • NO BENCHMARKS YET

SportsPose

SportsPose (SportsPose - A Dynamic 3D sports pose dataset)

Accurate 3D human pose estimation is essential for sports analytics, coaching, and injury prevention. However, existing datasets for monocular pose estimation do not adequately capture the challenging and dynamic nature of sports movements. In response, we introduce SportsPose, a large-scale 3D human pose dataset consisting of highly dynamic sports movements. With more than 176,000 3D poses from 24 different subjects performing 5 different sports activities, SportsPose provides a diverse and comprehensive set of 3D poses that reflect the complex and dynamic nature of sports movements. Contrary to other markerless datasets we have quantitatively evaluated the precision of SportsPose by comparing our poses with a commercial marker-based system and achieve a mean error of 34.5 mm across all evaluation sequences. This is comparable to the error reported on the commonly used 3DPW dataset. We further introduce a new metric, local movement, which describes the movement of the wrist and ankle

3 PAPERS • NO BENCHMARKS YET

VATEX Adverbs

VATEX Adverbs is a subset from VATEX with extracted verb-adverb annotations. VATEX Adverbs contains 34 adverbs appearing across 135 actions, forming 1,550 unique action-adverb pairs in 14,617 video clips.

3 PAPERS • 2 BENCHMARKS

3DYoga90 (3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding)

3DYoga90 is organized within a three-level label hierarchy. It stands out as one of the most comprehensive open datasets, featuring the largest collection of RGB videos and 3D skeleton sequences among publicly available resources.

2 PAPERS • NO BENCHMARKS YET

RLU

RLU (RL Unplugged)

RL Unplugged is suite of benchmarks for offline reinforcement learning. The RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established. This is a dataset accompanying the paper RL Unplugged: Benchmarks for Offline Reinforcement Learning.

2 PAPERS • NO BENCHMARKS YET

StarData

StarData is a StarCraft: Brood War replay dataset, with 65,646 games. The full dataset after compression is 365 GB, 1535 million frames, and 496 million player actions. The entire frame data was dumped out at 8 frames per second.

2 PAPERS • NO BENCHMARKS YET

TI1K Dataset (Thumb Index 1000 Hand & Fingertip Detection Dataset)

Thumb Index 1000 (TI1K) is a dataset of 1000 hand images with the hand bounding box, and thumb and index fingertip positions. The dataset includes the natural movement of the thumb and index fingers making it suitable for mixed reality (MR) applications.

2 PAPERS • NO BENCHMARKS YET

UESTC-MMEA-CL (A multi-modal egocentric activity dataset for continual learning)

UESTC-MMEA-CL is a new multi-modal activity dataset for continual egocentric activity recognition, which is proposed to promote future studies on continual learning for first-person activity recognition in wearable applications. Our dataset provides not only vision data with auxiliary inertial sensor data but also comprehensive and complex daily activity categories for the purpose of continual learning research. UESTC-MMEA-CL comprises 30.4 hours of fully synchronized first-person video clips, acceleration stream and gyroscope data in total. There are 32 activity classes in the dataset and each class contains approximately 200 samples. We divide the samples of each class into the training set, validation set and test set according to the ratio of 7:2:1. For the continual learning evaluation, we present three settings of incremental steps, i.e., the 32 classes are divided into {16, 8, 4} incremental steps and each step contains {2, 4, 8} activity classes, respectively.

2 PAPERS • NO BENCHMARKS YET

VGGSound-Sparse

The dataset uses VGG-Sound which consists of 10s clips collected from YouTube for 309 sound classes. A subset of ‘temporally sparse’ classes is selected using the following procedure: 5–15 videos are randomly picked from each of the 309 VGGSound classes, and manually annotated as to whether audio-visual cues are only sparsely available. As a result, 12 classes are selected (∼4 %) or 6.5k and 0.6k videos in the train and test sets, respectively. The classes include 'dog barking', 'chopping wood', 'lion roaring', 'skateboarding' etc.

2 PAPERS • NO BENCHMARKS YET

WebLINX (Real-World Website Navigation with Multi-Turn)

WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.

2 PAPERS • 1 BENCHMARK

maze-dataset

This package provides utilities for generation, filtering, solving, visualizing, and processing of mazes for training ML systems. Primarily built for the maze-transformer interpretability project. You can find our paper on it here: http://arxiv.org/abs/2309.10498

2 PAPERS • NO BENCHMARKS YET

Baxter-UR5_95-Objects

In this dataset two robots, Baxter and UR5, perform 8 behaviors (look, grasp, pick, hold, shake, lower, drop, and push) on 95 objects that vary by 5 color (blue, green, red, white, and yellow), 6 contents (wooden button, plastic dices, glass marbles, nuts & bolts, pasta, and rice), and 4 weights (empty, 50g, 100g, and 150g). There are 90 objects with contents (5 colors x 3 weights x 6 contents) and 5 objects without any content that only vary by 5 colors. Both robots perform 5 trials on each object, resulting in 7,600 interactions (2 robots x 8 behaviors x 95 objects x 5 trials

1 PAPER • NO BENCHMARKS YET

Bus Trajectory Dataset

This dataset contains the bus trajectory dataset collected by 6 volunteers who were asked to travel across the sub-urban city of Durgapur, India, on intra-city buses (route name: 54 Feet). During the travel, the volunteers captured sensor logs through an Android application installed on COTS smartphones.

1 PAPER • NO BENCHMARKS YET

CP2A dataset (CARLA Pedestrian Action Anticipation dataset)

We present a new simulated dataset for pedestrian action anticipation collected using the CARLA simulator. To generate this dataset, we place a camera sensor on the ego-vehicle in the Carla environment and set the parameters to those of the camera used to record the PIE dataset (i.e., 1920x1080, 110° FOV). Then, we compute bounding boxes for each pedestrian interacting with the ego vehicle as seen through the camera's field of view. We generated the data in two urban environments available in the CARLA simulator: Town02 and Town03.

1 PAPER • NO BENCHMARKS YET

CVB (Video Dataset of Cattle Visual Behaviors)

Existing image/video datasets for cattle behavior recognition are mostly small, lack well-defined labels, or are collected in unrealistic controlled environments. This limits the utility of machine learning (ML) models learned from them. Therefore, we introduce a new dataset, called Cattle Visual Behaviors (CVB), that consists of 502 video clips, each fifteen seconds long, captured in natural lighting conditions, and annotated with eleven visually perceptible behaviors of grazing cattle. By creating and sharing CVB, our aim is to develop improved models capable of recognizing all important cattle behaviors accurately and to assist other researchers and practitioners in developing and evaluating new ML models for cattle behavior classification using video data. The dataset is presented in the form of following three sub-directories. 1. raw_frames: contains 450 frames in each sub folder representing a 15 second video taken at a frame rate of 30 FPS. 2. annotations: contains the json file

1 PAPER • NO BENCHMARKS YET

CY101 Dataset

In this dataset an uppertorso humanoid robot with 7-DOF arm explored 100 different objects belonging to 20 different categories using 10 behaviors: Look, Crush, Grasp, Hold, Lift, Drop, Poke, Push, Shake and Tap.

1 PAPER • NO BENCHMARKS YET

Capture-24

This dataset contains Axivity AX3 wrist-worn activity tracker data that were collected from 151 participants in 2014-2016 around the Oxfordshire area. Participants were asked to wear the device in daily living for a period of roughly 24 hours, amounting to a total of almost 4,000 hours. Vicon Autograph wearable cameras and Whitehall II sleep diaries were used to obtain the ground truth activities performed during the period (e.g. sitting watching TV, walking the dog, washing dishes, sleeping), resulting in more than 2,500 hours of labelled data. Accompanying code to analyse this data is available at https://github.com/activityMonitoring/capture24. The following papers describe the data collection protocol in full: i.) Gershuny J, Harms T, Doherty A, Thomas E, Milton K, Kelly P, Foster C (2020) Testing self-report time-use diaries against objective instruments in real time. Sociological Methodology doi: 10.1177/0081175019884591; ii.) Willetts M, Hollowell S, Aslett L, Holmes C, Doherty

1 PAPER • NO BENCHMARKS YET

HA-ViD (HA-ViD: A Human Assembly Video Dataset)

Understanding comprehensive assembly knowledge from videos is critical for futuristic ultra-intelligent industry. To enable technological breakthrough, we present HA-ViD – an assembly video dataset that features representative industrial assembly scenarios, natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and granulate action annotations to subject, action verb, manipulated object, target object, and tool. We provide 3222 multi-view and multi-modality videos, 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance and the further reasoning steps for comprehending knowledge in assembly progress, process effici

1 PAPER • NO BENCHMARKS YET

LARa

LARa (Logistic Activity Recognition Challenge)

LARa is the first freely accessible logistics-dataset for human activity recognition. In the ’Innovationlab Hybrid Services in Logistics’ at TU Dortmund University, two picking and one packing scenarios with 14 subjects were recorded using OMoCap, IMUs, and an RGB camera. 758 minutes of recordings were labeled by 12 annotators in 474 person-hours. The subsequent revision was carried out by 4 revisers in 143 person-hours. All the given data have been labeled and categorised into 8 activity classes and 19 binary coarse-semantic descriptions, also called attributes.

1 PAPER • NO BENCHMARKS YET

LEARNING STYLE IDENTIFICATION

LEARNING STYLE IDENTIFICATION (Learning Style Identification Using Semi-Supervised Self-Taught Labeling)

The dataset was collected from two courses offered on the University of Jordan's E-learning Portal during the second semester of 2020, namely "Computer Skills for Humanities Students" (CSHS) and "Computer Skills for Medical Students" (CSMS). Over the sixteen-week duration of each course, students participated in various activities such as reading materials, video lectures, assignments, and quizzes. To preserve student privacy, the log activity of each student was anonymized. Data was aggregated from multiple sources, including the Moodle learning management system and the student information system, and consolidated into a single database. The dataset contains information on the number of learners and events for each course, as well as their launch and end dates. CSHS had 1749 learners and 1,139,810 events from January 21, 2020 to May 20, 2020, while CSMS had 564 learners and 484,410 events during the same period. The dataset is based on the Filder and Silverman learning style model (F

1 PAPER • NO BENCHMARKS YET

MiniWob++

MiniWob++ is a suite of web-browser based tasks introduced in Liu et al. (2018) (an extension of the earlier MiniWob task suite (Shi et al., 2017)). Tasks range from simple button clicking to complex form-filling, for example, to book a flight when given particular instructions (Fig. 1a). Programmatic rewards are available for each task, permitting standard reinforcement learning techniques.

1 PAPER • NO BENCHMARKS YET

MotionID: IMU all motions part1 (Motion Patterns Identification)

Dataset (part 1/3) for Motion Patterns Identification part of MotionID: Human Authentication Approach. Data type: bin (should be converted by attached notebook).

1 PAPER • NO BENCHMARKS YET

MotionID: IMU all motions part2 (Motion Patterns Identification)

Dataset (part 2/3) for Motion Patterns Identification part of MotionID: Human Authentication Approach. Data type: bin (should be converted by attached notebook).

1 PAPER • NO BENCHMARKS YET

MotionID: IMU all motions part3 (Motion Patterns Identification)

Dataset (part 3/3) for Motion Patterns Identification part of MotionID: Human Authentication Approach. Data type: bin (should be converted by attached notebook).

1 PAPER • NO BENCHMARKS YET

MotionID: IMU specific motion (User verification)

Dataset for User Verification part of MotionID: Human Authentication Approach. Data type: bin (should be converted by attached notebook). ~50 hours of IMU (Inertial Measurement Units) data for one specific motion pattern, provided by 101 users.

1 PAPER • NO BENCHMARKS YET

SDN (Situated Dialogue Navigation)

Situated Dialogue Navigation (SDN) is a navigation benchmark of 183 trials with a total of 8415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is developed to evaluate the agent's ability to predict dialogue moves from humans as well as generate its own dialogue moves and physical navigation actions.

1 PAPER • NO BENCHMARKS YET

Two4Two

Two4Two (A Synthetic Dataset For Controlled Experiments)

Two4Two is a library to create synthetic image data crafted for human evaluations of interpretable ML approaches (esp. image classification). The synthetic images show two abstract animals: Peaky (arms inwards) and Stretchy (arms outwards). They are similar-looking, abstract animals, made of eight blocks. The core functionality of this library is that one can correlate different parameters with an animal type to create bias in the data.

1 PAPER • NO BENCHMARKS YET

UR5 Tool Dataset

In this dataset UR5 robot used 6 tools: metal-scissor, metal-whisk, plastic-knife, plastic-spoon, wooden-chopstick, and wooden-fork to perform 6 behaviors: look, stirring-slow, stirring-fast, stirring-twist, whisk, and poke. The robot explored 15 objects: cane-sugar, chia-seed, chickpea, detergent, empty, glass-bead, kidney-bean, metal-nut-bolt, plastic-bead, salt, split-green-pea, styrofoam-bead, water, wheat, and wooden-button kept cylindrical containers. The robot performed 10 trials on each object using a tool, resulting in 5,400 interactions (6 tools x 6 behaviors x 15 objects x 10 trials). The robot records multiple sensory data (audio, RGB images, depth images, haptic, and touch images) while interacting with the objects.

1 PAPER • NO BENCHMARKS YET

VFD-2000

VFD-2000 is a video fight detection dataset containing more than 2000 videos. YouTube is the data source. Specific scenarios are searched using “fight” as a search keyword, for example, “street fight”, “beach fight”, and “violence in the restaurant”. 200 videos under 20 different scenes are collected.

1 PAPER • NO BENCHMARKS YET

Visuomotor affordance learning (VAL) robot interaction dataset

This data contains about 2500 trajectories (with images and actions) of a Sawyer robot interacting with various objects.

1 PAPER • NO BENCHMARKS YET

Volunteer task execution events in Galaxy Zoo and The Milky Way citizen science projects

Context of the data sets The Zooniverse platform (www.zooniverse.org) has successfully built a large community of volunteers contributing to citizen science projects. Galaxy Zoo and the Milky Way Project were hosted there.

1 PAPER • NO BENCHMARKS YET

l2d (Learning to Dance)

This dataset is composed of paired videos of people dancing 3 different music styles: Ballet, Michael Jackson and Salsa. It contains multimodal data (visual data, temporal-graphs and audio) careful-selected from publicly available videos of dancers performing representative movements of the music style and audio data from the respective styles.

1 PAPER • NO BENCHMARKS YET

Datasets

50 dataset results for Actions