Data used for the paper SparsePoser: Real-time Full-body Motion Reconstruction from Sparse Data
1 PAPER • NO BENCHMARKS YET
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume to observe also static scene parts besides deforming scene parts in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard’s Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings lets us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality.
1 PAPER • 1 BENCHMARK
The IMUPoser Dataset is a dataset for estimating body pose using IMUs already in devices that many users own -- namely smartphones, smartwatches, and earbuds.
4 PAPERS • NO BENCHMARKS YET
Accurate 3D human pose estimation is essential for sports analytics, coaching, and injury prevention. However, existing datasets for monocular pose estimation do not adequately capture the challenging and dynamic nature of sports movements. In response, we introduce SportsPose, a large-scale 3D human pose dataset consisting of highly dynamic sports movements. With more than 176,000 3D poses from 24 different subjects performing 5 different sports activities, SportsPose provides a diverse and comprehensive set of 3D poses that reflect the complex and dynamic nature of sports movements. Contrary to other markerless datasets we have quantitatively evaluated the precision of SportsPose by comparing our poses with a commercial marker-based system and achieve a mean error of 34.5 mm across all evaluation sequences. This is comparable to the error reported on the commonly used 3DPW dataset. We further introduce a new metric, local movement, which describes the movement of the wrist and ankle
3 PAPERS • NO BENCHMARKS YET
SLOPER4D is a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation (GHPE) with human-scene interaction in the wild. It consists of 15 sequences of human motions, each of which has a trajectory length of more than 200 meters (up to 1,300 meters) and covers an area of more than 2,000 (up to 13,000), including more than 100K LiDAR frames, 300k video frames, and 500K IMU-based motion frames. With SLOPER4D, we provide a detailed and thorough analysis of two critical tasks, including camera-based 3D HPE and LiDAR-based 3D HPE in urban environments, and benchmark a new task, GHPE.
5 PAPERS • 1 BENCHMARK
Amateur Drawings is a dataset collected via the public demo of Animated Drawings, containing over 178,000 amateur drawings and corresponding user-accepted character bounding boxes, segmentation masks, and joint location annotations.
CHAIRS is a large-scale motion-captured f-AHOI dataset, consisting of 17.3 hours of versatile interactions between 46 participants and 81 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process, as well as realistic and physically plausible full-body interactions.
Human3.6M 3D WholeBody (H3WB) is a large scale dataset with 133 whole-body keypoint annotations on 100K images, made possible by a new multi-view pipeline. It is designed for the three new tasks : i) 3D whole-body pose lifting from 2D complete whole-body pose, ii) 3D whole-body pose lifting from 2D incomplete whole-body pose, iii) 3D whole-body pose estimation from a single RGB image.
2 PAPERS • 3 BENCHMARKS
CORSMAL is a dataset for estimating the position and orientation in 3D (or 6D pose) of an object from a single view. The dataset consists of 138,240 images of rendered hands and forearms holding 48 synthetic objects, split into 3 grasp categories over 30 real backgrounds.
2 PAPERS • NO BENCHMARKS YET
Halpe-FullBody is a full body keypoints dataset where each person has annotated 136 keypoints, including 20 for body, 6 for feet, 42 for hands and 68 for face. It is designed for the task of whole body human pose estimation.
HuPR is a human pose estimation benchmark is created using cross-calibrated mmWave radar sensors and a monocular RGB camera for cross-modality training of radar-based human pose estimation. This dataset contains 235 sequences of data in an indoor environment, with each sequence being one-minute long and totalling about 4 hour-long video data.
Data used for the paper Combining Motion Matching and Orientation Prediction to Animate Avatars for Consumer-Grade VR Devices.
Dataset page: https://github.com/mosamdabhi/MBW-Data
UnrealEgo is a dataset that provides in-the-wild stereo images with a large variety of motions for 3D human pose estimation. The in-the-wild stereo images are stereo fisheye images and depth maps with a resolution of 1024×1024 pixels each with 25 frames per second and a total of 450k (900k images) are captured for the dataset. Metadata is provided for each frame, including 3D joint positions, camera positions, and 2D coordinates of reprojected joint positions in the fisheye views.
8 PAPERS • 1 BENCHMARK
BRACE is a dataset for audio-conditioned dance motion synthesis challenging common assumptions for this task:
5 PAPERS • 2 BENCHMARKS
MPHOI-72 is a multi-person human-object interaction dataset that can be used for a wide variety of HOI/activity recognition and pose estimation/object tracking tasks. The dataset is challenging due to many body occlusions among the humans and objects. It consists of 72 videos captured from 3 different angles at 30 fps, with totally 26,383 frames and an average length of 12 seconds. It involves 5 humans performing in pairs, 6 object types, 3 activities and 13 sub-activities. The dataset includes color video, depth video, human skeletons, human and object bounding boxes.
The Few-Shot Object Learning (FewSOL) dataset can be used for object recognition with a few images per object. It contains 336 real-world objects with 9 RGB-D images per object from different views. Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. FewSOL dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and attribute recognition.
Throughout the history of art, the pose—as the holistic abstraction of the human body's expression—has proven to be a constant in numerous studies. However, due to the enormous amount of data that so far had to be processed by hand, its crucial role to the formulaic recapitulation of art-historical motifs since antiquity could only be highlighted selectively. This is true even for the now automated estimation of human poses, as domain-specific, sufficiently large data sets required for training computational models are either not publicly available or not indexed at a fine enough granularity. With the Poses of People in Art data set, we introduce the first openly licensed data set for estimating human poses in art and validating human pose estimators. It consists of 2,454 images from 22 art-historical depiction styles, including those that have increasingly turned away from lifelike representations of the body since the 19th century. A total of 10,749 human figures are precisely enclos
3 PAPERS • 1 BENCHMARK
MOTFront provides photo-realistic RGB-D images with their corresponding instance segmentation masks, class labels, 2D & 3D bounding boxes, 3D geometry, 3D poses and camera parameters. The MOTFront dataset comprises 2,381 unique indoor sequences with a total of 60,000 images and is based on the 3D-FRONT dataset.
Description: 10,000 People - Human Pose Recognition Data. This dataset includes indoor and outdoor scenes.This dataset covers males and females. Age distribution ranges from teenager to the elderly, the middle-aged and young people are the majorities. The data diversity includes different shooting heights, different ages, different light conditions, different collecting environment, clothes in different seasons, multiple human poses. For each subject, the labels of gender, race, age, collecting environment and clothes were annotated. The data can be used for human pose recognition and other tasks.
52 PAPERS • NO BENCHMARKS YET
Overview The goal: using simulation data to train neural networks to estimate the pose of a rover's camera with respect to a known target object
0 PAPER • NO BENCHMARKS YET
Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footage used in the dataset records different times of the day in an extensive range of environments containing variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, the dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.
12 PAPERS • 2 BENCHMARKS
The NVIDIA HOPE datasets consist of RGBD images and video sequences with labeled 6-DoF poses for 28 toy grocery objects. The toy grocery objects are readily available for purchase and have ideal size and weight for robotic manipulation. 3D textured meshes for generating synthetic training data are provided.
The HOPE-Video dataset contains 10 video sequences (2038 frames) with 5-20 objects on a tabletop scene captured by a robot arm-mounted RealSense D415 RGBD camera. In each sequence, the camera is moved to capture multiple views of a set of objects in the robotic workspace. First COLMAP was applied to refine the camera poses (keyframes at 6~fps) provided by forward kinematics and RGB calibration from RealSense to Baxter's wrist camera. 3D dense point cloud was then generated via CascadeStereo (included for each sequence in 'scene.ply'). Ground truth poses for the HOPE objects models in the world coordinate system were annotated manually using the CascadeStereo point clouds. The following are provided for each frame:
Largest, first-of-its-kind, in-the-wild, fine-grained workout/exercise posture analysis dataset, covering three different exercises: BackSquat, Barbell Row, and Overhead Press. Seven different types of exercise errors are covered. Unlabeled data is also provided to facilitate self-supervised learning.
The CIP dataset is composed of 2 subsets, containing low-cost (MPU9250) and high-end (MTwAwinda) Magnetic, Angular Rate, and Gravity (MARG) sensor data respectively. It provides data for the analysis of the complete inertial pose pipeline, from raw measurements, to sensor-to-segment calibration, multi-sensor fusion, skeleton kinematics, to the complete human pose. Multiple trials were collected with 21 and 10 subjects respectively, performing 6 types of movements (ranging from calibration, to daily-activities, range-of-motion and random). It presents a high degree of variability and complex dynamics while containing common sources of error found on real conditions. This amounts to 3.5M samples, synchronized with a ground-truth inertial motion capture system (Xsens) at 60hz. This dataset may contribute to assess, benchmark and develop novel algorithms for each of the pipelines' processing steps, with applications in classic or data-driven inertial pose estimation algorithms, human movem
InfiniteRep is a synthetic, open-source dataset for fitness and physical therapy (PT) applications. It includes 1k videos of diverse avatars performing multiple repetitions of common exercises. It includes significant variation in the environment, lighting conditions, avatar demographics, and movement trajectories. From cadence to kinematic trajectory, each rep is done slightly differently -- just like real humans. InfiniteRep videos are accompanied by a rich set of pixel-perfect labels and annotations, including frame-specific repetition counts.
HUMAN4D is a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap, a volumetric capture and an audio recording system. By capturing 2 female and $2$ male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered as part of single- and multi-person daily, physical and social activities (jumping, dancing, etc. ), along with multi-RGBD (mRGBD), volumetric and audio data.
8 PAPERS • NO BENCHMARKS YET
The HumanoidRobotPose dataset is a dataset for real-time pose estimation of humanoid robots.
This is a pose estimation dataset, consisting of symmetric 3D shapes where multiple orientations are visually indistinguishable. The challenge is to predict all equivalent orientations when only one orientation is paired with each image during training (as is the scenario for most pose estimation datasets). In contrast to most pose estimation datasets, the full set of equivalent orientations is available for evaluation.
UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition. The dataset was collected by a flying UAV in multiple urban and rural districts in both daytime and nighttime over three months, hence covering extensive diversities w.r.t subjects, backgrounds, illuminations, weathers, occlusions, camera motions, and UAV flying attitudes. This dataset can be used for UAV-based human behavior understanding, including action recognition, pose estimation, re-identification, and attribute recognition.
36 PAPERS • 5 BENCHMARKS
The InterHand2.6M dataset is a large-scale real-captured dataset with accurate GT 3D interacting hand poses, used for 3D hand pose estimation The dataset contains 2.6M labeled single and interacting hand frames.
43 PAPERS • 2 BENCHMARKS
The MERL-RAV (MERL Reannotation of AFLW with Visibility) Dataset contains over 19,000 face images in a full range of head poses. Each face is manually labeled with the ground-truth locations of 68 landmarks, with the additional information of whether each landmark is unoccluded, self-occluded (due to extreme head poses), or externally occluded. The images were annotated by professional labelers, supervised by researchers at Mitsubishi Electric Research Laboratories (MERL).
2 PAPERS • 2 BENCHMARKS
The SMOT dataset, Single sequence-Multi Objects Training, is collected to represent a practical scenario of collecting training images of new objects in the real world, i.e. a mobile robot with an RGB-D camera collects a sequence of frames while driving around a table to learning multiple objects and tries to recognize objects in different locations.
The DREAM dataset is introduce by the paper "Camera-to-Robot Pose Estimation from a Single Image" (ICRA 2020). This dataset consists of synthetic images (both with and without domain randomlization) of three different robot manipulators (Franka Emika’s Panda, Kuka’s LBR iiwa 7 R800, and Rethink Robotics’ Baxter) , as well as real-world images of Franka Emika’s Panda taken from various RGBD cameras (XBox 360 Kinect (XK), RealSense (RS), and Azure Kinect (AK)). Each instance in the dataset contains an RGB image, keypoint 3D/2D coordinates , global camera-to-robot transformation and joint state configurations (from both revolute and prismatic joint) of the robot. Tasks like estimating robot pose (camera pose) from a single RGB image, camera-to-robot calibration can be conducted and evaluated in this dataset.
DensePose-Track is a dataset of videos where selected frames are annotated in the traditional DensePose manner.
multi-view imagery of people interacting with a variety of rich 3D environments
7 PAPERS • 2 BENCHMARKS
The CrowdPose dataset contains about 20,000 images and a total of 80,000 human poses with 14 labeled keypoints. The test set includes 8,000 images. The crowded images containing homes are extracted from MSCOCO, MPII and AI Challenger.
83 PAPERS • 2 BENCHMARKS
The Fraunhofer IPA Bin-Picking dataset is a large-scale dataset comprising both simulated and real-world scenes for various objects (potentially having symmetries) and is fully annotated with 6D poses. A pyhsics simulation is used to create scenes of many parts in bulk by dropping objects in a random position and orientation above a bin. Additionally, this dataset extends the Siléane dataset by providing more samples. This allows to e.g. train deep neural networks and benchmark the performance on the public Siléane dataset
The MannequinChallenge Dataset (MQC) provides in-the-wild videos of people in static poses while a hand-held camera pans around the scene. The dataset consists of three splits for training, validation and testing.
26 PAPERS • NO BENCHMARKS YET
Multi-View Operating Room (MVOR) is a dataset recorded during real clinical interventions. It consists of 732 synchronized multi-view frames recorded by three RGB-D cameras in a hybrid OR. It also includes the visual challenges present in such environments, such as occlusions and clutter.
9 PAPERS • NO BENCHMARKS YET
We provide manual annotations of 14 semantic keypoints for 100,000 car instances (sedan, suv, bus, and truck) from 53,000 images captured from 18 moving cameras at Multiple intersections in Pittsburgh, PA. Please fill the google form to get a email with the download links:
8 PAPERS • 2 BENCHMARKS
DensePose-COCO is a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images and train DensePose-RCNN, to densely regress part-specific UV coordinates within every human region at multiple frames per second.
227 PAPERS • NO BENCHMARKS YET
The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. While other datasets outdoors exist, they are all restricted to a small recording volume. 3DPW is the first one that includes video footage taken from a moving phone camera.
331 PAPERS • 5 BENCHMARKS
The Pix3D dataset is a large-scale benchmark of diverse image-shape pairs with pixel-level 2D-3D alignment. Pix3D has wide applications in shape-related tasks including reconstruction, retrieval, viewpoint estimation, etc.
130 PAPERS • 5 BENCHMARKS
The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames.
146 PAPERS • 6 BENCHMARKS
A large-scale dataset named AIC (AI Challenger) with three sub-datasets, human keypoint detection (HKD), large-scale attribute dataset (LAD) and image Chinese captioning (ICC).
12 PAPERS • 1 BENCHMARK
The EgoDexter dataset provides both 2D and 3D pose annotations for 4 testing video sequences with 3190 frames. The videos are recorded with body-mounted camera from egocentric viewpoints and contain cluttered backgrounds, fast camera motion, and complex interactions with various objects. Fingertip positions were manually annotated for 1485 out of 3190 frames.
10 PAPERS • NO BENCHMARKS YET