The Human3.6M dataset is one of the largest motion capture datasets, consisting of 3.6 million human poses and corresponding images captured by a high-speed motion capture system. Video data was acquired at 50 Hz by 4 high-resolution progressive-scan cameras. The dataset contains activities by 11 professional actors in 17 scenarios (discussion, smoking, taking photo, talking on the phone, etc.) and provides accurate 3D joint positions and high-resolution videos.
768 PAPERS • 16 BENCHMARKS
The Leeds Sports Pose (LSP) dataset is a widely used benchmark for human pose estimation. The original LSP dataset contains 2,000 images of sportspersons gathered from Flickr, 1,000 for training and 1,000 for testing. Each image is annotated with 14 joint locations, where left and right joints are consistently labelled from a person-centric viewpoint. The extended LSP dataset contains an additional 10,000 images labeled for training.
202 PAPERS • 1 BENCHMARK
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
93 PAPERS • 5 BENCHMARKS
Dataset produced for the SAPIEN simulation environment. From the website: "The PartNet-Mobility dataset is a collection of 2K articulated objects with motion annotations and rendering materials. The dataset powers research for generalizable computer vision and manipulation. The dataset is a continuation of ShapeNet and PartNet."
75 PAPERS • NO BENCHMARKS YET
A hand-object interaction dataset with 3D pose annotations of the hand and object. The dataset contains 66,034 training images and 11,524 test images from a total of 68 sequences. The sequences are captured in multi-camera and single-camera setups and contain 10 different subjects manipulating 10 different objects from the YCB dataset. The annotations are obtained automatically using an optimization algorithm. The hand pose annotations for the test set are withheld, and the accuracy of algorithms on the test set can be evaluated with standard metrics via the CodaLab challenge submission (see project page). The object pose annotations for both the training and test sets are provided with the dataset.
35 PAPERS • 2 BENCHMARKS
MuCo-3DHP is a large-scale training dataset showing real images of sophisticated multi-person interactions and occlusions.
35 PAPERS • NO BENCHMARKS YET
ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with an industry-grade 3D CAD model with absolute model size and semantically labelled keypoints. This dataset is over 20 times larger than PASCAL3D+ and KITTI, the current state of the art.
17 PAPERS • 14 BENCHMARKS
Contains bounding boxes for object detection, instance segmentation masks, keypoints for pose estimation, and depth images for each synthetic scene, as well as images of each individual seat for classification.
15 PAPERS • NO BENCHMARKS YET
Multi-View Operating Room (MVOR) is a dataset recorded during real clinical interventions. It consists of 732 synchronized multi-view frames recorded by three RGB-D cameras in a hybrid OR. It exhibits the visual challenges typical of such environments, such as occlusions and clutter.
11 PAPERS • NO BENCHMARKS YET
Unite The People is a dataset for 3D body estimation. The images come from the Leeds Sports Pose dataset and its extended version, as well as single-person images from the MPII Human Pose Dataset. The images are labeled with several types of annotations, such as segmentation labels, 2D pose, and 3D body information.
10 PAPERS • NO BENCHMARKS YET
UnrealEgo is a dataset that provides in-the-wild stereo images with a large variety of motions for 3D human pose estimation. The stereo data consists of fisheye images and depth maps with a resolution of 1024×1024 pixels each, captured at 25 frames per second, for a total of 450k stereo views (900k images). Metadata is provided for each frame, including 3D joint positions, camera positions, and the 2D coordinates of the joints reprojected into the fisheye views.
9 PAPERS • 1 BENCHMARK
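As a hedged illustration of the reprojection mentioned above (UnrealEgo's exact camera model is not specified here; this sketch assumes an ideal equidistant fisheye model, in which image radius is proportional to the angle from the optical axis, with hypothetical intrinsics f, cx, cy):

```python
import numpy as np

def project_equidistant_fisheye(points_cam, f, cx, cy):
    """Project 3D points given in camera coordinates into a fisheye
    image, assuming the ideal equidistant model r = f * theta.

    points_cam: (..., 3) array of 3D points in front of the camera.
    Returns (..., 2) pixel coordinates.
    """
    x, y, z = points_cam[..., 0], points_cam[..., 1], points_cam[..., 2]
    theta = np.arctan2(np.hypot(x, y), z)  # angle from the optical axis
    phi = np.arctan2(y, x)                 # azimuth around the axis
    r = f * theta                          # equidistant mapping
    return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)
```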
We provide manual annotations of 14 semantic keypoints for 100,000 car instances (sedan, SUV, bus, and truck) from 53,000 images captured by 18 moving cameras at multiple intersections in Pittsburgh, PA. Download links are sent by email after filling out the Google Form on the project page.
8 PAPERS • 2 BENCHMARKS
HUMAN4D is a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap system, a volumetric capture system, and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered in single- and multi-person daily, physical, and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric, and audio data.
8 PAPERS • NO BENCHMARKS YET
Multi-view imagery of people interacting with a variety of rich 3D environments.
7 PAPERS • 2 BENCHMARKS
Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited: they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context, and weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: (1) some nuisance factors have a much stronger negative effect on performance than others, depending on the vision task; (2) current approaches to enhance robustness have only marginal effects, and can even reduce robustness; (3) we do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich testbed to study robustness and will help advance research in this area.
6 PAPERS • 1 BENCHMARK
We introduce HARPER, a novel dataset for 3D body pose estimation and forecasting in dyadic interactions between users and Spot, the quadruped robot manufactured by Boston Dynamics. The key novelty is the focus on the robot's perspective, i.e., on the data captured by the robot's sensors. These make 3D body pose analysis challenging because, being close to the ground, they capture humans only partially. The scenario underlying HARPER includes 15 actions, of which 10 involve physical contact between the robot and users. The corpus contains not only the recordings of Spot's built-in stereo cameras, but also those of a 6-camera OptiTrack system (all recordings are synchronized). This yields ground-truth skeletal representations with sub-millimeter precision. In addition, the corpus includes reproducible benchmarks on 3D human pose estimation, human pose forecasting, and collision prediction, all based on publicly available baseline approaches. This enables future HARPER users to compare their methods against reproducible baselines.
5 PAPERS • 3 BENCHMARKS
PedX is a large-scale multi-modal collection of pedestrians at complex urban intersections. The dataset provides high-resolution stereo images and LiDAR data with manual 2D and automatic 3D annotations. The data was captured using two pairs of stereo cameras and four Velodyne LiDAR sensors.
5 PAPERS • NO BENCHMARKS YET
Accurate 3D human pose estimation is essential for sports analytics, coaching, and injury prevention. However, existing datasets for monocular pose estimation do not adequately capture the challenging and dynamic nature of sports movements. In response, we introduce SportsPose, a large-scale 3D human pose dataset consisting of highly dynamic sports movements. With more than 176,000 3D poses from 24 subjects performing 5 different sports activities, SportsPose provides a diverse and comprehensive set of 3D poses that reflect the complex and dynamic nature of sports movements. In contrast to other markerless datasets, we have quantitatively evaluated the precision of SportsPose by comparing our poses with a commercial marker-based system, achieving a mean error of 34.5 mm across all evaluation sequences. This is comparable to the error reported on the commonly used 3DPW dataset. We further introduce a new metric, local movement, which describes the movement of the wrist and ankle joints relative to the body.
4 PAPERS • NO BENCHMARKS YET
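For context, a mean error such as the 34.5 mm reported above is typically a mean per-joint position error (MPJPE). Here is a minimal sketch, assuming (frames, joints, 3) arrays; SportsPose's exact protocol (e.g., any rigid alignment step) may differ:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: the Euclidean distance between
    predicted and ground-truth 3D joints, averaged over joints and
    frames. Result is in the input units (e.g., mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```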
HSPACE (Human-SPACE) is a large-scale photo-realistic dataset of animated humans placed in complex synthetic indoor and outdoor environments. For all frames, the dataset provides 3D pose and shape ground truth, as well as other rich image annotations including human segmentation, body-part localisation semantics, and temporal correspondences.
3 PAPERS • 1 BENCHMARK
Includes 100K depth images in challenging scenarios.
3 PAPERS • 2 BENCHMARKS
Mirrored-Human is a dataset for 3D pose estimation from a single view. It covers a large variety of human subjects, poses, and backgrounds. The images are collected from the internet and consist of people in front of mirrors, where both the person and the reflected image are visible. Actions cover dancing, fitness, mirror installation, swing practice, etc.
3 PAPERS • NO BENCHMARKS YET
This vision and perception research dataset was collected in Rome and features RGB data, 3D point clouds, IMU, and GPS data. We introduce a new benchmark targeting visual odometry and SLAM to advance research in autonomous robotics and computer vision. This work complements existing datasets by simultaneously addressing several issues, such as environment diversity, motion patterns, and sensor frequency. It uses up-to-date devices and presents effective procedures to accurately calibrate the intrinsics and extrinsics of the sensors while addressing temporal synchronization. The recordings cover multi-floor buildings, gardens, and urban and highway scenarios. Combining handheld and car-based data collection, our setup can simulate any robot (quadrupeds, quadrotors, autonomous vehicles). The dataset includes an accurate 6-DoF ground truth based on a novel methodology that refines the RTK-GPS estimate with LiDAR point clouds through bundle adjustment.
The dataset is built specifically to address a range of computer vision problems (2D-3D tracking, posture) faced by biologists when designing behavior studies with animals.
2 PAPERS • NO BENCHMARKS YET
The DRACO20K dataset is used for evaluating object canonicalization methods that estimate a canonical frame from a monocular input image.
Event-Human3.6m is a challenging dataset for event-based human pose estimation, built by simulating events from the RGB Human3.6m dataset. The RGB recordings of Human3.6m are converted into events, and the raw joint ground truth is synchronised with the event frames through interpolation.
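To illustrate the principle of simulating events from RGB video (the dataset's authors presumably used a dedicated event simulator; this toy sketch is only an assumption-laden illustration), an event camera is commonly modeled as firing whenever the log intensity at a pixel changes by more than a contrast threshold:

```python
import numpy as np

def frames_to_events(frames, timestamps, threshold=0.2):
    """Toy event simulation from grayscale video frames.

    Emits an event (t, x, y, polarity) whenever the log intensity at a
    pixel has changed by more than `threshold` since the last event at
    that pixel; the sign of the change gives the polarity.
    """
    log_ref = np.log(frames[0].astype(np.float64) + 1e-3)
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_cur = np.log(frame.astype(np.float64) + 1e-3)
        diff = log_cur - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for x, y in zip(xs, ys):
            events.append((t, int(x), int(y), 1 if diff[y, x] > 0 else -1))
            log_ref[y, x] = log_cur[y, x]  # reset reference at fired pixels
    return events
```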
The largest, first-of-its-kind, in-the-wild, fine-grained workout/exercise posture analysis dataset, covering three different exercises: BackSquat, Barbell Row, and Overhead Press. Seven different types of exercise errors are covered. Unlabeled data is also provided to facilitate self-supervised learning.
A simulated benchmark for evaluating multi-modal SLAM systems in large-scale dynamic environments.
This dataset comprises the 3D building information model (in IFC and Revit formats), manually elaborated from the terrestrial laser scan of sequence 2 of ConSLAM, and the refined ground-truth (GT) poses (in TUM format) of sessions 2, 3, 4, and 5 of the open-access ConSLAM dataset (which provides camera, LiDAR, and IMU measurements).
Accidental Turntables contains a challenging set of 41,212 images of cars with cluttered backgrounds, motion blur, and illumination changes, serving as a benchmark for 3D pose estimation.
1 PAPER • NO BENCHMARKS YET
ConSLAM is a real-world dataset collected periodically on a construction site to measure the accuracy of mobile scanners' SLAM algorithms.
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside deforming ones in order to establish an anchoring reference. However, this assumption does not hold in certain relevant applications such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings let us obtain a vast amount of data and ground-truth labels, including camera poses, RGB images and depth, and optical flow and normal maps at high resolution and quality.
1 PAPER • 1 BENCHMARK
A new large-scale dataset that consists of 409 fine-grained categories and 31,881 images with accurate 3D pose annotation.
Human pose estimation (HPE) in the top view using fisheye cameras presents a promising and innovative application domain. However, the availability of datasets capturing this viewpoint is extremely limited, especially those with high-quality 2D and 3D keypoint annotations. Addressing this gap, we leverage the capabilities of the Neural Radiance Fields (NeRF) technique to establish a comprehensive pipeline for generating human pose datasets from existing 2D and 3D datasets, specifically tailored to the top-view fisheye perspective. Through this pipeline, we create NToP (NeRF-powered Top-view human Pose dataset for fisheye cameras), a novel dataset with over 570 thousand images, and conduct an extensive evaluation of its efficacy in enhancing neural networks for 2D and 3D top-view human pose estimation. Extensive evaluations on existing top-view 2D and 3D HPE datasets, as well as our new real-world top-view 2D HPE dataset OmniLab, prove that our dataset is effective and exceeds previous datasets.
The Store Dataset is a dataset for estimating 3D poses of multiple humans in real-time. It is captured inside two kinds of simulated stores with 12 and 28 cameras, respectively.
This is a pose estimation dataset, consisting of symmetric 3D shapes where multiple orientations are visually indistinguishable. The challenge is to predict all equivalent orientations when only one orientation is paired with each image during training (as is the scenario for most pose estimation datasets). In contrast to most pose estimation datasets, the full set of equivalent orientations is available for evaluation.
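Scoring against a full set of equivalent orientations is typically done by taking the minimum error over that set. A minimal sketch, assuming rotation-matrix ground truth (this dataset's official metric may differ):

```python
import numpy as np

def min_rotation_error(R_pred, R_equiv):
    """Smallest geodesic distance (in radians) between a predicted
    rotation and any of the equivalent ground-truth orientations.

    R_pred: (3, 3) rotation matrix.
    R_equiv: (N, 3, 3) stack of equivalent ground-truth rotations.
    """
    # trace(R_pred^T R_gt) = 1 + 2*cos(theta) for the relative angle theta
    traces = np.einsum('ij,nij->n', R_pred, R_equiv)
    cos_theta = np.clip((traces - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos_theta).min())
```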
InfiniteRep is a synthetic, open-source dataset for fitness and physical therapy (PT) applications. It includes 1k videos of diverse avatars performing multiple repetitions of common exercises, with significant variation in the environment, lighting conditions, avatar demographics, and movement trajectories. From cadence to kinematic trajectory, each rep is done slightly differently -- just like real humans. InfiniteRep videos are accompanied by a rich set of pixel-perfect labels and annotations, including frame-specific repetition counts.
0 PAPER • NO BENCHMARKS YET