The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames.
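As a concrete illustration of what a 6D pose annotation encodes, the sketch below applies a rotation R and translation t to 3D object model points and projects them into the image with pinhole intrinsics K. The pose, intrinsics, and model points here are placeholder values, not data read from YCB-Video.

```python
import numpy as np

# Placeholder ground-truth pose (object -> camera) and intrinsics;
# illustrative values only, not parsed from the dataset annotations.
R = np.eye(3)                      # 3x3 rotation matrix
t = np.array([0.0, 0.0, 0.5])      # translation in meters
K = np.array([[1066.8,    0.0, 313.0],
              [   0.0, 1067.5, 241.3],
              [   0.0,    0.0,   1.0]])

# Stand-in for the vertices of a YCB object mesh (N x 3, in meters).
model_points = np.random.rand(100, 3) * 0.1

# Transform model points into the camera frame, then project to pixels.
cam_points = model_points @ R.T + t
proj = cam_points @ K.T
pixels = proj[:, :2] / proj[:, 2:3]
print(pixels[:3])
```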
161 PAPERS • 5 BENCHMARKS
T-LESS is a dataset for estimating the 6D pose, i.e. translation and rotation, of texture-less rigid objects. The dataset features thirty industry-relevant objects with no significant texture and no discriminative color or reflectance properties. The objects exhibit symmetries and mutual similarities in shape and/or size. Compared to other datasets, a unique property is that some of the objects are parts of others. The dataset includes training and test images that were captured with three synchronized sensors, specifically a structured-light and a time-of-flight RGB-D sensor and a high-resolution RGB camera. There are approximately 39K training and 10K test images from each sensor. Additionally, two types of 3D models are provided for each object, i.e. a manually created CAD model and a semi-automatically reconstructed one. Training images depict individual objects against a black background. Test images originate from twenty test scenes of varying complexity, which increases from simple scenes with several isolated objects to very challenging scenes with multiple instances of several objects and a high amount of clutter and occlusion.
90 PAPERS • 2 BENCHMARKS
REAL275 is a benchmark for category-level pose estimation. It contains 4,300 training frames, 950 validation frames, and 2,750 test frames across 18 different real scenes.
65 PAPERS • 1 BENCHMARK
The LM (Linemod) dataset was introduced by Stefan Hinterstoisser and colleagues in their work on model-based training, detection, and pose estimation of texture-less 3D objects in heavily cluttered scenes. It provides 15 texture-less household objects, each with a 3D model and RGB-D test images annotated with ground-truth 6D poses in cluttered scenes.
34 PAPERS • 5 BENCHMARKS
Estimating camera motion in deformable scenes is a complex and open research challenge. Most existing non-rigid structure-from-motion techniques assume that static scene parts are observed alongside the deforming ones, so that they can serve as an anchoring reference. However, this assumption does not hold in relevant application cases such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings let us obtain a vast amount of data and ground-truth labels, including camera poses, RGB images, depth, optical flow, and normal maps at high resolution and quality.
2 PAPERS • 1 BENCHMARK
This dataset comprises the 3D building information model (in IFC and Revit formats), manually modeled from the terrestrial laser scan of sequence 2 of ConSLAM, and the refined ground-truth (GT) poses (in TUM format) of sessions 2, 3, 4, and 5 of the open-access ConSLAM dataset (which provides camera, LiDAR, and IMU measurements).
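Since the refined GT poses are distributed in TUM format (one pose per line as a timestamp, a translation, and a unit quaternion), a minimal parser might look like the following sketch; the file name used in the usage comment is hypothetical.

```python
import numpy as np

def load_tum_trajectory(path):
    """Parse a TUM-format trajectory file: one pose per line as
    'timestamp tx ty tz qx qy qz qw', with '#' lines treated as comments."""
    timestamps, poses = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            vals = [float(v) for v in line.split()]
            timestamps.append(vals[0])
            poses.append(vals[1:8])   # tx ty tz qx qy qz qw
    return np.array(timestamps), np.array(poses)

# Hypothetical file name; the actual GT pose files may be named differently.
# ts, poses = load_tum_trajectory("session_2_gt_poses.txt")
```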
2 PAPERS • NO BENCHMARKS YET
UW Indoor Scenes (UW-IS) Occluded dataset is curated using commodity hardware (Intel RealSense D435) to reflect real-world robotics scenarios. It consists of two completely different indoor environments. The first environment is a lounge where the objects are placed on a tabletop. The second environment is a mock warehouse setup where the objects are placed on a shelf. For each of these environments, we have RGB-D images from 36 videos comprising five to seven objects each, taken from distances up to approximately 2 m. The videos cover two different lighting conditions and three different levels of object separation for three object categories (i.e., kitchen objects, food items, and tools/miscellaneous). At the first level of object separation there is no object occlusion, at the second level some occlusion occurs, and at the third level the objects are placed extremely close together. Overall, the dataset considers 20 object classes.
The dataset includes synthetic data generated by rendering the 3D meshes of LM objects and several household objects in Blender, for training 6D pose estimation algorithms. The whole dataset contains synthetic data for 18 objects (13 from LM and 5 household objects), with 20,000 data samples for each object. Each data sample includes an RGB image in .png format and a depth image in .exr format, together with mask labels in .png format and ground-truth pose labels saved in .json files. Apart from the training data, the 3D meshes of the objects and pre-trained models of the 6D pose estimation algorithm are also included. The whole dataset takes approximately 1 TB of storage.
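Given the per-sample file formats described above (RGB .png, depth .exr, mask .png, pose .json), a sample loader could look like the sketch below. The file-naming scheme and the contents of the pose JSON are assumptions for illustration; the dataset's actual conventions may differ.

```python
import os
# Recent OpenCV builds disable EXR reading unless this is set before import.
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"

import json
import cv2

def load_sample(sample_dir, idx):
    """Load one synthetic sample, assuming hypothetical file names of the
    form 000123_rgb.png / _depth.exr / _mask.png / _pose.json."""
    stem = os.path.join(sample_dir, f"{idx:06d}")
    rgb = cv2.imread(stem + "_rgb.png", cv2.IMREAD_COLOR)
    depth = cv2.imread(stem + "_depth.exr", cv2.IMREAD_UNCHANGED)
    mask = cv2.imread(stem + "_mask.png", cv2.IMREAD_GRAYSCALE)
    with open(stem + "_pose.json") as f:
        pose = json.load(f)   # assumed to hold rotation/translation fields
    return rgb, depth, mask, pose
```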
1 PAPER • NO BENCHMARKS YET
The YCB-Ev dataset contains synchronized RGB-D frames and event data that enable evaluating 6DoF object pose estimation algorithms using these modalities. It provides ground-truth 6DoF object poses for the same 21 YCB objects that were used in the YCB-Video (YCB-V) dataset, allowing for cross-dataset algorithm performance evaluation. The dataset consists of 21 synchronized event and RGB-D sequences, totalling 13,851 frames (7 minutes and 43 seconds of event data). Notably, 12 of these sequences feature the same object arrangement as the YCB-V subset used in the BOP challenge.