HOPE-Video (Household Objects for Pose Estimation)

Introduced by Tyree et al. in 6-DoF Pose Estimation of Household Objects for Robotic Manipulation: An Accessible Dataset and Benchmark

The HOPE-Video dataset contains 10 video sequences (2,038 frames), each showing 5-20 objects in a tabletop scene captured by a RealSense D415 RGBD camera mounted on a robot arm. In each sequence, the camera is moved to capture multiple views of a set of objects in the robotic workspace. Camera poses, initially provided by forward kinematics and RGB calibration from the RealSense to Baxter's wrist camera, were refined with COLMAP (keyframes at 6 fps). A dense 3D point cloud was then generated with CascadeStereo (included for each sequence as 'scene.ply'). Ground-truth poses for the HOPE object models in the world coordinate system were annotated manually using the CascadeStereo point clouds. The following are provided for each frame:

  • Camera intrinsics/extrinsics
  • RGB images (640×480)
  • Depth images (640×480)
  • 3D scene reconstruction from CascadeStereo
  • Object pose annotations in the camera frame
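The pose annotations and intrinsics above are enough to project object model points into an image. A minimal numpy sketch, assuming a 4×4 object-to-camera pose matrix and a 3×3 intrinsics matrix (the numeric values below are illustrative, not taken from the dataset):

```python
import numpy as np

def project_points(points_obj, pose_cam_obj, K):
    """Project 3D points from the object frame into pixel coordinates.

    points_obj   -- (N, 3) points in the object's frame (e.g., mesh vertices, cm)
    pose_cam_obj -- (4, 4) object-to-camera rigid transform
    K            -- (3, 3) camera intrinsics
    """
    pts_h = np.hstack([points_obj, np.ones((len(points_obj), 1))])  # homogeneous
    pts_cam = (pose_cam_obj @ pts_h.T).T[:, :3]   # points in the camera frame
    uv = (K @ pts_cam.T).T                        # perspective projection
    return uv[:, :2] / uv[:, 2:3]                 # normalize by depth

# Illustrative values only -- the real intrinsics ship with each frame.
K = np.array([[615.0,   0.0, 320.0],
              [  0.0, 615.0, 240.0],
              [  0.0,   0.0,   1.0]])
pose = np.eye(4)
pose[2, 3] = 50.0  # place the object 50 cm in front of the camera
print(project_points(np.zeros((1, 3)), pose, K))  # object origin -> image center
```

The object origin lands at the principal point (320, 240) here because it lies on the optical axis; real annotations place each of the 5-20 objects at its measured tabletop pose.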

The objects are a set of 28 toy grocery items selected for compatibility with robot manipulation and widespread availability. Textured models were generated with an EinScan-SE 3D scanner, units were converted to centimeters, and the centers and rotations of the meshes were aligned to a canonical pose.

