ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. To date, ScanNet v2, the newest version of ScanNet, has collected 1513 annotated scans with approximately 90% surface coverage. For the semantic segmentation task, the dataset is annotated with 20 classes of 3D voxelized objects.
960 PAPERS • 16 BENCHMARKS
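The ScanNet semantic segmentation benchmark is typically scored with mean intersection-over-union (mIoU) over the 20 annotated classes. Below is a minimal sketch of that metric over per-voxel label arrays; the function and the ignore label are illustrative, not the official evaluation script.

```python
import numpy as np

NUM_CLASSES = 20  # ScanNet v2 semantic segmentation uses 20 classes

def mean_iou(pred, gt, num_classes=NUM_CLASSES, ignore_label=-1):
    """Mean intersection-over-union over per-voxel (or per-point) labels."""
    valid = gt != ignore_label
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))
```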
The NYU-Depth V2 dataset comprises video sequences from a variety of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect.
733 PAPERS • 19 BENCHMARKS
The SUN RGB-D dataset contains 10,335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and segmentation map. As many as 700 object categories are labeled. The training and testing sets contain 5,285 and 5,050 images, respectively.
364 PAPERS • 13 BENCHMARKS
NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It comprises 56,880 samples of 60 action classes collected from 40 subjects. The actions can be broadly divided into three categories: 40 daily actions (e.g., drinking, eating, reading), nine health-related actions (e.g., sneezing, staggering, falling down), and 11 mutual actions (e.g., punching, kicking, hugging). These actions take place under 17 different scene conditions corresponding to 17 video sequences (i.e., S001–S017). The actions were captured using three cameras with different horizontal imaging viewpoints, namely −45°, 0°, and +45°. Multi-modality information is provided for action characterization, including depth maps, 3D skeleton joint positions, RGB frames, and infrared sequences. Performance is evaluated with a cross-subject test that splits the 40 subjects into training and test groups, and with a cross-view test that uses one camera (+45°) for testing and the other two cameras for training.
360 PAPERS • 18 BENCHMARKS
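A minimal sketch of the two NTU RGB+D evaluation protocols, assuming the standard file-naming convention (SxxxCxxxPxxxRxxxAxxx encodes setup, camera, performer, replication, and action); the training-subject list is the one published by the dataset authors, while the helper itself is illustrative rather than official tooling.

```python
# Sample names follow SxxxCxxxPxxxRxxxAxxx, e.g. "S001C002P003R002A013".
CS_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19,
                     25, 27, 28, 31, 34, 35, 38}  # training subjects per the dataset authors

def split(sample_name, protocol="cross-subject"):
    """Return 'train' or 'test' for one sample under the chosen protocol."""
    camera = int(sample_name[5:8])    # Cxxx field
    subject = int(sample_name[9:12])  # Pxxx field
    if protocol == "cross-subject":
        return "train" if subject in CS_TRAIN_SUBJECTS else "test"
    # cross-view: camera 1 is held out for testing, cameras 2 and 3 for training
    return "test" if camera == 1 else "train"

print(split("S001C002P003R002A013"))  # 'test' (subject 3 is not a training subject)
```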
The Matterport3D dataset is a large RGB-D dataset for scene understanding in indoor environments. It contains 10,800 panoramic views inside 90 real building-scale scenes, constructed from 194,400 RGB-D images. Each scene is a residential building consisting of multiple rooms and floor levels, and is annotated with surface reconstructions, camera poses, and semantic segmentations.
324 PAPERS • 4 BENCHMARKS
SUNCG is a large-scale dataset of synthetic 3D scenes with dense volumetric annotations.
178 PAPERS • NO BENCHMARKS YET
TUM RGB-D is an RGB-D dataset. It contains the color and depth images of a Microsoft Kinect sensor along the ground-truth trajectory of the sensor. The data was recorded at full frame rate (30 Hz) and sensor resolution (640x480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz).
163 PAPERS • NO BENCHMARKS YET
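Each TUM RGB-D sequence ships rgb.txt and depth.txt files listing "timestamp filename" pairs, and color and depth frames are matched by nearest timestamp. The sketch below is a simplified re-implementation of that association step; the official toolkit provides an associate.py script, and this only mirrors its idea.

```python
def read_file_list(path):
    """Parse a TUM RGB-D list file: lines of '<timestamp> <filename>'."""
    entries = {}
    with open(path) as f:
        for line in f:
            if line.startswith('#') or not line.strip():
                continue
            stamp, name = line.split()[:2]
            entries[float(stamp)] = name
    return entries

def associate(rgb, depth, max_difference=0.02):
    """Match color and depth frames by nearest timestamp (simplified)."""
    depth_stamps = sorted(depth)
    matches = []
    for t_rgb in sorted(rgb):
        t_depth = min(depth_stamps, key=lambda t: abs(t - t_rgb))
        if abs(t_depth - t_rgb) <= max_difference:
            matches.append((t_rgb, rgb[t_rgb], t_depth, depth[t_depth]))
    return matches

pairs = associate(read_file_list("rgb.txt"), read_file_list("depth.txt"))
```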
The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. It provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames.
129 PAPERS • 5 BENCHMARKS
SUN3D is a large-scale RGB-D video database with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is composed of 415 sequences captured in 254 different spaces in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.
104 PAPERS • NO BENCHMARKS YET
ALFRED (Action Learning From Realistic Environments and Directives) is a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.
97 PAPERS • NO BENCHMARKS YET
A hand-object interaction dataset with 3D pose annotations of the hand and object. The dataset contains 66,034 training images and 11,524 test images from a total of 68 sequences. The sequences are captured in multi-camera and single-camera setups and contain 10 different subjects manipulating 10 different objects from the YCB dataset. The annotations are automatically obtained using an optimization algorithm. The hand pose annotations for the test set are withheld, and the accuracy of algorithms on the test set can be evaluated with standard metrics via the CodaLab challenge submission (see project page). The object pose annotations for the test and training sets are provided along with the dataset.
69 PAPERS • NO BENCHMARKS YET
T-LESS is a dataset for estimating the 6D pose, i.e. translation and rotation, of texture-less rigid objects. The dataset features thirty industry-relevant objects with no significant texture and no discriminative color or reflectance properties. The objects exhibit symmetries and mutual similarities in shape and/or size. Compared to other datasets, a unique property is that some of the objects are parts of others. The dataset includes training and test images that were captured with three synchronized sensors, specifically a structured-light RGB-D sensor, a time-of-flight RGB-D sensor, and a high-resolution RGB camera. There are approximately 39K training and 10K test images from each sensor. Additionally, two types of 3D models are provided for each object: a manually created CAD model and a semi-automatically reconstructed one. Training images depict individual objects against a black background. Test images originate from twenty test scenes of varying complexity.
67 PAPERS • 2 BENCHMARKS
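For illustration, a 6D pose is simply a 3D rotation plus a 3D translation, usually handled as a 4x4 rigid transform. The sketch below is generic and not tied to the T-LESS ground-truth file format.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_matrix(rotation_xyz_deg, translation_xyz):
    """Pack 3 rotation + 3 translation parameters into a 4x4 rigid transform
    mapping model coordinates into camera coordinates."""
    T = np.eye(4)
    T[:3, :3] = R.from_euler("xyz", rotation_xyz_deg, degrees=True).as_matrix()
    T[:3, 3] = translation_xyz
    return T

# e.g. an object rotated 90 degrees about its z-axis, 0.5 m in front of the camera
print(pose_matrix([0.0, 0.0, 90.0], [0.0, 0.0, 0.5]))
```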
ETH3D is a multi-view stereo / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground-truth geometry was obtained using a high-precision laser scanner. A DSLR camera as well as a synchronized multi-camera rig with varying fields of view were used to capture images.
65 PAPERS • 1 BENCHMARK
The SBU Kinect Interaction dataset (version 2.0) comprises RGB-D video sequences of humans performing interaction activities, recorded using the Microsoft Kinect sensor. The dataset was originally recorded for a class project and must be used only for research purposes. If you use this dataset in your work, please cite the following paper: Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, and Dimitris Samaras, "Two-person Interaction Detection Using Body-Pose Features and Multiple Instance Learning", The 2nd International Workshop on Human Activity Understanding from 3D Data (HAU3D-CVPRW), CVPR 2012.
64 PAPERS • 4 BENCHMARKS
The CAD-60 and CAD-120 datasets comprise RGB-D video sequences of humans performing activities, recorded using the Microsoft Kinect sensor. Being able to detect human activities is important for making personal assistant robots useful in performing assistive tasks. The CAD dataset comprises twelve different activities (composed of several sub-activities) performed by four people in different environments, such as a kitchen, a living room, and an office.
59 PAPERS • 1 BENCHMARK
SceneNN is an RGB-D scene dataset consisting of more than 100 indoor scenes. The scenes are captured at various places, e.g., offices, dormitories, classrooms, and pantries, at the University of Massachusetts Boston and the Singapore University of Technology and Design. All scenes are reconstructed into triangle meshes and have per-vertex and per-pixel annotations. The dataset is additionally enriched with fine-grained information such as axis-aligned bounding boxes, oriented bounding boxes, and object poses.
53 PAPERS • 1 BENCHMARK
The EYEDIAP dataset is a dataset for gaze estimation from remote RGB and RGB-D (standard vision and depth) cameras. The recording methodology was designed to systematically include, and isolate, most of the variables that affect remote gaze estimation algorithms.
45 PAPERS • 2 BENCHMARKS
REAL275 is a benchmark for category-level pose estimation. It contains 4,300 training frames, 950 validation frames, and 2,750 test frames across 18 different real scenes.
42 PAPERS • 1 BENCHMARK
DIODE (Dense Indoor/Outdoor DEpth) is the first standard dataset for monocular depth estimation comprising diverse indoor and outdoor scenes acquired with the same hardware setup. The training set consists of 8,574 indoor and 16,884 outdoor samples from 20 scans each. The validation set contains 325 indoor and 446 outdoor samples, with each set drawn from 10 different scans. The ground-truth density for the indoor training and validation splits is approximately 99.54% and 99%, respectively. The density of the outdoor sets is naturally lower, at 67.19% for the training and 78.33% for the validation subset. The indoor and outdoor depth ranges for the dataset are 50 m and 300 m, respectively.
39 PAPERS • 2 BENCHMARKS
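The density figures above are the fraction of pixels that carry a valid ground-truth depth value. A minimal sketch of that computation, assuming a per-sample boolean validity mask (the mask name and shapes are illustrative):

```python
import numpy as np

def valid_depth_density(valid_mask):
    """Fraction of pixels carrying a valid ground-truth depth value."""
    return float(np.count_nonzero(valid_mask)) / valid_mask.size

# e.g. a 1024x768 sample missing a thin band of laser returns
mask = np.ones((768, 1024), dtype=bool)
mask[:4, :] = False
print(f"{valid_depth_density(mask):.2%}")  # ~99.48%
```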
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
39 PAPERS • 1 BENCHMARK
UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition. The dataset was collected by a flying UAV in multiple urban and rural districts in both daytime and nighttime over three months, hence covering extensive diversity w.r.t. subjects, backgrounds, illumination, weather, occlusions, camera motion, and UAV flying attitudes. This dataset can be used for UAV-based human behavior understanding, including action recognition, pose estimation, re-identification, and attribute recognition.
32 PAPERS • 5 BENCHMARKS
AVD focuses on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely captured in 9 unique scenes.
28 PAPERS • 1 BENCHMARK
InteriorNet is an RGB-D dataset for large-scale interior scene understanding and mapping. The dataset contains 20M images created by an automated pipeline.
25 PAPERS • NO BENCHMARKS YET
A dataset containing RGB-D data of 4 large scenes, comprising a total of 12 rooms, for the purpose of RGB and RGB-D camera relocalization. The RGB-D data was captured using a Structure.io depth sensor coupled with an iPad color camera. Each room was scanned multiple times, and the multiple sequences were run through a global bundle adjustment in order to obtain globally aligned camera poses through all sequences of the same scene.
22 PAPERS • NO BENCHMARKS YET
The ReDWeb dataset consists of 3,600 RGB-RD (RGB and relative depth) image pairs collected from the Web. This dataset covers a wide range of scenes and features various non-rigid objects.
22 PAPERS • NO BENCHMARKS YET
The Wide Multi Channel Presentation Attack (WMCA) database consists of 1941 short video recordings of both bonafide and presentation attacks from 72 different identities. The data is recorded from several channels including color, depth, infra-red, and thermal.
22 PAPERS • 1 BENCHMARK
How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth. A three-hour subset was further recorded in the Panoptic Studio, enabling detailed 3D pose estimation.
21 PAPERS • 3 BENCHMARKS
ARKitScenes is an RGB-D dataset captured with the widely available Apple LiDAR scanner. Along with the per-frame raw data (Wide Camera RGB, Ultra Wide Camera RGB, LiDAR scanner depth, IMU), the authors also provide the estimated ARKit camera pose and ARKit scene reconstruction for each iPad Pro sequence. In addition to the raw and processed data from the mobile device, ARKitScenes includes high-resolution depth maps captured with a stationary laser scanner, as well as manually labeled 3D oriented bounding boxes for room-defining objects.
19 PAPERS • 1 BENCHMARK
The REALY benchmark introduces a region-aware evaluation pipeline to measure the fine-grained normalized mean square error (NMSE) of 3D face reconstruction methods from under-controlled image sets.
19 PAPERS • 2 BENCHMARKS
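As a rough illustration of a region-aware NMSE, the sketch below computes a normalized mean squared vertex error per facial region over already-aligned meshes. The region masks and the normalization are assumptions made here for illustration; the official REALY pipeline performs its own region-wise retopology, alignment, and normalization.

```python
import numpy as np

def region_nmse(pred_vertices, gt_vertices, region_masks):
    """Per-region normalized mean squared vertex error.

    pred_vertices, gt_vertices: (N, 3) arrays of corresponding, already
    aligned vertices. region_masks: dict of boolean masks over the N
    vertices (e.g. 'nose', 'mouth', 'forehead', 'cheek').
    """
    errors = {}
    for name, mask in region_masks.items():
        diff = pred_vertices[mask] - gt_vertices[mask]
        mse = np.mean(np.sum(diff ** 2, axis=1))
        # normalize by the ground-truth region extent (an assumption here;
        # REALY defines its own normalization and alignment)
        scale = np.linalg.norm(gt_vertices[mask].max(0) - gt_vertices[mask].min(0))
        errors[name] = mse / (scale ** 2 + 1e-12)
    return errors
```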
The Drive&Act dataset is a state-of-the-art multi-modal benchmark for driver behavior recognition. The dataset includes 3D skeletons in addition to frame-wise hierarchical labels for 9.6 million frames captured from 6 different views and 3 modalities (RGB, IR, and depth).
17 PAPERS • 1 BENCHMARK
CDTB (color-and-depth visual object tracking) is a dataset recorded with several passive and active RGB-D setups and contains indoor as well as outdoor sequences acquired in direct sunlight. The sequences were recorded to contain significant object pose change, clutter, occlusion, and periods of long-term target absence to enable tracker evaluation under realistic conditions. Sequences are per-frame annotated with 13 visual attributes for detailed analysis. The dataset contains around 100,000 samples. Source: https://www.vicos.si/Projects/CDTB
16 PAPERS • NO BENCHMARKS YET
Developing robot perception systems for handling objects in the real world requires computer vision algorithms to be carefully scrutinized with respect to the expected operating domain. This demands large quantities of ground-truth data to rigorously evaluate the performance of algorithms.
16 PAPERS • 1 BENCHMARK
Washington RGB-D is a widely used testbed in the robotics community, consisting of 41,877 RGB-D images organized into 300 instances divided into 51 classes of common indoor objects (e.g., scissors, cereal boxes, keyboards). Each object instance was positioned on a turntable and captured from three different viewpoints while rotating.
15 PAPERS • NO BENCHMARKS YET
BEHAVE is a full-body human-object interaction dataset with multi-view RGB-D frames and corresponding 3D SMPL and object fits, along with annotated contacts between them. The dataset contains ~15k frames captured at 5 locations, with 8 subjects performing a wide range of interactions with 20 common objects.
14 PAPERS • NO BENCHMARKS YET
First-Person Hand Action Benchmark is a collection of RGB-D video sequences comprising more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations.
13 PAPERS • 2 BENCHMARKS
The dataset was collected using the Intel RealSense D435i camera, which was configured to produce synchronized accelerometer and gyroscope measurements at 400 Hz, along with synchronized VGA-size (640 x 480) RGB and depth streams at 30 Hz. The depth frames are acquired using active stereo and are aligned to the RGB frames using the sensor's factory calibration. All measurements are timestamped.
12 PAPERS • 1 BENCHMARK
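A minimal pyrealsense2 sketch of the capture configuration described above, streaming VGA depth and color at 30 Hz and aligning depth to the RGB frame with the factory calibration. This is illustrative and not the capture code used by the dataset authors; IMU streams and recording are omitted.

```python
import pyrealsense2 as rs

# Stream synchronized VGA depth and color at 30 Hz.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Re-project depth into the color camera using the factory calibration.
align = rs.align(rs.stream.color)
try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    depth = aligned.get_depth_frame()
    color = aligned.get_color_frame()
    print(depth.get_timestamp(), color.get_timestamp())  # per-frame timestamps
finally:
    pipeline.stop()
```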
The ScanNet200 benchmark studies 200-class 3D semantic segmentation, an order of magnitude more class categories than previous 3D scene understanding benchmarks. The source of scene data is identical to ScanNet, but it parses a larger vocabulary for semantic and instance segmentation.
11 PAPERS • 2 BENCHMARKS
This dataset accompanies the paper on synthesizing the 3D Ken Burns effect from a single image. It consists of 134,041 captures from 32 virtual environments, where each capture consists of 4 views. Each view contains color, depth, and normal maps at a resolution of 512x512 pixels.
9 PAPERS • NO BENCHMARKS YET
The EgoDexter dataset provides both 2D and 3D pose annotations for 4 testing video sequences with 3190 frames. The videos are recorded with a body-mounted camera from egocentric viewpoints and contain cluttered backgrounds, fast camera motion, and complex interactions with various objects. Fingertip positions were manually annotated for 1485 out of 3190 frames.
9 PAPERS • NO BENCHMARKS YET
GraspNet-1Billion provides large-scale training data and a standard evaluation platform for the task of general robotic grasping. The dataset contains 97,280 RGB-D images with over one billion grasp poses.
9 PAPERS • 1 BENCHMARK
The Hands in Action (HIC) dataset contains RGB-D sequences of hands interacting with objects.
8 PAPERS • NO BENCHMARKS YET
The High-Quality Wide Multi-Channel Attack database (HQ-WMCA) database consists of 2904 short multi-modal video recordings of both bona-fide and presentation attacks. There are 555 bonafide presentations from 51 participants and the remaining 2349 are presentation attacks. The data is recorded from several channels including color, depth, thermal, infrared (spectra), and short-wave infrared (spectra).
8 PAPERS • NO BENCHMARKS YET
DeepLoc is a large-scale urban outdoor localization dataset. The dataset is currently comprised of one scene spanning an area of 110 x 130 m, which a robot traverses multiple times with different driving patterns. The dataset creators use a LiDAR-based SLAM system with sub-centimeter and sub-degree accuracy to compute the pose labels that are provided as ground truth. Poses in the dataset are spaced approximately 0.5 m apart, which is twice as dense as other relocalization datasets.
7 PAPERS • NO BENCHMARKS YET
The HRWSI dataset consists of about 21K diverse high-resolution RGB-D image pairs derived from web stereo images. It also provides sky segmentation masks, instance segmentation masks, and invalid-pixel masks.
7 PAPERS • NO BENCHMARKS YET
The HandNet dataset contains depth images of 10 participants' hands non-rigidly deforming in front of a RealSense RGB-D camera. The annotations are generated by a magnetic annotation technique. 6D pose is available for the center of the hand as well as the five fingertips (i.e. position and orientation of each).
7 PAPERS • NO BENCHMARKS YET
This is a 3D action recognition dataset, also known as 3D Action Pairs dataset. The actions in this dataset are selected in pairs such that the two actions of each pair are similar in motion (have similar trajectories) and shape (have similar objects); however, the motion-shape relation is different.
7 PAPERS • 1 BENCHMARK
NTU RGB+D 2D is a curated version of NTU RGB+D often used for skeleton-based action prediction and synthesis. It contains a smaller number of actions.
7 PAPERS • 1 BENCHMARK
The Freiburg Forest dataset was collected using a Viona autonomous mobile robot platform equipped with cameras for capturing multi-spectral and multi-modal images. The dataset may be used for evaluation of different perception algorithms for segmentation, detection, classification, etc. All scenes were recorded at 20 Hz with a camera resolution of 1024x768 pixels. The data was collected on three different days to have enough variability in lighting conditions as shadows and sun angles play a crucial role in the quality of acquired images. The robot traversed about 4.7 km each day. The dataset creators provide manually annotated pixel-wise ground truth segmentation masks for 6 classes: Obstacle, Trail, Sky, Grass, Vegetation, and Void.
6 PAPERS • 2 BENCHMARKS