SUN3D is a large-scale RGB-D video database comprising 415 sequences captured in 254 different spaces across 41 different buildings; 8 of the sequences are annotated, with each frame carrying a semantic segmentation of the objects in the scene and information about the camera pose. Moreover, some places have been captured multiple times at different moments of the day.
114 PAPERS • NO BENCHMARKS YET
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions (RGB and Velodyne scans) from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odometry challenge).
3,233 PAPERS • 141 BENCHMARKS
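As a rough illustration of handling KITTI's raw sensor data, each Velodyne scan is a flat binary file of float32 values in (x, y, z, reflectance) order; a minimal loading sketch (the file path is hypothetical):

```python
import numpy as np

def load_velodyne_scan(path):
    """Load a KITTI Velodyne .bin scan as an (N, 4) float32 array.

    Each point is stored as four float32 values:
    x, y, z (metres, sensor frame) and reflectance.
    """
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

# Hypothetical path; adjust to your local KITTI layout.
points = load_velodyne_scan("velodyne/000000.bin")
xyz, reflectance = points[:, :3], points[:, 3]
print(xyz.shape, reflectance.min(), reflectance.max())
```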
The NYU-Depth V2 data set comprises video sequences from a variety of indoor scenes as recorded by both the RGB and Depth cameras of the Microsoft Kinect. It features 1449 densely labeled pairs of aligned RGB and depth images, 464 scenes taken from 3 cities, and 407,024 unlabeled frames; each object is labeled with a class and an instance number.
844 PAPERS • 20 BENCHMARKS
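The labeled subset ships as a single MATLAB v7.3 file, which is HDF5 underneath and can be opened with h5py; a sketch, assuming the usual 'images' and 'depths' dataset names and the standard file name:

```python
import h5py
import numpy as np

# nyu_depth_v2_labeled.mat is a MATLAB v7.3 (HDF5) file.
with h5py.File("nyu_depth_v2_labeled.mat", "r") as f:
    # MATLAB stores arrays transposed relative to NumPy conventions.
    rgb = np.transpose(f["images"][0], (2, 1, 0))  # first frame, (480, 640, 3)
    depth = np.transpose(f["depths"][0], (1, 0))   # metric depth in metres
print(rgb.shape, depth.shape, float(depth.max()))
```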
TUM RGB-D is an RGB-D dataset. It contains the color and depth images of a Microsoft Kinect sensor along the ground-truth trajectory of the sensor. The data was recorded at full frame rate (30 Hz) and sensor resolution (640x480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz).
190 PAPERS • NO BENCHMARKS YET
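The ground-truth trajectory is a plain-text file with one pose per line in timestamp tx ty tz qx qy qz qw format (comment lines start with '#'), and the 16-bit depth PNGs use a scale factor of 5000, i.e. a pixel value of 5000 corresponds to 1 metre. A minimal parsing sketch:

```python
import numpy as np

def load_tum_trajectory(path):
    """Parse a TUM RGB-D groundtruth.txt file.

    Each non-comment line reads: timestamp tx ty tz qx qy qz qw.
    Returns (timestamps, translations, quaternions).
    """
    rows = [
        [float(v) for v in line.split()]
        for line in open(path)
        if not line.startswith("#")
    ]
    data = np.array(rows)
    return data[:, 0], data[:, 1:4], data[:, 4:8]

ts, trans, quat = load_tum_trajectory("groundtruth.txt")  # hypothetical path
# Depth PNGs: depth_in_metres = uint16_pixel_value / 5000.0
```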
The Make3D dataset is a monocular depth estimation dataset that contains 400 training pairs of RGB images and depth maps, and 134 test samples. The RGB images have high resolution, while the depth maps are provided at low resolution.
122 PAPERS • 1 BENCHMARK
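Because of this resolution mismatch, a common preprocessing step is to resample the depth maps to the image resolution (or vice versa); a generic sketch with PIL, where the array shapes are only illustrative and the actual Make3D file loading is not shown:

```python
import numpy as np
from PIL import Image

def resize_depth(depth, target_hw):
    """Bilinearly resample a float depth map to (H, W)."""
    h, w = target_hw
    img = Image.fromarray(depth.astype(np.float32), mode="F")
    return np.asarray(img.resize((w, h), resample=Image.BILINEAR))

# Illustrative shapes only: the RGB images are far larger than the depth maps.
low_res_depth = np.random.rand(55, 305).astype(np.float32)
print(resize_depth(low_res_depth, (2272, 1704)).shape)
```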
Middlebury 2006 is a stereo dataset of indoor scenes with multiple handcrafted layouts.
5 PAPERS • NO BENCHMARKS YET
The Middlebury Stereo dataset consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data. The ground-truth disparities are acquired using a novel technique that employs structured lighting and does not require the calibration of the light projectors.
205 PAPERS • 5 BENCHMARKS
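Recent Middlebury ground-truth disparities are distributed as PFM files: a short text header ("Pf" for grayscale, then width and height, then a scale whose sign gives the byte order), followed by raw floats stored bottom-to-top. A minimal reader (the file name is hypothetical):

```python
import numpy as np

def read_pfm(path):
    """Read a grayscale PFM file into a float32 array (e.g. a disparity map)."""
    with open(path, "rb") as f:
        assert f.readline().strip() == b"Pf"       # "Pf" marks a single-channel PFM
        width, height = map(int, f.readline().split())
        scale = float(f.readline())
        endian = "<" if scale < 0 else ">"         # negative scale = little-endian
        data = np.fromfile(f, dtype=endian + "f4", count=width * height)
    return data.reshape(height, width)[::-1]       # rows are stored bottom-to-top

disparity = read_pfm("disp0.pfm")
```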
A collection of high-quality 360° datasets with ground-truth depth annotations, built by re-using recently released large-scale 3D datasets and re-purposing them to 360° via rendering.
14 PAPERS • NO BENCHMARKS YET
4D Light Field Dataset is a light field benchmark consisting of 24 carefully designed synthetic, densely sampled 4D light fields with highly accurate disparity ground truth.
2 PAPERS • NO BENCHMARKS YET
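A densely sampled 4D light field can be handled as an array L[u, v, y, x] of views; fixing one angular and one spatial coordinate yields an epipolar plane image (EPI), whose line slopes encode disparity. A toy sketch on synthetic data, with the axis convention assumed:

```python
import numpy as np

# Toy light field: 9x9 angular views of 64x64 grayscale images,
# indexed as L[u, v, y, x] (angular row/column, spatial row/column).
L = np.random.rand(9, 9, 64, 64).astype(np.float32)

center_u, row_y = 4, 32
# Fix angular row u and spatial row y; varying angular column v against
# spatial column x gives a (v, x) epipolar plane image.
epi = L[center_u, :, row_y, :]   # shape (9, 64)
print(epi.shape)
```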
The DCM dataset is composed of 772 annotated images from 27 golden age comic books, collected from the public domain collection of digitized comic books of the Digital Comics Museum. One album per available publisher was selected to get as many different styles as possible. Ground-truth bounding boxes are provided for all panels and all characters (bodies and faces), whether small or large, human-like or animal-like.
4 PAPERS • 3 BENCHMARKS
DDAD is a new autonomous driving benchmark from TRI (Toyota Research Institute) for long range (up to 250m) and dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360 degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting. DDAD contains scenes from urban settings in the United States (San Francisco, Bay Area, Cambridge, Detroit, Ann Arbor) and Japan (Tokyo, Odaiba).
55 PAPERS • 1 BENCHMARK
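Generating depth ground truth from LiDAR boils down to the standard pinhole projection: points expressed in the camera frame are mapped through the intrinsic matrix K, and their z values fill a sparse depth image. A generic sketch with made-up intrinsics and points (this is not DDAD's own dgp toolkit, and no occlusion reasoning is done):

```python
import numpy as np

def project_to_depth_map(points_cam, K, hw):
    """Project (N, 3) camera-frame points into a sparse depth image."""
    h, w = hw
    z = points_cam[:, 2]
    front = z > 0                                   # keep points in front of the camera
    uvw = (K @ points_cam[front].T).T               # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = np.zeros((h, w), dtype=np.float32)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[inside], u[inside]] = z[front][inside]  # later points simply overwrite
    return depth

# Made-up intrinsics and points, for illustration only.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pts = np.array([[1.0, 0.5, 10.0], [-2.0, 0.0, 50.0]])
print(project_to_depth_map(pts, K, (480, 640)).max())
```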
DrivingStereo contains over 180k images covering a diverse set of driving scenarios, which is hundreds of times larger than the KITTI Stereo dataset. High-quality labels of disparity are produced by a model-guided filtering strategy from multi-frame LiDAR points.
41 PAPERS • NO BENCHMARKS YET
ETH3D is a multi-view stereo / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground-truth geometry was obtained using a high-precision laser scanner. A DSLR camera as well as a synchronized multi-camera rig with varying fields of view were used to capture images.
80 PAPERS • 1 BENCHMARK
The endoscopic SLAM dataset (EndoSLAM) is a dataset for depth estimation approaches in endoscopic videos. It consists of both ex-vivo and synthetically generated data. The ex-vivo part includes standard as well as capsule endoscopy recordings. The dataset is divided into 35 sub-datasets: 18 for the colon, 5 for the small intestine, and 12 for the stomach.
3 PAPERS • NO BENCHMARKS YET
An in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix mobile social platform.
9 PAPERS • NO BENCHMARKS YET
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
61 PAPERS • 1 BENCHMARK
The MegaDepth dataset is a dataset for single-view depth prediction that includes 196 different locations reconstructed via COLMAP SfM/MVS.
115 PAPERS • NO BENCHMARKS YET
A dataset for single-image 3D in the wild, consisting of detailed 3D geometry annotations for 140,000 images.
23 PAPERS • 2 BENCHMARKS
A new cross-season, scaleless monocular depth prediction dataset derived from the CMU Visual Localization dataset through structure from motion.
The Stanford Light Field Archive is a collection of several light fields for research in computer graphics and vision.
7 PAPERS • NO BENCHMARKS YET
Taskonomy provides a large and high-quality dataset of varied indoor scenes.
135 PAPERS • 2 BENCHMARKS
The Web Stereo Video Dataset consists of 553 stereoscopic videos from YouTube. This dataset has a wide variety of scene types, and features many nonrigid objects.
12 PAPERS • NO BENCHMARKS YET