The Middlebury Stereo dataset consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data. The ground-truth disparities are acquired using a novel technique that employs structured lighting and does not require the calibration of the light projectors.
176 PAPERS • 5 BENCHMARKS
MPI (Max Planck Institute) Sintel is a dataset for optical flow evaluation that has 1064 synthesized stereo images and ground-truth disparity data. Sintel is derived from the open-source 3D animated short film Sintel. The dataset has 23 different scenes. The stereo images are RGB while the disparity is grayscale. Both have a resolution of 1024×436 pixels and 8 bits per channel.
138 PAPERS • 4 BENCHMARKS
ETH3D is a multi-view stereo / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground-truth geometry has been obtained using a high-precision laser scanner. A DSLR camera as well as a synchronized multi-camera rig with varying fields of view were used to capture images.
43 PAPERS • 1 BENCHMARK
The Middlebury 2014 dataset contains a set of 23 high-resolution stereo pairs for which known camera calibration parameters and ground-truth disparity maps obtained with a structured-light scanner are available. The images in the Middlebury dataset all show static indoor scenes of varying difficulty, including repetitive structures, occlusions, wiry objects, and untextured areas.
38 PAPERS • 2 BENCHMARKS
The Web Stereo Video Dataset consists of 553 stereoscopic videos from YouTube. This dataset has a wide variety of scene types, and features many nonrigid objects.
9 PAPERS • NO BENCHMARKS YET
The Multi Vehicle Stereo Event Camera (MVSEC) dataset is a collection of data designed for the development of novel 3D perception algorithms for event-based cameras. Stereo event data is collected from a car, a motorbike, a hexacopter, and handheld rigs, and fused with lidar, IMU, motion capture, and GPS to provide ground-truth pose and depth images.
8 PAPERS • NO BENCHMARKS YET
Middlebury 2005 is a stereo dataset of indoor scenes.
7 PAPERS • NO BENCHMARKS YET
Middlebury 2006 is a stereo dataset of indoor scenes with multiple handcrafted layouts.
5 PAPERS • NO BENCHMARKS YET
Middlebury 2001 is a stereo dataset of indoor scenes with multiple handcrafted layouts.
3 PAPERS • NO BENCHMARKS YET
PedX is a large-scale multi-modal collection of pedestrians at complex urban intersections. The dataset provides high-resolution stereo images and LiDAR data with manual 2D and automatic 3D annotations. The data was captured using two pairs of stereo cameras and four Velodyne LiDAR sensors.
Endoscopic stereo reconstruction for surgical scenes gives rise to specific problems, including the lack of clear corner features, highly specular surface properties, and the presence of blood and smoke. These issues present difficulties both for stereo reconstruction itself and for standardised dataset production. We present a stereo-endoscopic reconstruction validation dataset based on cone-beam CT (SERV-CT). Two ex vivo small porcine full-torso cadavers were placed within the view of the endoscope, with both the endoscope and the target anatomy visible in the CT scan. The endoscope orientation was then manually aligned to match the stereoscopic view, and benchmark disparities, depths, and occlusions were calculated. The requirement of a CT scan limited the number of stereo pairs to 8 from each ex vivo sample. For the second sample, an RGB surface was acquired to aid alignment of smooth, featureless surfaces. Repeated manual alignments showed an RMS disparity accuracy of around
UASOL is an RGB-D stereo dataset that contains 160,902 frames filmed at 33 different scenes, each with between 2k and 10k frames. The frames show different paths from the perspective of a pedestrian, including sidewalks, trails, roads, etc. The images were extracted from 15 fps video files at HD2K resolution (2280 × 1282 pixels). The dataset also provides a GPS geolocation tag for each second of the sequences and reflects different weather conditions. Up to 4 different people filmed the dataset at different times of the day.
3 PAPERS • 1 BENCHMARK
Middlebury 2003 is a stereo dataset for indoor scenes.
2 PAPERS • NO BENCHMARKS YET
RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered from about 10,000 YouTube videos. For each clip, the poses form a trajectory where each pose specifies the camera position and orientation along the trajectory. These poses are derived by running SLAM and bundle adjustment algorithms on a large set of videos.
2 PAPERS • 1 BENCHMARK
This bus trajectory dataset was collected by 6 volunteers who were asked to travel across the suburban city of Durgapur, India, on intra-city buses (route name: 54 Feet). During their travel, the volunteers captured sensor logs through an Android application installed on COTS smartphones.
1 PAPER • NO BENCHMARKS YET
Given the difficulty of handling planetary data, we provide downloadable files in PNG format from the Chang'E-3 and Chang'E-4 missions, along with a set of scripts to perform the conversion given a different PDS4 dataset.
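To illustrate what such a conversion involves (a minimal sketch, not the authors' scripts), a PDS4 image array could be read with the pds4_tools package and written out as a PNG with Pillow; the file names below are hypothetical.

```python
import numpy as np
import pds4_tools              # NASA's official PDS4 reader
from PIL import Image

def pds4_to_png(label_path, out_path):
    """Read the first image array from a PDS4 product and save it as an 8-bit PNG."""
    structures = pds4_tools.read(label_path)            # parse the XML label and its data file
    data = np.asarray(structures[0].data, dtype=np.float64)
    lo, hi = float(data.min()), float(data.max())
    # Rescale to 0-255 so the PNG is viewable regardless of the original bit depth.
    scaled = np.zeros_like(data) if hi == lo else (data - lo) / (hi - lo) * 255.0
    Image.fromarray(scaled.astype(np.uint8)).save(out_path)

# Hypothetical file names; actual Chang'E products follow mission-specific naming.
pds4_to_png("PCAML_image_label.xml", "PCAML_image.png")
```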
The dataset has been generated using Town 1 and Town 2 of the CARLA Simulator. It consists of 48 camera configurations, 24 per town. The parameters modified to generate the configurations are fov, x, y, z, pitch, yaw, and roll, where fov is the field of view, (x, y, z) is the translation, and (pitch, yaw, roll) is the rotation between the cameras. The total number of image pairs is 79,320, of which 18,083 belong to Town 1 and 61,237 to Town 2; the difference in counts is due to the length of the tracks.
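As a hedged illustration (not part of the dataset's tooling) of how one such configuration maps to a relative camera pose, the translation and pitch/yaw/roll angles can be assembled into a 4×4 homogeneous transform; the axis conventions and units below are assumptions and should be checked against the CARLA documentation.

```python
import numpy as np

def camera_extrinsic(x, y, z, pitch, yaw, roll):
    """Build a 4x4 transform from a translation and pitch/yaw/roll angles (degrees).
    Assumes roll about x, pitch about y, yaw about z; verify against CARLA's conventions."""
    p, ya, r = np.radians([pitch, yaw, roll])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(r), -np.sin(r)],
                   [0, np.sin(r),  np.cos(r)]])   # roll about x
    Ry = np.array([[ np.cos(p), 0, np.sin(p)],
                   [0, 1, 0],
                   [-np.sin(p), 0, np.cos(p)]])   # pitch about y
    Rz = np.array([[np.cos(ya), -np.sin(ya), 0],
                   [np.sin(ya),  np.cos(ya), 0],
                   [0, 0, 1]])                    # yaw about z
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

# Example: second camera offset 0.5 units laterally and yawed by 5 degrees.
print(camera_extrinsic(0.0, 0.5, 0.0, pitch=0.0, yaw=5.0, roll=0.0))
```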
DurLAR is a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near-infrared) and reflectivity imagery for multi-modal autonomous driving applications. Compared to existing autonomous driving datasets, DurLAR offers several novel features.
A simulated benchmark for evaluating multi-modal SLAM systems in large-scale dynamic environments.
This dataset contains grayscale mono and stereo images (NavCam and LocCam) from laboratory tests performed by a prototype rover on a Mars-like testbed. The dataset can be used for artificial sample-tube detection and pose estimation. It also contains synthetic color images of the sample tube in a Martian scenario created with Unreal Engine.
Middlebury MVS is the earliest MVS dataset for multi-view stereo network evaluation. It contains two indoor objects with low-resolution (640 × 480) images and calibrated cameras.
A total of 80 real material samples were captured in a dark room. For each material, multiple captures were collected at different distances from the camera (between 250 and 650 mm) to observe both macro- and micro-level details. The dataset consists mostly of planar specimens but also includes non-planar objects such as mugs, globes, crumpled paper, etc. It contains a rich diversity of materials, including diffuse or specular wrapping papers, fabrics, anisotropic metals, plastics, rugs, ceramic and wood flooring samples, etc. Each capture set includes 12 LDR (8 bpp) RGB-D images at 4K pixel resolution. Each set is captured at 50% and 100% of maximum light intensity. In total, we captured 462 such image sets (combinations of light intensity, distance to the camera, and material sample).
The ARPA-E-funded TERRA-REF project is generating open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field-scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery as well as three-dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot-level phenotypes from these datasets.
Includes several sets of synthetic stereo images labelled with grasp rectangles representing parallel-jaw grasps (Cornell-like format).
THEOStereo is a dataset providing synthetic stereo image pairs and their corresponding scene depth, and will be published along with [1]. All images follow the omnidirectional camera model. In total, there are 31,250 omnidirectional image pairs. The training set contains 25,000 image pairs; the validation and test sets contain 3,125 image pairs each. For each pair, there is a ground-truth depth map describing the pixel-wise distance of the object along the left camera's z-axis. The virtual omnidirectional cameras exhibit an FOV of 180 degrees and can be described using Kannala's camera model [2]. The distortion parameters are k_1 = 1 and k_2 = k_3 = k_4 = k_5 = 0. The length of the stereo camera's baseline was 0.3 AU (approx. 15 cm, not 30 cm!). Please do not forget to cite [1] if you use the dataset in your work. Thank you.
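For orientation (a minimal sketch, not the dataset's own code): with the coefficients quoted above, the Kannala-Brandt model reduces to the equidistant fisheye projection r(θ) = θ, so a 3D point in the left camera frame projects as below; the focal length and principal point used in the example are hypothetical.

```python
import numpy as np

def kannala_project(X, f, cx, cy, k=(1.0, 0.0, 0.0, 0.0, 0.0)):
    """Project a 3D point X = (x, y, z) with the Kannala-Brandt fisheye model:
    r(theta) = k1*theta + k2*theta^3 + k3*theta^5 + k4*theta^7 + k5*theta^9.
    With k = (1, 0, 0, 0, 0), as stated for THEOStereo, this is the equidistant model r = theta.
    f, cx, cy are an assumed focal length and principal point in pixels."""
    x, y, z = X
    theta = np.arctan2(np.hypot(x, y), z)                      # angle from the optical axis
    r = sum(ki * theta ** (2 * i + 1) for i, ki in enumerate(k))
    phi = np.arctan2(y, x)                                     # azimuth around the optical axis
    u = f * r * np.cos(phi) + cx
    v = f * r * np.sin(phi) + cy
    return u, v

# Hypothetical intrinsics: a 180-degree FOV maps theta = pi/2 towards the image border.
print(kannala_project((0.2, -0.1, 1.0), f=300.0, cx=512.0, cy=512.0))
```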
0 PAPER • NO BENCHMARKS YET