Cityscapes is a large-scale dataset focused on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5,000 finely annotated images and 20,000 coarsely annotated ones. Data was captured in 50 cities over several months, at different times of day, and in good weather conditions. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.
3,350 PAPERS • 54 BENCHMARKS
ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. To date, ScanNet v2, the newest version of ScanNet, has collected 1,513 annotated scans with approximately 90% surface coverage. For the semantic segmentation task, the dataset is annotated with 20 classes of 3D voxelized objects.
1,266 PAPERS • 19 BENCHMARKS
The Middlebury Stereo dataset consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data. The ground-truth disparities are acquired using a novel technique that employs structured lighting and does not require the calibration of the light projectors.
205 PAPERS • 5 BENCHMARKS
TUM RGB-D is an RGB-D dataset. It contains color and depth images from a Microsoft Kinect sensor along with the ground-truth trajectory of the sensor. The data was recorded at full frame rate (30 Hz) and sensor resolution (640x480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz).
192 PAPERS • NO BENCHMARKS YET
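Pairing the 30 Hz RGB-D frames with the 100 Hz motion-capture trajectory comes down to nearest-timestamp association. Below is a minimal sketch of that matching; the function name, tolerance, and timestamps are illustrative assumptions, not TUM's own tooling.

```python
import bisect

def associate(frame_ts, pose_ts, max_dt=0.02):
    """Match each frame timestamp to the closest pose timestamp
    within max_dt seconds; both lists must be sorted ascending."""
    pairs = []
    for i, t in enumerate(frame_ts):
        j = bisect.bisect_left(pose_ts, t)
        # Only the poses just before and just after t can be closest.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(pose_ts)]
        k = min(candidates, key=lambda c: abs(pose_ts[c] - t))
        if abs(pose_ts[k] - t) <= max_dt:
            pairs.append((i, k))
    return pairs

# Synthetic timestamps: 30 Hz frames vs. 100 Hz mocap poses.
frames = [n / 30.0 for n in range(3)]    # 0.000, 0.033..., 0.066...
poses = [n / 100.0 for n in range(10)]   # 0.00 ... 0.09
print(associate(frames, poses))          # [(0, 0), (1, 3), (2, 7)]
```

Because the pose stream is denser than the frame stream, every frame normally finds a pose within the tolerance; the `max_dt` guard only matters across recording gaps.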
SUNCG is a large-scale dataset of synthetic 3D scenes with dense volumetric annotations.
181 PAPERS • NO BENCHMARKS YET
Taskonomy provides a large and high-quality dataset of varied indoor scenes.
136 PAPERS • 2 BENCHMARKS
The 2D-3D-S dataset provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations. It covers over 6,000 m² collected in 6 large-scale indoor areas that originate from 3 different buildings. It contains over 70,000 RGB images, along with the corresponding depths, surface normals, semantic annotations, global XYZ images (all in forms of both regular and 360° equirectangular images) as well as camera information. It also includes registered raw and semantically annotated 3D meshes and point clouds. The dataset enables development of joint and cross-modal learning models and potentially unsupervised approaches utilizing the regularities present in large-scale indoor spaces.
130 PAPERS • 8 BENCHMARKS
The Make3D dataset is a monocular Depth Estimation dataset that contains 400 single training RGB and depth map pairs, and 134 test samples. The RGB images have high resolution, while the depth maps are provided at low resolution.
122 PAPERS • 1 BENCHMARK
Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.
121 PAPERS • 1 BENCHMARK
The MegaDepth dataset is a dataset for single-view depth prediction that includes 196 different locations reconstructed from COLMAP SfM/MVS.
117 PAPERS • NO BENCHMARKS YET
SUN3D is a large-scale RGB-D video database with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is composed of 415 sequences captured in 254 different spaces across 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.
115 PAPERS • NO BENCHMARKS YET
ETH3D is a multi-view stereo / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground-truth geometry was obtained using a high-precision laser scanner. A DSLR camera as well as a synchronized multi-camera rig with varying fields of view were used to capture images.
81 PAPERS • 1 BENCHMARK
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
63 PAPERS • 1 BENCHMARK
DIODE (Dense Indoor/Outdoor DEpth) is the first standard dataset for monocular depth estimation comprising diverse indoor and outdoor scenes acquired with the same hardware setup. The training set consists of 8,574 indoor and 16,884 outdoor samples from 20 scans each. The validation set contains 325 indoor and 446 outdoor samples, each set from 10 different scans. The ground-truth density for the indoor training and validation splits is approximately 99.54% and 99%, respectively. The density of the outdoor sets is naturally lower, at 67.19% for the training and 78.33% for the validation subset. The indoor and outdoor depth ranges for the dataset are 50 m and 300 m, respectively.
58 PAPERS • 2 BENCHMARKS
DDAD is a new autonomous driving benchmark from TRI (Toyota Research Institute) for long range (up to 250m) and dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360 degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting. DDAD contains scenes from urban settings in the United States (San Francisco, Bay Area, Cambridge, Detroit, Ann Arbor) and Japan (Tokyo, Odaiba).
57 PAPERS • 1 BENCHMARK
The Middlebury 2014 dataset contains a set of 23 high-resolution stereo pairs for which known camera calibration parameters and ground-truth disparity maps obtained with a structured-light scanner are available. The images in the Middlebury dataset all show static indoor scenes of varying difficulty, including repetitive structures, occlusions, wiry objects, and untextured areas.
52 PAPERS • 2 BENCHMARKS
DENSE (Depth Estimation oN Synthetic Events) is a new dataset with synthetic events and perfect ground truth.
38 PAPERS • 1 BENCHMARK
The MannequinChallenge Dataset (MQC) provides in-the-wild videos of people in static poses while a hand-held camera pans around the scene. The dataset consists of three splits for training, validation and testing.
26 PAPERS • NO BENCHMARKS YET
2D-3D Match Dataset is a new dataset of 2D-3D correspondences built by leveraging the availability of several 3D datasets from RGB-D scans. Specifically, the data from SceneNN and 3DMatch are used. The training dataset consists of 110 RGB-D scans, of which 56 scenes are from SceneNN and 54 scenes are from 3DMatch. The 2D-3D correspondence data is generated as follows. Given a 3D point randomly sampled from a 3D point cloud, a set of 3D patches from different scanning views is extracted. To find a 2D-3D correspondence, for each 3D patch, its 3D position is re-projected into all RGB-D frames for which the point lies in the camera frustum, taking occlusion into account. The corresponding local 2D patches around the re-projected point are extracted. In total, around 1.4 million 2D-3D correspondences are collected.
25 PAPERS • NO BENCHMARKS YET
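The re-projection step described above can be sketched with a pinhole camera model: transform the sampled 3D point into the camera frame, project it with the intrinsics, and keep the correspondence only if the frame's depth map confirms the point is actually visible. All names, intrinsics, and the occlusion tolerance below are illustrative assumptions, not the dataset's actual pipeline.

```python
import numpy as np

def project_point(p_world, R, t, K, depth_map, tol=0.05):
    """Return (u, v) pixel coordinates if p_world is visible in this
    frame, else None. R, t map world -> camera; K is 3x3 intrinsics."""
    p_cam = R @ p_world + t
    if p_cam[2] <= 0:                          # behind the camera
        return None
    uvw = K @ p_cam
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    h, w = depth_map.shape
    ui, vi = int(round(u)), int(round(v))
    if not (0 <= ui < w and 0 <= vi < h):      # outside the frustum
        return None
    # Occlusion check: the observed depth at the pixel must agree
    # with the point's depth in the camera frame.
    if abs(depth_map[vi, ui] - p_cam[2]) > tol:
        return None
    return u, v

# Toy frame: identity pose, focal length 100, principal point (32, 32),
# and a flat wall at z = 2 m as the depth map.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)
print(project_point(np.array([0.1, 0.0, 2.0]), np.eye(3), np.zeros(3), K, depth))
```

A point on the wall projects to a valid pixel, while a point behind the wall fails the occlusion check and yields no correspondence.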
A dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images.
24 PAPERS • 2 BENCHMARKS
The dataset was collected using the Intel RealSense D435i camera, which was configured to produce synchronized accelerometer and gyroscope measurements at 400 Hz, along with synchronized VGA-size (640 x 480) RGB and depth streams at 30 Hz. The depth frames are acquired using active stereo and are aligned to the RGB frames using the sensor's factory calibration. All the measurements are timestamped.
18 PAPERS • 1 BENCHMARK
The KITTI-Depth dataset includes depth maps from projected LiDAR point clouds that were matched against the depth estimates from the stereo cameras. The depth images are highly sparse, with only 5% of the pixels available and the rest missing. The dataset has 86k training images, 7k validation images, and 1k test images on the benchmark server with no access to the ground truth.
14 PAPERS • NO BENCHMARKS YET
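KITTI depth maps are commonly distributed as 16-bit PNGs in which depth in meters is the stored value divided by 256 and 0 marks a missing pixel. A hedged sketch of decoding such a map and measuring its sparsity (the array here is synthetic, and the encoding convention is an assumption based on the common KITTI devkit format):

```python
import numpy as np

def decode_sparse_depth(png_u16):
    """Decode a uint16 depth array: 0 = no LiDAR return,
    otherwise depth in meters = value / 256."""
    valid = png_u16 > 0
    depth_m = png_u16.astype(np.float32) / 256.0
    return depth_m, valid, valid.mean()

# Synthetic 4x5 map with a single valid pixel (5% density).
raw = np.zeros((4, 5), dtype=np.uint16)
raw[1, 2] = 256 * 10                            # encodes 10 m
depth_m, valid, density = decode_sparse_depth(raw)
print(depth_m[1, 2], int(valid.sum()), density)  # 10.0 1 0.05
```

In practice the validity mask matters as much as the depth itself: losses and evaluation metrics for depth completion are computed only over the valid pixels.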
Depth in the Wild is a dataset for single-image depth perception in the wild, i.e., recovering depth from a single image taken in unconstrained settings. It consists of images in the wild annotated with relative depth between pairs of random points.
9 PAPERS • NO BENCHMARKS YET
An in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix mobile social platform.
HUMAN4D is a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap system, a volumetric capture system, and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered in single- and multi-person daily, physical, and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric, and audio data.
8 PAPERS • NO BENCHMARKS YET
The Stanford Light Field Archive is a collection of several light fields for research in computer graphics and vision.
7 PAPERS • NO BENCHMARKS YET
DurLAR is a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near infrared) and reflectivity imagery for multi-modal autonomous driving applications. Compared to existing autonomous driving task datasets, DurLAR has the following novel features:
5 PAPERS • NO BENCHMARKS YET
Middlebury 2006 is a stereo dataset of indoor scenes with multiple handcrafted layouts.
The SYNS-Patches dataset is a subset of SYNS. The original SYNS is composed of aligned image and LiDAR panoramas from 92 different scenes belonging to a wide variety of environments, such as Agriculture, Natural (e.g. forests and fields), Residential, Industrial and Indoor. SYNS-Patches comprises the patches from each scene extracted at eye level at 20-degree intervals of a full horizontal rotation. This results in 18 images per scene and a total dataset size of 1,656 images.
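The per-scene count follows directly from the sampling described above: one eye-level patch every 20 degrees of a full rotation yields 18 views per scene, and 92 scenes yield 1,656 images. A tiny sketch of that arithmetic (assumed layout, not official extraction code):

```python
# One view per 20-degree heading over a full 360-degree rotation.
headings = list(range(0, 360, 20))    # 0, 20, ..., 340 degrees
views_per_scene = len(headings)       # 18 views per scene
total_images = 92 * views_per_scene   # over the 92 SYNS scenes
print(views_per_scene, total_images)  # 18 1656
```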
CocoDoom is a collection of pre-recorded data extracted from Doom gaming sessions, along with annotations in the MS COCO format.
4 PAPERS • NO BENCHMARKS YET
The DCM dataset is composed of 772 annotated images from 27 golden age comic books. We collected them from the free public domain collection of digitized comic books at the Digital Comics Museum. One album per available publisher was selected to get as many different styles as possible. We made ground-truth bounding boxes for all panels and all characters (body + faces), small or big, human-like or animal-like.
4 PAPERS • 3 BENCHMARKS
A large-scale synthetic dataset containing accurate ground-truth depth for various photo-realistic scenes.
3 PAPERS • NO BENCHMARKS YET
EDEN (Enclosed garDEN) is a multimodal synthetic dataset, a dataset for nature-oriented applications. The dataset features more than 300K images captured from more than 100 garden models. Each image is annotated with various low/high-level vision modalities, including semantic segmentation, depth, surface normals, intrinsic colors, and optical flow.
The endoscopic SLAM dataset (EndoSLAM) is a dataset for depth estimation from endoscopic videos. It consists of both ex-vivo and synthetically generated data. The ex-vivo part of the dataset includes standard as well as capsule endoscopy recordings. The dataset is divided into 35 sub-datasets: 18 for the colon, 5 for the small intestine, and 12 for the stomach.
We present FutureHouse, a new large-scale photorealistic panoramic dataset with the following characteristics. 1) It contains over 70,000 high-quality models with high-resolution meshes and physical materials. All models are measured to real-world standards. 2) The selected scene layouts were carefully designed by over 100 skilled artists, and all of them are used in real-world display. 3) It contains 28,579 high-quality panoramic views from 1,752 house-scale scenes; it can therefore be used for perspective-image tasks as well as omnidirectional-image tasks. 4) Richer physical material representation: most materials are represented by a microfacet BRDF modeling metalness, and the rest are represented by special shading models, e.g., cloth and transmission materials. 5) High rendering quality: benefiting from a commercial rendering engine, Unreal Engine 4, and deep learning super sampling (DLSS), our renderings have less noise.
We present a large-scale dataset for 3D urban scene understanding. Compared to existing datasets, our dataset consists of 75 outdoor urban scenes with diverse backgrounds, encompassing over 15,000 images. These scenes offer 360° hemispherical views, capturing diverse foreground objects illuminated under various lighting conditions. Additionally, our dataset encompasses scenes that are not limited to forward-driving views, addressing the limitations of previous datasets such as limited overlap and coverage between camera views. The closest pre-existing dataset for generalizable evaluation is DTU [2] (80 scenes), which comprises mostly indoor objects and does not provide multiple foreground objects or background scenes.
3 PAPERS • 1 BENCHMARK
UASOL is an RGB-D stereo dataset that contains 160,902 frames filmed at 33 different scenes, each with between 2k and 10k frames. The frames show different paths from the perspective of a pedestrian, including sidewalks, trails, roads, etc. The images were extracted from video files at 15 fps at HD2K resolution with a size of 2280×1282 pixels. The dataset also provides a GPS geolocation tag for each second of the sequences and reflects different climatological conditions. Up to 4 different people filmed the dataset at different moments of the day.
The eBDtheque database is a selection of one hundred comic pages from America, Japan (manga) and Europe.
4D Light Field Dataset is a light field benchmark consisting of 24 carefully designed synthetic, densely sampled 4D light fields with highly accurate disparity ground truth.
2 PAPERS • NO BENCHMARKS YET
ENRICH is a new synthetic, multi-purpose dataset for testing photogrammetric and computer vision algorithms. Compared to existing datasets, ENRICH offers higher-resolution images rendered under different lighting conditions, camera orientations, scales, and fields of view. Specifically, ENRICH is composed of three sub-datasets: ENRICH-Aerial, ENRICH-Square, and ENRICH-Statue, each exhibiting different characteristics. The dataset is useful for several photogrammetry- and computer-vision-related tasks, such as the evaluation of hand-crafted and deep-learning-based local features, the effect of ground-control-point (GCP) configuration on 3D accuracy, and monocular depth estimation.
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains -- Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.
Mila Simulated Floods Dataset is a 1.5 square km virtual world built with the Unity3D game engine, including urban, suburban and rural areas.
2 PAPERS • 1 BENCHMARK
Pano3D is a new benchmark for depth estimation from spherical panoramas. Its goal is to drive progress on this task in a consistent and holistic manner. The Pano3D 360° depth estimation benchmark provides a standard Matterport3D train and test split, as well as a secondary GibsonV2 partitioning for training and testing. The latter is used for zero-shot cross-dataset transfer assessment and is decomposed into 3 different splits, each focusing on a specific generalization axis.
SuperCaustics is a simulation tool made in Unreal Engine for generating massive computer vision datasets that include transparent objects.
The Autonomous-driving StreAming Perception (ASAP) benchmark is a benchmark to evaluate the online performance of vision-centric perception in autonomous driving. It extends the 2Hz annotated nuScenes dataset by generating high-frame-rate labels for the 12Hz raw images.
1 PAPER • NO BENCHMARKS YET
A dataset of 100K synthetic images of skin lesions, ground-truth (GT) segmentations of lesions and healthy skin, GT segmentations of seven body parts (head, torso, hips, legs, feet, arms and hands), and GT binary masks of non-skin regions in the texture maps of 215 scans from the 3DBodyTex.v1 dataset [2], [3] created using the framework described in [1]. The dataset is primarily intended to enable the development of skin lesion analysis methods. Synthetic image creation consisted of two main steps. First, skin lesions from the Fitzpatrick 17k dataset were blended onto skin regions of high-resolution three-dimensional human scans from the 3DBodyTex dataset [2], [3]. Second, two-dimensional renders of the modified scans were generated.
The HAMMER dataset contains 13 scenes. Each scene has two setups, with and without objects (with: the scene includes several objects with various surface materials; without: the scene contains only the background), and each scene has two camera trajectories. Each trajectory is composed of roughly 300 frames, which adds up to 16k frames in total (13 x 2 x 2 x 300). Each trajectory contains corresponding images from each camera: d435 (stereo), l515 (LiDAR, D-ToF), polarization (RGBP, RGB with polarization), and tof (I-ToF). Each camera folder contains the camera's intrinsics file and its recorded images, together with rendered depth ground truth, instance ground truth, and camera poses. All cameras are fully synchronized via the robotic arm's data acquisition setup.
The INRIA Dense Light Field Dataset (DLFD) is a dataset for testing depth estimation methods in a light field. DLFD contains 39 scenes with disparity range [-4,4] pixels. The light fields are of spatial resolution 512 x 512 and angular resolution 9 x 9.