KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odome
1,601 PAPERS • 93 BENCHMARKS
ShapeNet is a large scale repository for 3D CAD models developed by researchers from Stanford University, Princeton University and the Toyota Technological Institute at Chicago, USA. The repository contains over 300M models with 220,000 classified into 3,135 classes arranged using WordNet hypernym-hyponym relationships. ShapeNet Parts subset contains 31,693 meshes categorised into 16 common object classes (i.e. table, chair, plane etc.). Each shapes ground truth contains 2-5 parts (with a total of 50 part classes).
735 PAPERS • 10 BENCHMARKS
The SYNTHIA dataset is a synthetic dataset that consists of 9400 multi-viewpoint photo-realistic frames rendered from a virtual city and comes with pixel-level semantic annotations for 13 classes. Each frame has resolution of 1280 × 960.
238 PAPERS • 6 BENCHMARKS
BlendedMVS is a novel large-scale dataset, to provide sufficient training ground truth for learning-based MVS. The dataset was created by applying a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. Then, these mesh models were rendered to color images and depth maps.
11 PAPERS • NO BENCHMARKS YET
The PhotoShape dataset consists of photorealistic, relightable, 3D shapes produced by the work proposed in the work of Park et al. (2021).
3 PAPERS • 1 BENCHMARK
ACID consists of thousands of aerial drone videos of different coastline and nature scenes on YouTube. Structure-from-motion is used to get camera poses.
2 PAPERS • 1 BENCHMARK
The UASOL an RGB-D stereo dataset, that contains 160902 frames, filmed at 33 different scenes, each with between 2 k and 10 k frames. The frames show different paths from the perspective of a pedestrian, including sidewalks, trails, roads, etc. The images were extracted from video files with 15 fps at HD2K resolution with a size of 2280 × 1282 pixels. The dataset also provides a GPS geolocalization tag for each second of the sequences and reflects different climatological conditions. It also involved up to 4 different persons filming the dataset at different moments of the day.
2 PAPERS • 1 BENCHMARK
RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered from about 10,000 YouTube videos. For each clip, the poses form a trajectory where each pose specifies the camera position and orientation along the trajectory. These poses are derived by running SLAM and bundle adjustment algorithms on a large set of videos.
1 PAPER • 1 BENCHMARK
The shiny folder contains 8 scenes with challenging view-dependent effects used in our paper. We also provide additional scenes in the shiny_extended folder. The test images for each scene used in our paper consist of one of every eight images in alphabetical order.
1 PAPER • NO BENCHMARKS YET