A dataset capturing diverse visual data formats under varying luminance conditions, recorded with alternative vision sensors, either handheld or mounted on a car, traversing the same spaces repeatedly under different conditions.
10 PAPERS • NO BENCHMARKS YET
xR-EgoPose is a synthetic dataset for egocentric 3D human pose estimation. It consists of ~380 thousand photo-realistic egocentric camera images in a variety of indoor and outdoor spaces.
3D AffordanceNet is a dataset of 23k shapes for visual affordance understanding. It consists of 56,307 well-defined affordance annotations for 22,949 shapes covering 18 affordance classes and 23 semantic object categories.
9 PAPERS • 1 BENCHMARK
BLVD is a large-scale 5D semantics dataset collected by the Visual Cognitive Computing and Intelligent Vehicles Lab. It contains 654 high-resolution video clips comprising 120k frames, captured in Changshu, Jiangsu Province, China, where the Intelligent Vehicle Proving Center of China (IVPCC) is located. The frame rate is 10 fps for both RGB data and 3D point clouds. The dataset is fully annotated, yielding 249,129 3D annotations, 4,902 independent individuals for tracking with trajectories totaling 214,922 points, 6,004 valid fragments for 5D interactive event recognition, and 4,900 individuals for 5D intention prediction. These tasks span four kinds of scenarios depending on object density (low and high) and lighting conditions (daytime and nighttime).
9 PAPERS • NO BENCHMARKS YET
BuildingNet is a large-scale dataset of 3D building models whose exteriors are consistently labeled. The dataset consists of 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models. The dataset covers several building categories, such as houses, churches, skyscrapers, town halls, libraries, and castles.
Unite The People is a dataset for 3D body estimation. The images come from the Leeds Sports Pose dataset and its extended version, as well as single-person images from the MPII Human Pose Dataset. The images are labeled with different types of annotations, such as segmentation labels, 2D pose, or 3D annotations.
3DFAW contains 23k images, each annotated with 66 3D face keypoints.
8 PAPERS • 2 BENCHMARKS
4D-OR includes a total of 6,734 scenes, recorded by six calibrated RGB-D Kinect sensors mounted to the ceiling of the OR at one frame per second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses, and 3D bounding boxes for OR objects. Furthermore, we provide semantic scene graph (SSG) annotations for each step of the surgery, together with the clinical roles of all humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
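For intuition, a fused point cloud like the ones shipped with 4D-OR can be thought of as the union of per-sensor back-projections into a shared world frame. Below is a minimal sketch of that idea; the variable names, intrinsics, and extrinsics are illustrative assumptions, not the dataset's actual API.

```python
import numpy as np

def depth_to_points(depth, K, T_world_cam):
    """Back-project a depth image (meters) to world-frame 3D points.

    K: 3x3 camera intrinsics; T_world_cam: 4x4 camera-to-world extrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    # Pixel coordinates -> metric camera-frame points via the pinhole model.
    x = (u.ravel()[valid] - K[0, 2]) * z[valid] / K[0, 0]
    y = (v.ravel()[valid] - K[1, 2]) * z[valid] / K[1, 1]
    pts_cam = np.stack([x, y, z[valid], np.ones_like(x)], axis=0)
    return (T_world_cam @ pts_cam)[:3].T  # N x 3 world-frame points

# Fusing six calibrated sensors is then a concatenation in the shared frame:
# fused = np.vstack([depth_to_points(d, K_i, T_i) for d, K_i, T_i in frames])
```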
8 PAPERS • 1 BENCHMARK
We provide manual annotations of 14 semantic keypoints for 100,000 car instances (sedan, SUV, bus, and truck) from 53,000 images captured by 18 moving cameras at multiple intersections in Pittsburgh, PA. Please fill out the Google form to receive an email with the download links.
The ETH SfM (structure-from-motion) dataset is a benchmark dataset for 3D reconstruction. The benchmark investigates how different methods perform in building a 3D model from a set of available 2D images.
8 PAPERS • NO BENCHMARKS YET
The Fusion 360 Gallery Dataset contains rich 2D and 3D geometry data derived from parametric CAD models. The dataset is produced from designs submitted by users of the CAD package Autodesk Fusion 360 to the Autodesk Online Gallery. The dataset provides valuable data for learning how people design, including sequential CAD design data, designs segmented by modelling operation, and design hierarchy and connectivity data.
Housekeep is a benchmark to evaluate common sense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions specifying which objects need to be rearranged. The dataset captures where humans typically place objects in tidy and untidy houses, comprising 1,799 objects, 268 object categories, 585 placements, and 105 rooms.
Thingi10K is a dataset of 3D-printing models. Specifically, it contains 10,000 models from featured “things” on thingiverse.com, suitable for testing 3D printing techniques such as structural analysis, shape optimization, or solid geometry operations.
UnrealEgo is a dataset that provides in-the-wild stereo images with a large variety of motions for 3D human pose estimation. The stereo images are fisheye images and depth maps at a resolution of 1024×1024 pixels each, captured at 25 frames per second, for a total of 450k stereo views (900k images). Metadata is provided for each frame, including 3D joint positions, camera positions, and 2D coordinates of the reprojected joint positions in the fisheye views.
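The 2D coordinates in the metadata are reprojections of the 3D joints into the fisheye views. A minimal sketch of such a reprojection follows, assuming an equidistant fisheye model (r = f * theta); the dataset's actual camera calibration may differ.

```python
import numpy as np

def project_equidistant_fisheye(p_cam, f, cx, cy):
    """Project a 3D point in camera coordinates to fisheye pixel coordinates.

    Assumes an equidistant fisheye model; (f, cx, cy) are assumed
    calibration parameters, not values taken from UnrealEgo.
    """
    x, y, z = p_cam
    theta = np.arctan2(np.hypot(x, y), z)   # angle from the optical axis
    phi = np.arctan2(y, x)                  # azimuth around the axis
    r = f * theta                           # equidistant mapping
    return cx + r * np.cos(phi), cy + r * np.sin(phi)
```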
AKB-48 is a large-scale Articulated object Knowledge Base which consists of 2,037 real-world 3D articulated object models of 48 categories.
7 PAPERS • NO BENCHMARKS YET
The Argoverse 2 Sensor Dataset is a collection of 1,000 scenarios with 3D object tracking annotations. Each sequence in our training and validation sets includes annotations for all objects within five meters of the “drivable area” — the area in which it is possible for a vehicle to drive. The HD map for each scenario specifies the drivable area.
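A hedged sketch of that five-meter annotation criterion, using shapely; the list of drivable-area polygons from the HD map is a hypothetical input, not the dataset's actual map API.

```python
from shapely.geometry import Point, Polygon

def within_annotated_region(obj_xy, drivable_polygons, margin=5.0):
    """True if an object center lies within `margin` meters of any
    drivable-area polygon, mirroring the annotation criterion.

    `drivable_polygons`: hypothetical list of shapely Polygons taken
    from the scenario's HD map.
    """
    p = Point(obj_xy)
    return any(poly.distance(p) <= margin for poly in drivable_polygons)
```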
BIKED is a dataset of 4,500 individually designed bicycle models sourced from hundreds of designers. BIKED enables a variety of data-driven design applications for bicycles and generally supports the development of data-driven design methods. It comprises a variety of design information, including assembly images, component images, numerical design parameters, and class labels.
FPv1 (formerly FAUST-partial) is a 3D registration benchmark dataset created to address the lack of data variability in existing 3D registration benchmarks such as 3DMatch, ETH, and KITTI.
7 PAPERS • 1 BENCHMARK
Hi4D contains 4D textured scans of 20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D contains rich interaction-centric annotations in 2D and 3D, alongside accurately registered parametric body models.
The 2021 Kidney and Kidney Tumor Segmentation challenge (abbreviated KiTS21) is a competition in which teams compete to develop the best system for automatic semantic segmentation of renal tumors and surrounding anatomy.
This is a 3D action recognition dataset, also known as the 3D Action Pairs dataset. The actions are selected in pairs such that the two actions of each pair are similar in motion (similar trajectories) and shape (similar objects); however, the motion-shape relation differs between them.
Accurate lesion segmentation is critical in stroke rehabilitation research for the quantification of lesion burden and accurate image processing. Current automated lesion segmentation methods for T1-weighted (T1w) MRIs, commonly used in rehabilitation research, lack accuracy and reliability. Manual segmentation remains the gold standard, but it is time-consuming, subjective, and requires significant neuroanatomical expertise. Moreover, many methods developed with the earlier ATLAS v1.2 release report low accuracy, are not publicly accessible, or are improperly validated, limiting their utility to the field. Here we present ATLAS v2.0 (N=1271), a larger dataset of T1w stroke MRIs and manually segmented lesion masks that includes training (public, n=655), test (masks hidden, n=300), and generalizability (completely hidden, n=316) data. Algorithm development using this larger sample should lead to more robust solutions, and the hidden test and generalizability datasets allow for unbiased performance evaluation.
6 PAPERS • 1 BENCHMARK
CMD is a publicly available collection of hundreds of thousands of 2D maps and 3D grids containing different properties of the gas, dark matter, and stars from more than 2,000 different universes. The data has been generated from thousands of state-of-the-art (magneto-)hydrodynamic and gravity-only N-body simulations from the CAMELS project.
6 PAPERS • NO BENCHMARKS YET
Falling Things (FAT) is a dataset for advancing the state-of-the-art in object detection and 3D pose estimation in the context of robotics. It consists of 60k photorealistic, synthetically generated images with accurate 3D pose annotations for all objects.
In moving object segmentation of point cloud sequences, one must provide motion labels for each point of the test sequences 11-21. The input to all evaluated methods is a list of coordinates of the three-dimensional points along with their remission, i.e., the strength of the reflected laser beam, which depends on the properties of the surface that was hit. Each method must then output a label for each point of a scan, i.e., one full turn of the rotating LiDAR sensor. Here, only the static and moving object classes are distinguished.
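A minimal sketch of the expected per-point I/O, assuming the SemanticKITTI binary conventions (float32 x, y, z, remission per point; one uint32 label per point). The class IDs below are placeholders; consult the benchmark's official label map.

```python
import numpy as np

def load_scan(path):
    """Load one LiDAR turn as an N x 4 array of (x, y, z, remission),
    assuming float32 storage as in the SemanticKITTI binary format."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

def save_labels(path, moving_mask, static_id=9, moving_id=251):
    """Write one uint32 label per point (static vs. moving). The two
    class IDs are illustrative defaults, not authoritative values."""
    labels = np.where(moving_mask, moving_id, static_id).astype(np.uint32)
    labels.tofile(path)
```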
The MMBody dataset provides human body data with motion capture, GT mesh, Kinect RGBD, and millimeter wave sensor data. See homepage for more details.
Memory Maze is a 3D domain of randomized mazes designed for evaluating the long-term memory abilities of RL agents. Memory Maze isolates long-term memory from confounding challenges, such as exploration, and requires remembering several pieces of information: the positions of objects, the wall layout, and the agent’s own position.
Mindboggle is a large publicly available dataset of manually labeled brain MRI. It consists of 101 subjects collected from different sites, with cortical meshes varying from 102K to 185K vertices. Each brain surface contains 25 or 31 manually labeled parcels.
Generating high-quality 3D ground-truth shapes for reconstruction evaluation is extremely challenging, because even 3D scanners can only produce pseudo ground-truth shapes with artefacts. We propose a novel data capturing and 3D annotation pipeline to obtain precise 3D ground-truth shapes without relying on expensive 3D scanners. The key to creating precise 3D ground truth is using LEGO models, which are made of LEGO bricks with known geometry. The MobileBrick dataset provides a unique opportunity for future research on high-quality 3D reconstruction thanks to two distinctive features: 1) a large number of RGBD sequences with precise 3D ground-truth annotations; and 2) RGBD images captured using mobile devices, so algorithms can be tested in a realistic setup for mobile AR applications.
MonoPerfCap is a benchmark dataset for human 3D performance capture from monocular video input consisting of around 40k frames, which covers a variety of different scenarios.
Understanding spatial relations (e.g., “laptop on table”) in visual input is important for both humans and robots. Existing datasets are insufficient as they lack large-scale, high-quality 3D ground-truth information, which is critical for learning spatial relations. Rel3D fills this gap: it is the first large-scale, human-annotated dataset for grounding spatial relations in 3D. Rel3D enables quantifying the effectiveness of 3D information in predicting spatial relations on large-scale human data. Moreover, the authors propose minimally contrastive data collection, a novel crowdsourcing method for reducing dataset bias. The 3D scenes in the dataset come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other. The authors empirically validate that minimally contrastive examples can diagnose issues with current relation detection models as well as lead to sample-efficient training. Code and data are available.
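A hypothetical record type illustrating how a minimally contrastive pair can be represented: the same (subject, relation, object) triple over two near-identical scenes with opposite truth values. Field names and IDs are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    """One minimally contrastive pair (hypothetical representation)."""
    subject: str          # e.g., "laptop"
    relation: str         # e.g., "on"
    obj: str              # e.g., "table"
    scene_positive: str   # scene ID where the relation holds
    scene_negative: str   # scene ID where it fails

pair = ContrastivePair("laptop", "on", "table", "scene_0421a", "scene_0421b")
```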
T$^3$Bench is the first comprehensive text-to-3D benchmark, containing diverse text prompts of three increasing complexity levels specially designed for 3D generation (300 prompts in total). To assess both subjective quality and text alignment, it proposes two automatic metrics based on multi-view images rendered from the generated 3D content. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and Large Language Model (LLM) evaluation to measure text-3D consistency.
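A rough sketch of the multi-view aggregation idea behind the quality metric. Here `clip_score` is a hypothetical text-image similarity callable, and the mean/std combination is a crude stand-in; the actual T$^3$Bench metric uses regional convolution to detect view inconsistency.

```python
import numpy as np

def multiview_quality(views, prompt, clip_score):
    """Aggregate per-view text-image scores for one generated 3D asset.

    `views`: rendered multi-view images; `clip_score(image, text)` is an
    assumed similarity function (e.g., CLIP), not the benchmark's API.
    """
    scores = np.array([clip_score(v, prompt) for v in views])
    # Mean rewards overall fidelity; std crudely penalizes view inconsistency.
    return scores.mean() - scores.std()
```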
Collected in the snow belt region of Michigan's Upper Peninsula, WADS is the first multi-modal dataset featuring dense point-wise labeled sequential LiDAR scans collected in severe winter weather.
3D-FRONT (3D Furnished Rooms with layOuts and semaNTics) is a large-scale, comprehensive repository of synthetic indoor scenes, highlighted by professionally designed layouts and a large number of rooms populated with high-quality, style-compatible textured 3D models. From layout semantics down to the texture details of individual objects, the dataset is freely available to the academic community and beyond.
5 PAPERS • NO BENCHMARKS YET
A Large Dataset of Object Scans is a dataset of more than ten thousand 3D scans of real objects. To create the dataset, the authors recruited 70 operators, equipped them with consumer-grade mobile 3D scanning setups, and paid them to scan objects in their environments. The operators scanned objects of their choosing, outside the laboratory and without direct supervision by computer vision professionals. The result is a large and diverse collection of object scans: from shoes, mugs, and toys to grand pianos, construction vehicles, and large outdoor sculptures. The authors worked with an attorney to ensure that data acquisition did not violate privacy constraints. The acquired data was placed in the public domain and is available freely.
DurLAR is a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near infrared) and reflectivity imagery for multi-modal autonomous driving applications, offering several novel features compared to existing autonomous driving task datasets.
FAUST-partial is a 3D registration benchmark dataset created to provide a more informative evaluation of 3D registration methods by addressing two main limitations of current 3D registration benchmarks.
5 PAPERS • 9 BENCHMARKS
Cancer in the region of the head and neck (HaN) is one of the most prominent cancers, for which radiotherapy represents an important treatment modality that aims to deliver a high radiation dose to the targeted cancerous cells while sparing the nearby healthy organs-at-risk (OARs). A precise three-dimensional spatial description, i.e. segmentation, of the target volumes as well as OARs is required for optimal radiation dose distribution calculation, which is primarily performed using computed tomography (CT) images. However, the HaN region contains many OARs that are poorly visible in CT but better visible in magnetic resonance (MR) images. Although attempts have been made towards the segmentation of OARs from MR images, so far there has been no evaluation of the impact that the combined analysis of CT and MR images has on the segmentation of OARs in the HaN region. The Head and Neck Organ-at-Risk Multi-Modal Segmentation Challenge aims to promote the development of new, and the application of existing, segmentation methods for this multi-modal setting.
Kitchen Scenes is a multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled and objects in the scenes are annotated with bounding boxes and in the 3D point cloud.
5 PAPERS • 1 BENCHMARK
PedX is a large-scale multi-modal collection of pedestrians at complex urban intersections. The dataset provides high-resolution stereo images and LiDAR data with manual 2D and automatic 3D annotations. The data was captured using two pairs of stereo cameras and four Velodyne LiDAR sensors.
QDax is a benchmark suite designed for Deep Neuroevolution in Reinforcement Learning domains for robot control. The suite includes the definition of tasks, environments, behavioral descriptors, and fitness. It specifies different benchmarks based on the complexity of both the task and the agent, which is controlled by a deep neural network. The benchmark uses standard Quality-Diversity metrics, including coverage, QD-score, maximum fitness, and an archive profile metric to quantify the relation between coverage and fitness.
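A minimal sketch of the standard Quality-Diversity metrics over a grid archive; the function and variable names are illustrative, not QDax's API, and NaN is assumed to mark empty cells.

```python
import numpy as np

def qd_metrics(fitness_grid):
    """Compute coverage, QD-score, and max fitness for a grid archive.

    `fitness_grid`: one fitness value per behavioral-descriptor cell,
    NaN where no solution has reached the cell.
    """
    filled = ~np.isnan(fitness_grid)
    coverage = filled.mean()             # fraction of cells reached
    qd_score = np.nansum(fitness_grid)   # summed fitness over the archive
    max_fitness = np.nanmax(fitness_grid)  # best solution found so far
    return coverage, qd_score, max_fitness
```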
The RAD-ChestCT dataset is a large medical imaging dataset developed by Duke MD/PhD Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. The public Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset. This dataset is of significant interest to the machine learning and medical imaging research communities.
SPACE is a simulator for physical interactions and causal learning in 3D environments. The SPACE simulator is used to generate the SPACE dataset, a synthetic video dataset in a 3D environment, to systematically evaluate physics-based models on a range of physical causal reasoning tasks. Inspired by daily object interactions, the SPACE dataset comprises videos depicting three types of physical events: containment, stability, and contact.
VIPER is a benchmark suite for visual perception. The benchmark is based on more than 250K high-resolution video frames, all annotated with ground-truth data for both low-level and high-level vision tasks, including optical flow, semantic instance segmentation, object detection and tracking, object-level 3D scene layout, and visual odometry. Ground-truth data for all tasks is available for every frame. The data was collected while driving, riding, and walking a total of 184 kilometers in diverse ambient conditions in a realistic virtual world.
MICCAI Challenge on Circuit Reconstruction from Electron Microscopy Images.
4 PAPERS • 1 BENCHMARK
Human Bodies in the Wild (HBW) is a validation and test set for body shape estimation. It consists of images taken in the wild and ground-truth 3D body scans in SMPL-X topology. To create HBW, we collect body scans of 35 participants and register the SMPL-X model to the scans. Further, each participant is photographed in various outfits and poses in front of a white background and uploads full-body photos of themselves taken in the wild. The validation and test set images are released; the ground-truth shape is released only for the validation set.
4 PAPERS • NO BENCHMARKS YET
HOD is a dataset for 3D object reconstruction which contains 35 objects, divided into two subsets named Sculptures and Daily Objects. The Sculptures subset contains five human sculptures with complex geometries and pure white textures. The Daily Objects subset consists of 30 daily objects with various shapes and appearances. All of the Sculptures and nine of the Daily Objects are paired with high-fidelity scanned meshes as ground-truth geometries for evaluation.