ShapeNetCore is a subset of the full ShapeNet dataset consisting of single, clean 3D models with manually verified category and alignment annotations. It covers 55 common object categories with about 51,300 unique 3D models. The 12 object categories of PASCAL 3D+, a popular computer vision 3D benchmark dataset, are all covered by ShapeNetCore.
145 PAPERS • 1 BENCHMARK
CoMA contains 17,794 meshes of the human face in various expressions.
69 PAPERS • 1 BENCHMARK
Scan2CAD is an alignment dataset based on 1,506 ScanNet scans with 97,607 annotated keypoint pairs between 14,225 (3,049 unique) CAD models from ShapeNet and their counterpart objects in the scans. The top three annotated model classes are chairs, tables, and cabinets, which reflects the nature of indoor scenes in ScanNet. The number of objects aligned per scene ranges from 1 to 40, with an average of 9.3.
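Given such annotated keypoint pairs, a CAD-to-scan alignment can be recovered in closed form. Below is a minimal sketch using the Umeyama method, assuming NumPy arrays of corresponding 3D keypoints; the function name and interface are illustrative, not Scan2CAD's official tooling.

    import numpy as np

    def estimate_similarity_transform(cad_pts, scan_pts):
        # Closed-form Umeyama fit: scale s, rotation R, translation t such that
        # s * R @ cad + t approximates scan. Inputs are (N, 3) corresponding points.
        mu_c, mu_s = cad_pts.mean(axis=0), scan_pts.mean(axis=0)
        xc, xs = cad_pts - mu_c, scan_pts - mu_s
        cov = xs.T @ xc / len(cad_pts)
        U, D, Vt = np.linalg.svd(cov)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:
            S[2, 2] = -1.0                  # prevent reflections
        R = U @ S @ Vt
        scale = np.trace(np.diag(D) @ S) / xc.var(axis=0).sum()
        t = mu_s - scale * (R @ mu_c)
        return scale, R, t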
56 PAPERS • 1 BENCHMARK
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
41 PAPERS • 1 BENCHMARK
Gait3D is a large-scale 3D representation-based gait recognition dataset. It contains 4,000 subjects and over 25,000 sequences extracted from 39 cameras in an unconstrained indoor scene.
29 PAPERS • 2 BENCHMARKS
Dataset containing RGB-D data of 4 large scenes, comprising a total of 12 rooms, for the purpose of RGB and RGB-D camera relocalization. The RGB-D data was captured using a Structure.io depth sensor coupled with an iPad color camera. Each room was scanned multiple times, and the multiple sequences were run through a global bundle adjustment in order to obtain globally aligned camera poses through all sequences of the same scene.
26 PAPERS • NO BENCHMARKS YET
BEAT provides i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), in which the above six modalities are modeled in a cascaded architecture for gesture synthesis. To evaluate semantic relevancy, we introduce a metric, Semantic Relevance Gesture Recall (SRGR). Qualitative and quantitative experiments demonstrate the validity of the metric, the quality of the ground-truth data, and the state-of-the-art performance of the baseline.
23 PAPERS • 1 BENCHMARK
The goal of this benchmark is to introduce a standard evaluation metric to measure the accuracy and robustness of 3D face reconstruction methods under variations in viewing angle, lighting, and common occlusions.
The REALY benchmark aims to introduce a region-aware evaluation pipeline to measure the fine-grained normalized mean square error (NMSE) of 3D face reconstruction methods from under-controlled image sets.
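As a rough illustration of a region-aware error, the sketch below computes a normalized mean squared vertex error over one face region. It assumes pre-aligned, corresponding vertex arrays and is not the REALY evaluation pipeline, which additionally performs region-wise alignment before measuring error.

    import numpy as np

    def region_nmse(pred_verts, gt_verts, region_idx):
        # pred_verts / gt_verts: (V, 3) corresponding vertices; region_idx: indices
        # of the vertices belonging to one face region (e.g. nose, mouth, cheek).
        p, g = pred_verts[region_idx], gt_verts[region_idx]
        err = np.mean(np.sum((p - g) ** 2, axis=1))                   # mean squared error
        scale = np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1))    # region scale
        return err / scale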
20 PAPERS • 2 BENCHMARKS
Shape matching plays an important role in geometry processing and shape analysis. In recent decades, much research has been devoted to improving the quality of matching between surfaces. This huge effort is motivated by several applications such as object retrieval, animation, and information transfer, just to name a few. Shape matching is usually divided into two main categories: rigid and non-rigid matching. In both cases, the standard evaluation is usually performed on shapes that share the same connectivity, in other words, shapes represented by the same mesh. This is mainly due to the availability of a “natural” ground truth that is given for these shapes. Indeed, in most cases the consistent connectivity directly induces a ground-truth correspondence between vertices. However, this standard practice obviously does not allow one to estimate the robustness of a method with respect to different connectivity. With this track, we propose a benchmark to evaluate the performance of point-to-point matching methods across shapes with different connectivity.
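A typical way to score a point-to-point map when ground-truth correspondences are available is the mean distance, on the target shape, between predicted and ground-truth matches. Geodesic distances are standard; the sketch below uses Euclidean distance as a simpler proxy, and the names are illustrative.

    import numpy as np

    def correspondence_error(pred_map, gt_map, target_verts):
        # pred_map / gt_map: per-source-vertex indices into the target mesh;
        # target_verts: (M, 3) vertex positions of the target shape.
        pred_pts = target_verts[pred_map]
        gt_pts = target_verts[gt_map]
        return float(np.linalg.norm(pred_pts - gt_pts, axis=1).mean())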
15 PAPERS • 1 BENCHMARK
The ScanNet200 benchmark studies 200-class 3D semantic segmentation, an order of magnitude more class categories than previous 3D scene understanding benchmarks. The source of scene data is identical to ScanNet, but it parses a larger vocabulary for semantic and instance segmentation.
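Semantic segmentation benchmarks of this kind are usually reported as mean intersection-over-union across classes. The sketch below shows a generic per-point mIoU computation over 200 classes; it is not the official ScanNet200 evaluator.

    import numpy as np

    def mean_iou(pred, gt, num_classes=200, ignore_label=-1):
        # pred and gt are 1-D arrays of per-point class ids.
        valid = gt != ignore_label
        pred, gt = pred[valid], gt[valid]
        ious = []
        for c in range(num_classes):
            union = np.sum((pred == c) | (gt == c))
            if union == 0:
                continue                    # class absent from both; skip it
            inter = np.sum((pred == c) & (gt == c))
            ious.append(inter / union)
        return float(np.mean(ious)), ious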
15 PAPERS • 2 BENCHMARKS
SSP-3D is an evaluation dataset consisting of 311 images of sportspersons in tight-fitting clothes, with a variety of body shapes and poses. The images were collected from the Sports-1M dataset. SSP-3D is intended for use as a benchmark for body shape prediction methods. Pseudo-ground-truth 3D shape labels (using the SMPL body model) were obtained via multi-frame optimisation with shape consistency between frames.
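The shape-consistency idea behind the pseudo-ground-truth labels can be illustrated with a simple penalty that keeps the per-frame SMPL shape vectors close to their sequence mean. This is only a conceptual sketch under assumed inputs, not the optimisation actually used to build SSP-3D.

    import numpy as np

    def shape_consistency_loss(betas_per_frame):
        # betas_per_frame: (num_frames, 10) SMPL shape parameters estimated per frame.
        betas = np.asarray(betas_per_frame, dtype=np.float64)
        mean_beta = betas.mean(axis=0, keepdims=True)
        return float(np.mean(np.sum((betas - mean_beta) ** 2, axis=1)))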
13 PAPERS • 1 BENCHMARK
OmniObject3D is a large-vocabulary 3D object dataset of massive, high-quality, real-scanned 3D objects.
10 PAPERS • NO BENCHMARKS YET
3D AffordanceNet is a dataset of 23k shapes for visual affordance. It consists of 56,307 well-defined affordance information annotations for 22,949 shapes covering 18 affordance classes and 23 semantic object categories.
8 PAPERS • 1 BENCHMARK
The CAPE dataset is a dynamic 3D dataset of clothed humans.
7 PAPERS • 1 BENCHMARK
Breaking Bad is a large-scale dataset of fractured objects. The dataset contains around 10k meshes from PartNet and Thingi10K. For each mesh, 20 fracture modes are pre-computed, and 80 fractures are then simulated from them, resulting in a total of 1M breakdown patterns. This dataset serves as a benchmark that enables the study of fractured object reassembly and presents new challenges for geometric shape understanding.
6 PAPERS • NO BENCHMARKS YET
BuildingNet is a large-scale dataset of 3D building models whose exteriors are consistently labeled. The dataset consists of 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models. The dataset covers several building categories, such as houses, churches, skyscrapers, town halls, libraries, and castles.
5 PAPERS • 1 BENCHMARK
The Habitat-Matterport 3D Semantics Dataset (HM3DSem) is the largest dataset of real-world indoor 3D spaces with densely annotated semantics available to the academic community. HM3DSem v0.2 consists of 142,646 object instance annotations across 216 3D spaces from HM3D and 3,100 rooms within those spaces. The HM3D scenes are annotated with the 142,646 raw object names, which are mapped to 40 Matterport categories. On average, each scene in HM3DSem v0.2 contains 661 objects from 106 categories. This dataset is the result of 14,200+ hours of human effort for annotation and verification by 20+ annotators.
5 PAPERS • NO BENCHMARKS YET
Articulated Mesh Animation (AMA) is a real-world dataset containing 10 mesh sequences depicting 3 different humans performing various actions.
4 PAPERS • NO BENCHMARKS YET
ARCTIC is a dataset of free-form interactions of hands and articulated objects. ARCTIC has 1.2M images paired with accurate 3D meshes for both hands and for objects that move and deform over time. The dataset also provides hand-object contact information.
The BIWI 3D corpus comprises a total of 1109 sentences uttered by 14 native English speakers (6 males and 8 females). A real-time 3D scanner and a professional microphone were used to capture the facial movements and the speech of the speakers. The dense dynamic face scans were acquired at 25 frames per second, and the RMS error of the 3D reconstruction is about 0.5 mm. In order to ease automatic speech segmentation, we carried out the recordings in an anechoic room with walls covered by sound-absorbing materials.
4 PAPERS • 1 BENCHMARK
The MMBody dataset provides human body data with motion capture, GT mesh, Kinect RGBD, and millimeter wave sensor data. See the homepage for more details.
Building3D is an urban-scale dataset consisting of more than 160,000 buildings covering about 998 km² across 16 cities in Estonia, with corresponding real-world LiDAR point clouds, mesh models, and wireframe models.
2 PAPERS • NO BENCHMARKS YET
The NVIDIA HOPE datasets consist of RGBD images and video sequences with labeled 6-DoF poses for 28 toy grocery objects. The toy grocery objects are readily available for purchase and have ideal size and weight for robotic manipulation. 3D textured meshes for generating synthetic training data are provided.
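6-DoF pose estimates on object sets like HOPE are commonly scored with the Average Distance of model points (ADD). The sketch below is a generic ADD computation under assumed inputs, not the dataset's official evaluation code.

    import numpy as np

    def add_error(model_pts, R_gt, t_gt, R_pred, t_pred):
        # model_pts: (N, 3) points sampled from the object mesh;
        # R_*: (3, 3) rotations, t_*: (3,) translations.
        gt = model_pts @ R_gt.T + t_gt
        pred = model_pts @ R_pred.T + t_pred
        return float(np.linalg.norm(gt - pred, axis=1).mean())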
Teeth3DS is the first public benchmark for intra-oral 3D scan analysis, created as part of the 3DTeethSeg 2022 MICCAI challenge to boost the research field and inspire the 3D vision research community to work on tasks such as teeth identification, segmentation, labeling, 3D modeling, and 3D reconstruction. Teeth3DS is made of 1,800 intra-oral scans (23,999 annotated teeth) collected from 900 patients, covering the upper and lower jaws separately, acquired and validated by orthodontists/dental surgeons with more than 5 years of professional experience.
It comprises synthetic mesh sequences from Deformation Transfer for Triangle Meshes.
1 PAPER • 1 BENCHMARK
A dataset of 100K synthetic images of skin lesions, ground-truth (GT) segmentations of lesions and healthy skin, GT segmentations of seven body parts (head, torso, hips, legs, feet, arms and hands), and GT binary masks of non-skin regions in the texture maps of 215 scans from the 3DBodyTex.v1 dataset [2], [3] created using the framework described in [1]. The dataset is primarily intended to enable the development of skin lesion analysis methods. Synthetic image creation consisted of two main steps. First, skin lesions from the Fitzpatrick 17k dataset were blended onto skin regions of high-resolution three-dimensional human scans from the 3DBodyTex dataset [2], [3]. Second, two-dimensional renders of the modified scans were generated.
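The first step (blending a lesion onto a scan's texture map) can be pictured as a simple alpha blend of a lesion patch at a chosen texture location. The sketch below is only illustrative; the framework described in [1] blends far more carefully, accounting for skin tone and texture-space distortion.

    import numpy as np

    def blend_lesion(texture, lesion_rgb, lesion_alpha, y, x):
        # texture: (H, W, 3) uint8 texture map; lesion_rgb: (h, w, 3) lesion patch;
        # lesion_alpha: (h, w) blend weights in [0, 1]; (y, x): top-left placement.
        h, w = lesion_rgb.shape[:2]
        region = texture[y:y + h, x:x + w].astype(np.float32)
        a = lesion_alpha[..., None].astype(np.float32)
        blended = a * lesion_rgb.astype(np.float32) + (1.0 - a) * region
        texture[y:y + h, x:x + w] = blended.astype(texture.dtype)
        return texture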
1 PAPER • NO BENCHMARKS YET
The volumetric representation of human interactions is one of the fundamental domains in the development of immersive media productions and telecommunication applications. Particularly in the context of the rapid advancement of Extended Reality (XR) applications, volumetric data has proven to be an essential technology for future XR development. In this work, we present a new multimodal database to help advance the development of immersive technologies. Our proposed database provides ethically compliant and diverse volumetric data, in particular 27 participants displaying posed facial expressions and subtle body movements while speaking, plus 11 participants wearing head-mounted displays (HMDs). The recording system consists of a volumetric capture (VoCap) studio, including 31 synchronized modules with 62 RGB cameras and 31 depth cameras. In addition to textured meshes, point clouds, and multi-view RGB-D data, we use one Lytro Illum camera to provide light field (LF) data simultaneously.
The HOPE-Video dataset contains 10 video sequences (2,038 frames) with 5-20 objects in a tabletop scene captured by a robot-arm-mounted RealSense D415 RGBD camera. In each sequence, the camera is moved to capture multiple views of a set of objects in the robotic workspace. First, COLMAP was applied to refine the camera poses (keyframes at 6 fps) provided by forward kinematics and RGB calibration from the RealSense to Baxter's wrist camera. A dense 3D point cloud was then generated via CascadeStereo (included for each sequence in 'scene.ply'). Ground-truth poses for the HOPE object models in the world coordinate system were annotated manually using the CascadeStereo point clouds, and these annotations are provided for each frame.
Hi4D contains 4D textured scans of 20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D contains rich interaction-centric annotations in 2D and 3D alongside accurately registered parametric body models.
KITTI-6DoF is a dataset that contains annotations for the 6DoF estimation task for 5 object categories on 7,481 frames.
The ObjectFolder Real dataset contains multisensory data collected from 100 real-world household objects. The visual data for each object include three high-quality 3D meshes of different resolutions and an HD video recording of the object rotating in a lightbox; the acoustic data for each object include impact sound recordings recorded at 30–50 points on the object, each of which is 6 s long and is accompanied by the coordinate of the striking location on the object mesh, the ground-truth contact force profile, and the accompanying video of the impact. The tactile data for each object include tactile readings at the same 30–50 points on the object, with each tactile reading as a video of the tactile RGB images that record the entire gel deformation process, accompanied by two videos of the contact process from an in-hand camera and a third-view camera.
PLAD is a dataset in which sparse depth is provided by line-based visual SLAM, created to validate StructMDC.
PaintNet is a dataset for learning robotic spray painting of free-form 3D objects. PaintNet includes more than 800 object meshes and the associated painting strokes collected in a real industrial setting.
Psychological trait estimation from external factors such as movement and appearance is a challenging and long-standing problem in psychology, and is principally based on the psychological theory of embodiment. To date, attempts to tackle this problem have utilized private small-scale datasets with intrusive body-attached sensors. Potential applications of an automated system for psychological trait estimation include estimation of occupational fatigue and psychology, and marketing and advertisement. In this work, we propose PsyMo (Psychological traits from Motion), a novel, multi-purpose and multi-modal dataset for exploring psychological cues manifested in walking patterns. We gathered walking sequences from 312 subjects in 7 different walking variations and 6 camera angles. In conjunction with walking sequences, participants filled in 6 psychological questionnaires, totalling 17 psychometric attributes related to personality, self-esteem, fatigue, aggressiveness, and mental health.
The Robot Tracking Benchmark (RTB) is a synthetic dataset that facilitates the quantitative evaluation of 3D tracking algorithms for multi-body objects. It was created using the procedural rendering pipeline BlenderProc. The dataset contains photo-realistic sequences with HDRi lighting and physically-based materials. Perfect ground-truth annotations for camera and robot trajectories are provided in the BOP format. Many physical effects, such as motion blur, rolling shutter, and camera shaking, are accurately modeled to reflect real-world conditions. For each frame, four depth qualities exist to simulate sensors with different characteristics. While the first quality provides perfect ground truth, the second considers measurements with the distance-dependent noise characteristics of the Azure Kinect time-of-flight sensor. Finally, for the third and fourth quality, two stereo RGB images with and without a pattern from a simulated dot projector were rendered. Depth images were then reconstructed from these stereo images.
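A very rough picture of a distance-dependent depth-noise model (the second quality level) is sketched below; the coefficients and quadratic growth are assumptions for illustration, not the Azure Kinect model used by the dataset.

    import numpy as np

    def add_distance_dependent_noise(depth_m, base_sigma=0.001, k=0.002, rng=None):
        # depth_m: depth image in meters; noise std grows with distance.
        rng = np.random.default_rng() if rng is None else rng
        sigma = base_sigma + k * depth_m ** 2
        noisy = depth_m + rng.normal(0.0, 1.0, depth_m.shape) * sigma
        noisy[depth_m <= 0] = 0.0           # keep invalid pixels invalid
        return noisy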
Robot@Home2 is an enhanced version aimed at improving usability and functionality for developing and testing mobile robotics and computer vision algorithms. Robot@Home2 consists of three main components. Firstly, a relational database that stores the contextual information and data links and is compatible with the Structured Query Language (SQL). Secondly, a Python package for managing the database, including downloading, querying, and interfacing functions. Finally, learning resources in the form of Jupyter notebooks, runnable locally or on the Google Colab platform, enabling users to explore the dataset without local installations. These freely available tools are expected to enhance the ease of exploiting the Robot@Home dataset and accelerate research in computer vision and robotics.
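Because the database is SQL-compatible, it can also be inspected with standard tooling in addition to the official Python package. The snippet below lists the tables of a local copy via sqlite3; the file name is a placeholder, and the actual schema should be taken from the package documentation.

    import sqlite3

    conn = sqlite3.connect("robotathome.db")   # placeholder path to the local database
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    print(tables)                              # enumerate available tables
    conn.close()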
A Benchmark Dataset for Deep Learning-based Methods for 3D Topology Optimization.
TDMD contains eight reference DCM objects with six typical distortions. Using processed video sequences (PVS) derived from the DCMs, the authors conducted a large-scale subjective experiment that resulted in 303 distorted DCM samples with mean opinion scores, making TDMD the largest available DCM database to the authors' knowledge.
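Databases with mean opinion scores are typically used to validate objective quality metrics by correlating metric outputs with the MOS. The sketch below shows the usual Pearson/Spearman correlations; it is a generic recipe, not TDMD's official protocol.

    from scipy.stats import pearsonr, spearmanr

    def correlation_with_mos(objective_scores, mos):
        # objective_scores and mos are equal-length 1-D sequences over the samples.
        plcc, _ = pearsonr(objective_scores, mos)    # linear correlation
        srocc, _ = spearmanr(objective_scores, mos)  # rank-order correlation
        return plcc, srocc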
The RBO dataset of articulated objects and interactions is a collection of 358 RGB-D video sequences (67:18 minutes) of humans manipulating 14 articulated objects under varying conditions (light, perspective, background, interaction). All sequences are annotated with ground truth of the poses of the rigid parts and the kinematic state of the articulated object (joint states) obtained with a motion capture system. We also provide complete kinematic models of these objects (kinematic structure and three-dimensional textured shape models). In 78 sequences the contact wrenches during the manipulation are also provided.
The dataset contains procedurally generated images of transparent vessels containing liquids and objects. The data for each image include segmentation maps, 3D depth maps, and normal maps of the liquid or object inside the transparent vessel, as well as of the vessel itself. In addition, the properties of the materials inside the containers are given (color/transparency/roughness/metalness). A natural-image benchmark for the 3D/depth estimation of objects inside transparent containers is also supplied, as are 3D models of the objects (glTF).
A dataset for the study of voice and 3D face structure. It contains about 1.4K identities with their 3D face models and voice data. The 3D face models are fitted from VGGFace images using BFM 3D models, and the voice data are processed from VoxCeleb.
The provided dataset consists of high-quality realistic head models and combined EEG/MEG data which can be used for state-of-the-art methods in brain research, such as modern finite element methods (FEM) to compute the EEG/MEG forward problems using the software toolbox DUNEuro (http://duneuro.org).
Please see our website and code repository for a detailed description.
Reflectance measurements of Bidirectional Texture Functions (BTFs)
0 PAPER • NO BENCHMARKS YET
The field of biomechanics is at a turning point, with marker-based motion capture set to be replaced by portable and inexpensive hardware, rapidly improving markerless tracking algorithms, and open datasets that will turn these new technologies into field-wide team projects. To expedite progress in this direction, we have collected the CMU Panoptic Dataset 2.0, which contains 86 subjects captured with 140 VGA cameras, 31 HD cameras, and 15 IMUs, performing on average 6.5 min of activities, including range of motion activities and tasks of daily living.
InfiniteRep is a synthetic, open-source dataset for fitness and physical therapy (PT) applications. It includes 1k videos of diverse avatars performing multiple repetitions of common exercises. It includes significant variation in the environment, lighting conditions, avatar demographics, and movement trajectories. From cadence to kinematic trajectory, each rep is done slightly differently -- just like real humans. InfiniteRep videos are accompanied by a rich set of pixel-perfect labels and annotations, including frame-specific repetition counts.