To the best of our knowledge, MDIF is the first deep implicit function model that can simultaneously (1) represent different levels of detail and allow progressive decoding; (2) support both encoder-decoder inference and decoder-only latent optimization, enabling multiple applications; and (3) perform detailed decoder-only shape completion.
Another approach is to concatenate all the modalities into a tuple and then contrast positive and negative tuple correspondences.
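A minimal sketch of such a tuple-based contrastive objective (not the paper's implementation; the embedding dimensionality, batch construction, and temperature are assumptions) contrasts corresponding tuple embeddings against all other pairings in a batch:

```python
import torch
import torch.nn.functional as F

def tuple_contrastive_loss(tuple_a, tuple_b, temperature=0.07):
    """Contrast corresponding tuple embeddings: row i of tuple_a matches
    row i of tuple_b; all other pairings in the batch act as negatives."""
    a = F.normalize(tuple_a, dim=-1)
    b = F.normalize(tuple_b, dim=-1)
    logits = a @ b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embeddings of two concatenated modality tuples for a batch of 8.
emb_a, emb_b = torch.randn(8, 128), torch.randn(8, 128)
loss = tuple_contrastive_loss(emb_a, emb_b)
```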
The ability to communicate intention enables decentralized multi-agent robots to collaborate while performing physical tasks.
Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes.
Self-supervised representation learning is a critical problem in computer vision: it provides a way to pretrain feature extractors on large unlabeled datasets, and these pretrained extractors can then serve as initializations for more efficient and effective training on downstream tasks.
Localizing the camera in a known indoor environment is a key building block for scene mapping, robot navigation, AR, etc.
We propose the task of forecasting characteristic 3D poses: given a monocular video observation of a person, predict a future 3D pose of that person in a likely action-defining, characteristic pose; for instance, from observing a person reaching for a banana, predict the pose of that person eating the banana.
Many applications in 3D shape design and augmentation require the ability to make specific edits to an object's semantic parameters (e.g., the pose of a person's arm or the length of an airplane's wing) while preserving as much of the existing detail as possible.
A common dilemma in 3D object detection for autonomous driving is that high-quality, dense point clouds are only available during training, but not testing.
Features from multiple per-view predictions are finally fused on 3D mesh vertices to predict mesh semantic segmentation labels.
We use a U-Net style 3D sparse convolution network to extract features for each frame's LiDAR point cloud.
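A minimal dense stand-in for such a backbone is sketched below (the actual sparse-convolution library, layer sizes, and voxelization are assumptions; real LiDAR frames would be voxelized and processed with sparse operations):

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    """Dense stand-in for a U-Net style 3D (sparse) convolution backbone:
    one downsampling stage, one upsampling stage, and a skip connection."""
    def __init__(self, in_ch=1, feat=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(in_ch, feat, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv3d(feat, 2 * feat, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose3d(2 * feat, feat, 2, stride=2)
        self.head = nn.Conv3d(2 * feat, feat, 3, padding=1)  # after skip concat

    def forward(self, vox):
        e = self.enc(vox)                       # full-resolution features
        d = self.down(e)                        # half-resolution features
        u = self.up(d)                          # back to full resolution
        return self.head(torch.cat([e, u], dim=1))

# Toy usage: a 32^3 occupancy grid voxelized from one LiDAR frame.
features = TinyUNet3D()(torch.randn(1, 1, 32, 32, 32))  # -> (1, 16, 32, 32, 32)
```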
We present a simple and flexible object detection framework optimized for autonomous driving.
We study an unsupervised domain adaptation problem for the semantic labeling of 3D point clouds, with a particular focus on domain discrepancies induced by different LiDAR sensors.
Typical end-to-end formulations for learning robotic navigation involve predicting a small set of steering command actions (e.g., step forward, turn left, turn right, etc.).
In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes.
Then, we use the decoder as a component in a shape optimization that solves for a set of latent codes on a regular grid of overlapping crops such that an interpolation of the decoded local shapes matches a partial or noisy observation.
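The following sketch illustrates this decoder-only optimization idea under simplified assumptions (the toy decoder, the naive averaging in place of proper crop interpolation, and all hyperparameters are placeholders, not the paper's method):

```python
import torch

class ToyCropDecoder(torch.nn.Module):
    """Stand-in for a trained local crop decoder: (xyz, latent code) -> SDF."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.fc = torch.nn.Linear(3 + code_dim, 1)

    def forward(self, pts, codes):
        return self.fc(torch.cat([pts, codes], dim=-1)).squeeze(-1)

def fit_latent_grid(decoder, query_pts, observed_sdf, grid_shape=(4, 4, 4),
                    code_dim=64, steps=200, lr=1e-2):
    """Decoder-only optimization: freeze the decoder and solve for one latent
    code per grid cell so the decoded (here naively averaged, rather than
    interpolated across overlapping crops) shapes match a partial observation."""
    for p in decoder.parameters():              # keep the decoder frozen
        p.requires_grad_(False)
    codes = torch.zeros(*grid_shape, code_dim, requires_grad=True)
    opt = torch.optim.Adam([codes], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        blended = codes.view(-1, code_dim).mean(dim=0, keepdim=True)
        pred = decoder(query_pts, blended.expand(query_pts.size(0), -1))
        loss = torch.nn.functional.l1_loss(pred, observed_sdf)
        loss.backward()
        opt.step()
    return codes.detach()

# Toy usage: 256 query points from a partial scan with placeholder SDF targets.
codes = fit_latent_grid(ToyCropDecoder(), torch.rand(256, 3), torch.zeros(256))
```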
In this work, we present a novel approach for color texture generation using a conditional adversarial loss obtained from weakly-supervised views.
The goal of this project is to learn a 3D shape representation that enables accurate surface reconstruction, compact storage, efficient computation, consistency for similar shapes, generalization across diverse shape categories, and inference from depth camera observations.
A key aspect of our grasping model is that it uses "action-view" based rendering to simulate future states with respect to different possible actions.
In depth-sensing applications ranging from home robotics to AR/VR, it will be common to acquire 3D scans of interior spaces repeatedly at sparse time intervals (e.g., as part of regular daily use).
To allow for widely varying geometry and topology, we choose an implicit surface representation based on composition of local shape elements.
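One simple instance of composing local shape elements, assuming isotropic Gaussian elements and a weighted-sum composition (the actual element parameterization may differ), is:

```python
import torch

def composed_implicit(query, centers, radii, weights):
    """Evaluate an implicit function built by composing local shape elements:
    each element is an isotropic Gaussian blob, and the global function is
    their weighted sum (the surface is an iso-level of this field)."""
    # query: (N, 3), centers: (K, 3), radii/weights: (K,)
    d2 = ((query[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K)
    influence = torch.exp(-d2 / (2.0 * radii[None, :] ** 2))        # (N, K)
    return (weights[None, :] * influence).sum(-1)                   # (N,)

# Toy usage: two local elements; points near a center get large field values.
q = torch.rand(1000, 3)
field = composed_implicit(q,
                          centers=torch.tensor([[0.2, 0.2, 0.2], [0.8, 0.8, 0.8]]),
                          radii=torch.tensor([0.1, 0.15]),
                          weights=torch.tensor([1.0, 1.0]))
```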
In this work, we introduce the novel problem of identifying dense canonical 3D coordinate frames from a single RGB image.
In this work, we propose an end-to-end formulation that jointly learns to infer control parameters for grasping and throwing motion primitives from visual observations (images of arbitrary objects in a bin) through trial and error.
We introduce TextureNet, a neural network architecture designed to extract features from high-resolution signals associated with 3D surface meshes (e.g., color texture maps).
We propose a new procedure to guide training of a data-driven shape generative model using a structure-aware loss function.
In this paper, we present ActiveStereoNet, the first deep learning solution for active stereo systems.
We present Im2Pano3D, a convolutional neural network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360° panoramic view of an indoor scene when given only a partial observation (<= 50%) in the form of an RGB-D image.
Skilled robotic manipulation benefits from complex synergies between non-prehensile (e.g., pushing) and prehensile (e.g., grasping) actions: pushing can help rearrange cluttered objects to make space for arms and fingers; likewise, grasping can help displace objects to make pushing movements more precise and collision-free.
To this end, we first learn joint embeddings of freeform text descriptions and colored 3D shapes.
We introduce a novel RGB-D patch descriptor designed for detecting coplanar surfaces in SLAM reconstruction.
We present MINOS, a simulator designed to support the development of multisensory models for goal-directed navigation in complex indoor environments.
1 code implementation • 17 Oct 2017 • Li Yi, Lin Shao, Manolis Savva, Haibin Huang, Yang Zhou, Qirui Wang, Benjamin Graham, Martin Engelcke, Roman Klokov, Victor Lempitsky, Yuan Gan, Pengyu Wang, Kun Liu, Fenggen Yu, Panpan Shui, Bingyang Hu, Yan Zhang, Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Minki Jeong, Jaehoon Choi, Changick Kim, Angom Geetchandra, Narasimha Murthy, Bhargava Ramu, Bharadwaj Manda, M. Ramanathan, Gautam Kumar, P Preetham, Siddharth Srivastava, Swati Bhugra, Brejesh lall, Christian Haene, Shubham Tulsiani, Jitendra Malik, Jared Lafer, Ramsey Jones, Siyuan Li, Jie Lu, Shi Jin, Jingyi Yu, Qi-Xing Huang, Evangelos Kalogerakis, Silvio Savarese, Pat Hanrahan, Thomas Funkhouser, Hao Su, Leonidas Guibas
We introduce a large-scale 3D shape understanding benchmark using data and annotations from the ShapeNet 3D object database.
2 code implementations • 3 Oct 2017 • Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R. Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, Alberto Rodriguez
Since product images are readily available for a wide range of objects (e.g., from the web), the system works out-of-the-box for novel objects without requiring any additional training data.
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms.
This paper proposes the idea of using a generative adversarial network (GAN) to assist a novice user in designing real-world shapes with a simple interface.
Convolutional networks for image classification progressively reduce resolution until the image is represented by tiny feature maps in which the spatial structure of the scene is no longer discernible.
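A small illustration of this resolution collapse, and of dilated convolution as one way to enlarge the receptive field without shrinking the feature map (layer counts and sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Resolution collapse: repeated stride-2 convolutions shrink a 224x224 image
# to a 7x7 feature map, where little spatial structure of the scene remains.
x = torch.randn(1, 3, 224, 224)
for _ in range(5):
    x = nn.Conv2d(x.shape[1], 16, kernel_size=3, stride=2, padding=1)(x)
print(x.shape)   # torch.Size([1, 16, 7, 7])

# A dilated convolution instead enlarges the receptive field while keeping
# the full 224x224 resolution (stride 1, dilation 2, padding 2).
y = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding=2)(torch.randn(1, 3, 224, 224))
print(y.shape)   # torch.Size([1, 16, 224, 224])
```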
We provide a search algorithm that generates a sampling of likely candidate views according to the example distribution, and a set selection algorithm that chooses a subset of the candidates that jointly cover the example distribution.
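A greedy sketch of the set-selection step, assuming each candidate view's coverage of a discretized example distribution has been precomputed (the coverage representation and scoring are simplifications of the described algorithm):

```python
def select_views(candidates, coverage, k):
    """Greedily pick k candidate views whose combined coverage of the example
    distribution is maximal; coverage[i] is the set of example bins covered
    by candidate view i."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(
            (i for i in range(len(candidates)) if i not in chosen),
            key=lambda i: len(coverage[i] - covered),
            default=None,
        )
        if best is None or not (coverage[best] - covered):
            break  # no remaining candidate adds new coverage
        chosen.append(best)
        covered |= coverage[best]
    return [candidates[i] for i in chosen]

# Toy usage: 4 candidate views covering bins of a discretized view distribution.
views = ["v0", "v1", "v2", "v3"]
bins = [{0, 1}, {1, 2, 3}, {3}, {4, 5}]
print(select_views(views, bins, k=2))   # -> ['v1', 'v3']
```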
A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets.
One of the bottlenecks in training for better representations is the amount of available per-pixel ground truth data that is required for core scene understanding tasks such as semantic segmentation, normal prediction, and object edge detection.
This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation.
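The output structure of this task can be sketched as two per-voxel prediction heads (the backbone that lifts the single-view depth map to a feature volume, the class count, and the volume size below are assumptions):

```python
import torch
import torch.nn as nn

class SceneCompletionHead(nn.Module):
    """Sketch of the task's outputs: for every voxel in the scene volume,
    predict occupancy and a distribution over semantic classes (the network
    that maps a single-view depth map to this feature volume is omitted)."""
    def __init__(self, feat=16, num_classes=12):       # assumed class count
        super().__init__()
        self.occupancy = nn.Conv3d(feat, 1, kernel_size=1)
        self.semantics = nn.Conv3d(feat, num_classes, kernel_size=1)

    def forward(self, volume_features):
        return self.occupancy(volume_features), self.semantics(volume_features)

# Toy usage on an assumed 60x36x60 feature volume covering the scene.
occ, sem = SceneCompletionHead()(torch.randn(1, 16, 60, 36, 60))
```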
To amass training data for our model, we propose a self-supervised feature learning method that leverages the millions of correspondence labels found in existing RGB-D reconstructions.
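A hedged sketch of how such correspondence labels can supervise a local descriptor, here with a simple triplet loss (the patch encoder, margin, and batch construction are assumptions):

```python
import torch
import torch.nn.functional as F

def correspondence_triplet_loss(anchor, positive, negative, margin=1.0):
    """Descriptors of two patches observed at the same 3D location in an RGB-D
    reconstruction (anchor/positive) should be closer than descriptors of
    patches from different locations (negative)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage: batches of 32 descriptors from an (assumed) patch encoder.
a, p, n = (torch.randn(32, 128) for _ in range(3))
loss = correspondence_triplet_loss(a, p, n)
```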
10 code implementations • 9 Dec 2015 • Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, Fisher Yu
We present ShapeNet: a richly-annotated, large-scale repository of shapes represented by 3D CAD models of objects.
While there has been remarkable progress in the performance of visual recognition algorithms, the state-of-the-art models tend to be exceptionally data-hungry.
This paper describes an automatic algorithm for global alignment of LiDAR data collected with Google Street View cars in urban environments.