A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition.
Specifically, FixNet consists of a perception module to extract the structured representation from the 3D point cloud, a physical dynamics prediction module to simulate the results of interactions on 3D objects, and a functionality prediction module to evaluate the functionality and choose the correct fix.
Our network achieves top performance in human motion prediction on the proposed dataset, thanks to the intent information provided by gaze and the denoised gaze feature modulated by motion.
Manipulating volumetric deformable objects in the real world, such as plush toys and pizza dough, brings substantial challenges due to infinite shape variations, non-rigid motions, and partial observability.
ConDor is a self-supervised method that learns to Canonicalize the 3D orientation and position for full and partial 3D point clouds.
We describe a method to deal with performance drop in semantic segmentation caused by viewpoint changes within multi-camera systems, where temporally paired images are readily available, but the annotations may only be abundant for a few typical views.
Scenario generation is formulated as an optimization in the latent space of this traffic model, perturbing an initial real-world scene to produce trajectories that collide with a given planner.
We present SyNoRiM, a novel way to jointly register multiple non-rigid shapes by synchronizing the maps relating learned functions defined on the point clouds.
We study the problem of inferring an object-centric scene representation from a single image, aiming to derive a representation that explains the image formation process, captures the scene's 3D nature, and is learned without supervision.
A fundamental problem in equivariant deep learning is to design activation functions which are both informative and preserve equivariance.
For the first time, we propose a unified framework that can handle 9DoF pose tracking for novel rigid object instances as well as per-part pose tracking for articulated objects from known categories.
We propose a data-driven scene flow estimation algorithm exploiting the observation that many 3D scenes can be explained by a collection of agents moving as rigid bodies.
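Under the rigid-body assumption, each agent's motion is a single rotation plus translation, and given point correspondences for one agent that transform can be recovered in closed form. The classic Kabsch algorithm is a standard way to do this; the sketch below is for illustration only and is not the paper's specific pipeline:

```python
import numpy as np

def kabsch(src, dst):
    """Best-fit rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c = src - src.mean(axis=0)          # center both point sets
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```

Running this per hypothesized rigid segment turns a dense scene-flow field into a small set of agent motions.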
On KITTI, we are the first to demonstrate semi-supervised 3D object detection, and our method surpasses a fully supervised baseline by 1.8% to 7.6% under different label ratios and categories.
Such a space naturally allows the disentanglement of geometric style (coming from the source) and structural pose (conforming to the target).
Grid vertices in a computational fluid dynamics (CFD) domain are viewed as point clouds and used as inputs to a neural network based on the PointNet architecture, which learns an end-to-end mapping between spatial positions and CFD quantities.
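A minimal sketch of such a position-to-quantity mapping, assuming a toy PointNet-style forward pass with random weights (the actual network is deeper and trained on CFD data; all names and dimensions here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def pointnet_forward(xyz, W1, W2, W3):
    """Toy PointNet-style map: per-point features, order-invariant
    global max-pool, then per-point regression of flow quantities."""
    h = np.tanh(xyz @ W1)                           # per-point local features
    g = h.max(axis=0)                               # global feature (max-pool)
    h2 = np.concatenate([h, np.broadcast_to(g, h.shape)], axis=1)
    h3 = np.tanh(h2 @ W2)
    return h3 @ W3                                  # per-point CFD quantities

N = 128
xyz = rng.standard_normal((N, 3))                   # stand-in for grid vertices
W1 = rng.standard_normal((3, 16)) * 0.1
W2 = rng.standard_normal((32, 16)) * 0.1
W3 = rng.standard_normal((16, 4)) * 0.1             # e.g. (u, v, w, p)
out = pointnet_forward(xyz, W1, W2, W3)
```

The max-pool makes the global feature invariant to vertex ordering, which is why an unstructured CFD grid can be consumed directly as a point cloud.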
The former aims to recover the surface of the point cloud through an implicit function, while the latter encourages evenly distributed points.
We investigate the problem of learning to generate 3D parametric surface representations for novel object instances, as seen from one or more views.
While significant progress has been made, especially with recent deep generative models, it remains a challenge to synthesize high-quality shapes with rich geometric details and complex structure, in a controllable manner.
We propose CaSPR, a method to learn object-centric Canonical Spatiotemporal Point Cloud Representations of dynamically moving or evolving objects.
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles.
To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes.
We present an approach for aggregating a sparse set of views of an object in order to compute a semi-implicit 3D representation in the form of a volumetric feature grid.
We further study how different evaluation metrics weigh the sampling pattern against the geometry and propose several perceptual metrics that form a sampling spectrum.
To achieve this task, a simulated environment with physically realistic simulation, sufficient articulated objects, and transferability to the real robot is indispensable.
3D generative shape modeling is a fundamental research area in computer vision and interactive computer graphics, with many real-world applications.
When learning to sketch, beginners start with simple and flexible shapes, and then gradually strive for more complex and accurate ones in the subsequent training sessions.
Compared to prior work on multi-modal detection, we explicitly extract both geometric and semantic features from the 2D images.
A recent trend in optimizing maps such as dense correspondences between objects or neural networks between pairs of domains is to optimize them jointly.
Learning to encode differences in the geometry and (topological) structure of the shapes of ordinary objects is key to generating semantically plausible variations of a given shape, transferring edits from one shape to another, and many other applications in 3D content creation.
A complex visual navigation task puts an agent in different situations which call for a diverse range of visual perception abilities.
We introduce StructureNet, a hierarchical graph network which (i) can directly encode shapes represented as such n-ary graphs; (ii) can be robustly trained on large and complex shape families; and (iii) can be used to generate a great diversity of realistic structured shape geometries.
We investigate the problem of learning category-specific 3D shape reconstruction from a variable number of RGB views of previously unobserved object instances.
Topology applied to real world data using persistent homology has started to find applications within machine learning, including deep learning.
We also find that these models are amenable to zero-shot transfer learning to novel object classes (e.g., transfer from training on chairs to testing on lamps), as well as to real-world images drawn from furniture catalogs.
An important question in task transfer learning is to determine task transferability, i.e., given a common input domain, estimating to what extent representations learned from a source task can help in learning a target task.
Current 3D object detection methods are heavily influenced by 2D detectors.
Furthermore, these locations are continuous in space and can be learned by the network.
Reconstruction of geometry based on different input modes, such as images or point clouds, has been instrumental in the development of computer aided design and computer graphics.
The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image.
In this work, we focus on predicting the dynamics of 3D rigid objects, in particular an object's final resting position and total rotation when subjected to an impulsive force.
We present PartNet: a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information.
We present an approach to inform the reconstruction of a surface from a point scan through topological priors.
Real-life man-made objects often exhibit strong and easily-identifiable structure, as a direct result of their design or their intended functionality.
To achieve this, we use intermediate nonlinear embedding spaces, computed individually on every shape; the embedding functions use ideas from diffusion geometry and capture how different descriptors on the same shape interrelate.
In addition, we also find that a progressive training strategy can foster a better neural network for the video recognition task than blindly pooling the distinct sources of geometry cues together.
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes.
By exploiting metric space distances, our network is able to learn local features with increasing contextual scales.
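Grouping points by metric-space distance at a chosen radius is the basic operation behind such multi-scale local features: larger radii yield larger contextual scales. A minimal ball-query sketch (a hypothetical helper, not the paper's implementation):

```python
import numpy as np

def ball_query(points, centroids, radius, max_k=32):
    """For each centroid, return indices of up to max_k points
    lying within `radius` in Euclidean (metric-space) distance."""
    # Pairwise distances: (num_centroids, num_points)
    d = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=-1)
    groups = []
    for row in d:
        idx = np.flatnonzero(row < radius)[:max_k]  # cap group size
        groups.append(idx)
    return groups
```

Calling this with progressively larger radii over subsampled centroids gives the nested neighborhoods from which features at increasing contextual scales are learned.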
A point cloud is an important type of geometric data structure.
We present a learning framework for abstracting complex shapes by learning to assemble objects using 3D volumetric primitives.
We introduce a new general representation for proximal interactions among physical objects that is agnostic to the type of objects or interaction involved.
Each field probing filter is a set of probing points --- sensors that perceive the space.
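To illustrate, a probing filter evaluates a volumetric field at a set of point locations; in the method those locations are learned, but a fixed nearest-voxel sketch (assuming coordinates in the unit cube) conveys the idea:

```python
import numpy as np

def probe(field, pts):
    """Evaluate a cubic voxel field at continuous probing-point
    locations via nearest-voxel lookup (pts assumed in [0, 1)^3)."""
    res = field.shape[0]
    idx = np.clip((pts * res).astype(int), 0, res - 1)  # continuous -> voxel
    return field[idx[:, 0], idx[:, 1], idx[:, 2]]
```

Because the probing locations enter through a differentiable sampling step in the full method (trilinear rather than nearest-voxel), they can be optimized jointly with the rest of the network.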
Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations.
Comparing two images from different views has been a long-standing challenge in computer vision, as visual features are not stable under large viewpoint changes.
Joint segmentation of image sets is a challenging problem, especially when there are multiple objects with variable appearance shared among the images in the collection and the set of objects present in each particular image is itself varying and unknown.
Joint matching over a collection of objects aims at aggregating information from a large collection of similar instances (e.g., images, graphs, shapes) to improve maps between pairs of them.