Recent advancements in open-world 3D object generation have been remarkable, with image-to-3D methods offering superior fine-grained control over their text-to-3D counterparts.
no code implementations • 3 Nov 2023 • Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, Ted Xiao
Generalization remains one of the most important desiderata for robust robot learning systems.
Large Language Models (LLMs) have achieved tremendous progress, yet they still often struggle with challenging reasoning problems.
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model.
We report Zero123++, an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view.
Existing works rely on the independence assumption of points in the radiance field or the pixels in input views to obtain tractable forms of the probability density function.
In contrast to TensoRF which uses a global tensor and focuses on their vector-matrix decomposition, we propose to utilize a cloud of local tensors and apply the classic CANDECOMP/PARAFAC (CP) decomposition to factorize each tensor into triple vectors that express local feature distributions along spatial axes and compactly encode a local neural field.
We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and learned world model to achieve strong exploration capabilities and high data efficiency.
Unlike these studies, our 3Deformer is a non-training and common framework, which only requires supervision of readily-available semantic images, and is compatible with editing various objects unlimited by datasets.
For real-world experiments, AnyTeleop can outperform a previous system that was designed for a specific robot hardware with a higher success rate, using the same robot.
Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution.
Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world.
Recent studies on visual reinforcement learning (visual RL) have explored the use of 3D visual representations.
In light of this, we propose to decompose a reasoning verification process into a series of step-by-step subprocesses, each only receiving their necessary context and premises.
Image ad understanding is a crucial task with wide real-world applications.
We present a method for generating high-quality watertight manifold meshes from multi-view input images.
Hand-eye calibration is a critical task in robotics, as it directly affects the efficacy of critical operations such as manipulation and grasping.
We propose TensoIR, a novel inverse rendering approach based on tensor factorization and neural fields.
3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images.
We study generalizable policy learning from demonstrations for complex low-level control tasks (e. g., contact-rich object manipulations).
Reinforcement learning approaches for dexterous rigid object manipulation would struggle in this setting due to the complexity of physics interaction with deformable objects.
Although reinforcement learning has seen tremendous success recently, this kind of trial-and-error learning can be impractical or inefficient in complex environments.
Under the Lagrangian view, we parameterize the scene motion by tracking the trajectory of particles on objects.
We address efficient and structure-aware 3D scene representation from images.
1 code implementation • 9 Feb 2023 • Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, Hao Su
Generalizable manipulation skills, which can be composed to tackle long-horizon and complex daily chores, are one of the cornerstones of Embodied AI.
Our experiments show that DiF leads to improvements in approximation quality, compactness, and training time when compared to previous fast reconstruction methods.
We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework.
In this paper, we examine the effectiveness of pre-training for visuo-motor control tasks.
Generalizable 3D part segmentation is important but challenging in vision and robotics.
We empirically evaluate our method using an Allegro Hand to grasp novel objects in both simulation and real world.
Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so.
Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones.
In the abstract environment, complex dynamics such as physical manipulation are removed, making abstract trajectories easier to generate.
We study how choices of input point cloud coordinate frames impact learning of manipulation skills from 3D point clouds.
To tackle the entire task, prior work chains multiple stationary manipulation skills with a point-goal navigation skill, which are learned individually on subtasks.
Generalization in deep reinforcement learning over unseen environment variations usually requires policy learning over a large set of diverse training variations.
Approximate convex decomposition aims to decompose a 3D shape into a set of almost convex components, whose convex hulls can then be used to represent the input shape.
Extensive experimental results suggest that: 1) on multi-stage tasks that are infeasible for the vanilla differentiable physics solver, our approach discovers contact points that efficiently guide the solver to completion; 2) on tasks where the vanilla solver performs sub-optimally or near-optimally, our contact point discovery method performs better than or on par with the manipulation performance obtained with handcrafted contact points.
We propose to perform imitation learning for dexterous manipulation with multi-finger robot hand from human demonstrations, and transfer the policy to the real robot hand.
We demonstrate that NeRFusion achieves state-of-the-art quality on both large-scale indoor and small-scale object scenes, with substantially faster reconstruction than NeRF and other recent methods.
We demonstrate that applying traditional CP decomposition -- that factorizes tensors into rank-one components with compact vectors -- in our framework leads to improvements over vanilla NeRF.
Ranked #3 on Novel View Synthesis on X3D
Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as computational budget for planning increases.
First, we demonstrate the transferability of our method to out-of-distribution real data by using a mixed domain learning strategy.
We approach text-to-image generation by combining the power of the retrained CLIP representation with an off-the-shelf image generator (GANs), optimizing in the latent space of GAN to find images that achieve maximum CLIP score with the given input text.
Ranked #47 on Text-to-Image Generation on COCO
Manga is a fashionable Japanese-style comic form that is composed of black-and-white strokes and is generally displayed as raster images on digital devices.
Here we propose SAPIEN Manipulation Skill Benchmark (ManiSkill) to benchmark manipulation skills over diverse objects in a full-physics simulator.
Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL in environments with unseen visuals.
Contrary to the vast literature in modeling, perceiving, and understanding agent-object (e. g., human-object, hand-object, robot-object) interaction in computer vision and robotics, very few past works have studied the task of object-object interaction, which also plays an important role in robotic manipulation and planning tasks.
We propose a teleoperation system that uses a single RGB-D camera as the human motion capture device.
We propose JetNet as a novel point-cloud-style dataset for the ML community to experiment with, and set MPGAN as a benchmark to improve upon for future generative models.
For the first time, we propose a unified framework that can handle 9DoF pose tracking for novel rigid object instances as well as per-part pose tracking for articulated objects from known categories.
Experimental results suggest that 1) RL-based approaches struggle to solve most of the tasks efficiently; 2) gradient-based approaches, by optimizing open-loop control sequences with the built-in differentiable physics engine, can rapidly find a solution within tens of iterations, but still fall short on multi-stage tasks that require long-term planning.
We present MVSNeRF, a novel neural rendering approach that can efficiently reconstruct neural radiance fields for view synthesis.
We introduce GNeRF, a framework to marry Generative Adversarial Networks (GAN) with Neural Radiance Field (NeRF) reconstruction for the complex scenarios with unknown and even randomly initialized camera poses.
We achieve this by introducing a 3D-to-2D texture mapping (or surface parameterization) network into volumetric representations.
Given a collection of 3D meshes of a category and their deformation handles (control points), our method learns a set of meta-handles for each shape, which are represented as combinations of the given handles.
Medical robots can play an important role in mitigating the spread of infectious diseases and delivering quality care to patients during the COVID-19 pandemic.
Many applications of unpaired image-to-image translation require the input contents to be preserved semantically during translations.
However, there remains a much more difficult and under-explored issue on how to generalize the learned skills over unseen object categories that have very different shape geometry distributions.
Current graph neural networks (GNNs) lack generalizability with respect to scales (graph sizes, graph diameters, edge weights, etc..) when solving many graph analysis problems.
Quick Response (QR) code is one of the most worldwide used two-dimensional codes.~Traditional QR codes appear as random collections of black-and-white modules that lack visual semantics and aesthetic elements, which inspires the recent works to beautify the appearances of QR codes.
no code implementations • 3 Nov 2020 • Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, Hao Su
In the rearrangement task, the goal is to bring a given physical environment into a specified state.
Current graph neural networks (GNNs) lack generalizability with respect to scales (graph sizes, graph diameters, edge weights, etc..) when solving many graph analysis problems.
To alleviate the resource constraint for real-time point cloud applications that run on edge devices, in this paper we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds.
To fully make use of our deep neural network, we partition the scene space into an adaptive hierarchical grid, in which we apply our network to reconstruct high-quality sampling distributions for any local region in the scene.
3D shape completion for real data is important but challenging, since partial point clouds acquired by real-world sensors are usually sparse, noisy and unaligned.
Estimating relative camera poses from consecutive frames is a fundamental problem in visual odometry (VO) and simultaneous localization and mapping (SLAM), where classic methods consisting of hand-crafted features and sampling-based outlier rejection have been a dominant choice for over a decade.
Huge imbalance of different scenes' sample numbers seriously reduces Synthetic Aperture Radar (SAR) ship detection accuracy.
To address this issue, we propose a SofGAN image generator to decouple the latent space of portraits into two subspaces: a geometry space and a texture space.
We further study how different evaluation metrics weigh the sampling pattern against the geometry and propose several perceptual metrics forming a sampling spectrum of metrics.
This network is easy to incorporate in many previous photon mapping methods (by simply swapping the kernel density estimator) and can produce high-quality reconstructions of complex global illumination effects like caustics with an order of magnitude fewer photons compared to previous photon mapping methods.
Manga is a world popular comic form originated in Japan, which typically employs black-and-white stroke lines and geometric exaggeration to describe humans' appearances, poses, and actions.
To achieve this task, a simulated environment with physically realistic simulation, sufficient articulated objects, and transferability to the real robot is indispensable.
We address the problem of discovering 3D parts for objects in unseen categories.
In contrast, we propose adaptive thin volumes (ATVs); in an ATV, the depth hypothesis of each plane is spatially varying, which adapts to the uncertainties of previous per-pixel depth predictions.
Ranked #9 on 3D Reconstruction on DTU
Learning to encode differences in the geometry and (topological) structure of the shapes of ordinary objects is key to generating semantically plausible variations of a given shape, transferring edits from one shape to another, and many other applications in 3D content creation.
Fusion of 2D images and 3D point clouds is important because information from dense images can enhance sparse point clouds.
To perform well, the policy must infer the task identity from collected transitions by modelling its dependency on states, actions and rewards.
Based on the claim, we propose to learn the transition model by matching the distributions of multi-step rollouts sampled from the transition model and the real ones via WGAN.
Combining ideas from Batch RL and Meta RL, we propose tiMe, which learns distillation of multiple value functions and MDP embeddings from only existing data.
The importance of training robust neural network grows as 3D data is increasingly utilized in deep learning for vision tasks in robotics, drone control, and autonomous driving.
An agent that has well understood the environment should be able to apply its skills for any given goals, leading to the fundamental problem of learning the Universal Value Function Approximator (UVFA).
We introduce StructureNet, a hierarchical graph network which (i) can directly encode shapes represented as such n-ary graphs; (ii) can be robustly trained on large and complex shape families; and (iii) can be used to generate a great diversity of realistic structured shape geometries.
A human-specified design choice in domain randomization is the form and parameters of the distribution of simulated environments.
Reconstruction of geometry based on different input modes, such as images or point clouds, has been instrumental in the development of computer aided design and computer graphics.
We present a preliminary evaluation of adversarial attacks on deep 3D point cloud classifiers, namely PointNet and PointNet++, by evaluating both white-box and black-box adversarial attacks that were proposed for 2D images and extending those attacks to reduce the perceptibility of the perturbations in 3D space.
To reduce memory footprint and run-time latency, techniques such as neural network pruning and binarization have been explored separately.
We present PartNet: a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information.
Ranked #3 on 3D Instance Segmentation on PartNet
We propose an adversarial defense method that achieves state-of-the-art performance among attack-agnostic adversarial defense methods while also maintaining robustness to input resolution, scale of adversarial perturbation, and scale of dataset size.
Specifically, we first segment each car with a pre-trained Mask R-CNN, and then regress towards its 3D pose and shape based on a deformable 3D car model with or without using semantic keypoints.
In this paper, we investigate the problem of transfer learning across environments with different dynamics while accomplishing the same task in the continuous control domain.
In this paper, we explore how the observation of different articulation states provides evidence for part structure and motion of 3D objects.
Binary neural networks (BNN) have been studied extensively since they run dramatically faster at lower memory and power consumption than floating-point networks, thanks to the efficiency of bit operations.
In addition, we also find that a progressive training strategy can foster a better neural network for the video recognition task than blindly pooling the distinct sources of geometry cues together.
Even though our shapes have independent discretizations and no functional correspondences are provided, the network is able to generate latent bases, in a consistent order, that reflect the shared semantic structure among the shapes.
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes.
Ranked #1 on Object Localization on KITTI Cars Moderate
1 code implementation • 17 Oct 2017 • Li Yi, Lin Shao, Manolis Savva, Haibin Huang, Yang Zhou, Qirui Wang, Benjamin Graham, Martin Engelcke, Roman Klokov, Victor Lempitsky, Yuan Gan, Pengyu Wang, Kun Liu, Fenggen Yu, Panpan Shui, Bingyang Hu, Yan Zhang, Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Minki Jeong, Jaehoon Choi, Changick Kim, Angom Geetchandra, Narasimha Murthy, Bhargava Ramu, Bharadwaj Manda, M. Ramanathan, Gautam Kumar, P Preetham, Siddharth Srivastava, Swati Bhugra, Brejesh lall, Christian Haene, Shubham Tulsiani, Jitendra Malik, Jared Lafer, Ramsey Jones, Siyuan Li, Jie Lu, Shi Jin, Jingyi Yu, Qi-Xing Huang, Evangelos Kalogerakis, Silvio Savarese, Pat Hanrahan, Thomas Funkhouser, Hao Su, Leonidas Guibas
We introduce a large-scale 3D shape understanding benchmark using data and annotation from ShapeNet 3D object database.
The combinatorial nature of part arrangements poses another challenge, since the retrieval network is not a function: several complements can be appropriate for the same input.
By exploiting metric space distances, our network is able to learn local features with increasing contextual scales.
Ranked #2 on Semantic Segmentation on Toronto-3D L002
Rendered with realistic environment maps, millions of synthetic images of objects and their corresponding albedo, shading, and specular ground-truth images are used to train an encoder-decoder CNN.
Important high-level vision tasks such as human-object interaction, image captioning and robotic manipulation require rich semantic descriptions of objects at part level.
Our final solution is a conditional shape sampler, capable of predicting multiple plausible 3D point clouds from an input image.
Ranked #2 on 3D Reconstruction on Data3D−R2N2 (using extra training data)
To enable the prediction of vertex functions on them by convolutional neural networks, we resort to spectral CNN method that enables weight sharing by parameterizing kernels in the spectral domain spanned by graph laplacian eigenbases.
Ranked #54 on 3D Part Segmentation on ShapeNet-Part
Point cloud is an important type of geometric data structure.
Ranked #1 on Semantic Segmentation on S3DIS Area5 (Number of params metric)
We present a learning framework for abstracting complex shapes by learning to assemble objects using 3D volumetric primitives.
Despite its successful progress in classic point-to-point search, there are few studies regarding point-to-hyperplane search, which has strong practical capabilities of scaling up in many applications like active learning with SVMs.
Each field probing filter is a set of probing points --- sensors that perceive the space.
Ranked #5 on 3D Object Recognition on ModelNet40
Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations.
Ranked #3 on 3D Object Recognition on ModelNet40
Human 3D pose estimation from a single image is a challenging task with numerous applications.
14 code implementations • 9 Dec 2015 • Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, Fisher Yu
We present ShapeNet: a richly-annotated, large-scale repository of shapes represented by 3D CAD models of objects.
Comparing two images from different views has been a long-standing challenging problem in computer vision, as visual features are not stable under large view point changes.
Given i. i. d samples from some unknown continuous density on hyper-rectangle $[0, 1]^d$, we attempt to learn a piecewise constant function that approximates this underlying density non-parametrically.
In this paper, we proposed a pose estimation system based on rendered image training set, which predicts the pose of objects in real image, with knowledge of object category and tight bounding box.
Object viewpoint estimation from 2D images is an essential task in computer vision.
In this paper, given a single input image of an object, we synthesize new features for other views of the same object.
12 code implementations • 1 Sep 2014 • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li Fei-Fei
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images.
Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meanings.