Specifically, we decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions.
Unlike traditional multi-view stereo approaches, the neural implicit surface-based methods leverage neural networks to represent 3D scenes as signed distance functions (SDFs).
This paper proposes using an implicit feature representation of the scene elements to distinguish a physically plausible alignment of humans and objects from an implausible one.
Prior works have demonstrated that implicit representations trained only for reconstruction tasks typically generate encodings that are not useful for semantic tasks.
Our approach encodes 3D geometric features during training and renders dense novel view features at test time to refine estimated camera poses from arbitrary APR methods.
The distinct data modality offered by high-resolution RGB images and low-resolution depth maps captured on a mobile device, when combined with precise 3D geometry annotations, presents a unique opportunity for future research on high-fidelity 3D reconstruction.
Recent advances in this direction demonstrate the possibility of jointly optimising a NeRF and camera poses in forward-facing scenes.
This paper proposes a novel framework, ObjectSDF, to build an object-compositional neural implicit representation with high fidelity in 3D reconstruction and object representation.
Dense 3D reconstruction from a stream of depth images is the key to many mixed reality and robotic applications.
Localizing objects and estimating their extent in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics.
Given a new RGB frame, MOLTR firstly applies a monocular 3D detector to localise objects of interest and extract their shape codes that represent the object shapes in a learned embedding space.
We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers object location, pose and shape in a coarse-to-fine manner.
While sparse point-based SLAM methods provide accurate camera localization, the generated maps lack semantic information.
Given a single image at an arbitrary viewpoint, a CNN predicts multiple surfaces, each in a canonical location relative to the object.
Despite learning based methods showing promising results in single view depth estimation and visual odometry, most existing approaches treat the tasks in a supervised manner.