Compositionality is a common property of many modalities, including natural language and images, but the compositional generalization of multi-modal models is not well understood.
Bird's-Eye View (BEV) features are popular intermediate scene representations shared by the 3D backbone and the detector head in LiDAR-based object detectors.
The first task focuses on learning semantic information by sorting local groups of points in the scene into a globally consistent set of semantically meaningful clusters using contrastive learning.
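The clustering objective described above can be sketched as a contrastive assignment loss: each local group's feature is pulled toward its assigned cluster prototype and pushed away from the others. This is a minimal numpy sketch of that idea; the function name, the prototype-based formulation, and the temperature are assumptions, not the paper's exact method.

```python
import numpy as np

def cluster_contrastive_loss(group_feats, prototypes, assignments, temperature=0.1):
    """Contrastive loss pulling each local point-group feature toward its
    assigned semantic cluster prototype (hypothetical sketch)."""
    # Normalize group features and cluster prototypes to unit length.
    g = group_feats / np.linalg.norm(group_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = g @ p.T / temperature  # similarity of each group to every cluster
    # Numerically stable softmax cross-entropy against the assigned cluster.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(assignments)), assignments].mean()
```

Minimizing this loss makes the cluster assignments globally consistent, since every group in the scene is scored against the same shared set of prototypes.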
Reconstructing 3D objects is an important computer vision task that has wide application in AR/VR.
In this paper, we present FvOR, a learning-based object reconstruction method that predicts accurate 3D models given a few images with noisy input poses.
This task is challenging because 3D scenes exhibit diverse patterns, ranging from continuous ones, such as object sizes and the relative poses between pairs of shapes, to discrete patterns, such as occurrence and co-occurrence of objects with symmetrical relationships.
Our approach builds on an approximation of the as-rigid-as-possible (ARAP) deformation energy.
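For reference, the standard ARAP energy measures how far a deformation deviates from a local rotation at every vertex. Writing $p_i$ for the rest-pose vertex positions, $p'_i$ for the deformed positions, $\mathcal{N}(i)$ for the one-ring neighbors of vertex $i$, $w_{ij}$ for (typically cotangent) edge weights, and $R_i$ for a per-vertex rotation, the energy is:

```latex
E(S') \;=\; \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij}\,
\bigl\| (p'_i - p'_j) - R_i (p_i - p_j) \bigr\|^2
```

The rotations $R_i$ are the minimizers of this expression for fixed positions, which is what makes the energy nonlinear and motivates approximating it.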
Pretraining on large labeled datasets is a prerequisite for achieving good performance in many computer vision tasks, such as 2D object recognition and video classification.
We show how to convert the predicted geometric primitives into object proposals by defining a distance function between an object and the geometric primitives.
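A distance function of this kind can be illustrated with a minimal numpy sketch: measure each object point's distance to its nearest primitive and average over the object. Spheres are used here purely as a stand-in primitive; the actual primitive types and aggregation in the paper are assumptions.

```python
import numpy as np

def point_to_sphere_distance(points, center, radius):
    """Unsigned distance from each point (N, 3) to a sphere's surface."""
    return np.abs(np.linalg.norm(points - center, axis=1) - radius)

def object_primitive_distance(object_points, primitives):
    """Average, over the object's points, of the distance to the nearest
    primitive. `primitives` is a list of (center, radius) spheres, used
    here as a hypothetical stand-in for the predicted primitives."""
    dists = np.stack(
        [point_to_sphere_distance(object_points, c, r) for c, r in primitives],
        axis=1,
    )
    return dists.min(axis=1).mean()
```

Thresholding such a distance is one way to turn a set of geometric primitives into object proposals: points (or candidate boxes) whose distance to the primitives is small are grouped into a proposal.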
We formulate this problem as joint learning of multiple copies of the same network architecture and enforce the network weights to be shared across these networks.
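Enforcing shared weights across multiple copies of the same architecture can be done by letting every copy reference the same parameter object, so that one update propagates to all copies. This is a minimal numpy sketch of that mechanism (the class and training setup are hypothetical, not the paper's implementation).

```python
import numpy as np

class LinearNet:
    """Toy one-layer network; `weights` may be shared across copies."""
    def __init__(self, weights):
        self.weights = weights  # stored by reference, not copied

    def forward(self, x):
        return x @ self.weights

# Enforce weight sharing: every copy receives the SAME parameter array.
shared_w = np.random.randn(4, 2)
copies = [LinearNet(shared_w) for _ in range(3)]

# An in-place update to the shared parameters affects all copies at once.
shared_w += 0.1
```

In a deep-learning framework the same effect is achieved by reusing one module instance (or tying parameters) across the copies, so gradients from every copy accumulate into a single set of weights.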
Optimizing a network of maps among a collection of objects/domains (i.e., map synchronization) is a central problem in computer vision and many related fields.
We show a principled way to train this model by combining discriminator losses for both a 3D object arrangement representation and a 2D image-based representation.
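Combining the two discriminator losses can be sketched as a weighted sum of a per-representation GAN loss, one over the 3D object arrangement and one over the 2D image rendering. The hinge loss and the weights below are common GAN choices used for illustration; the paper's exact losses and weighting are assumptions.

```python
import numpy as np

def hinge_d_loss(real_scores, fake_scores):
    """Hinge discriminator loss (a standard GAN formulation, assumed here)."""
    return (np.maximum(0.0, 1.0 - real_scores).mean()
            + np.maximum(0.0, 1.0 + fake_scores).mean())

def combined_discriminator_loss(real3d, fake3d, real2d, fake2d,
                                w3d=1.0, w2d=1.0):
    """Weighted sum of the 3D-arrangement and 2D-image discriminator
    losses; the weights are hypothetical hyperparameters."""
    return (w3d * hinge_d_loss(real3d, fake3d)
            + w2d * hinge_d_loss(real2d, fake2d))
```

Training both discriminators jointly lets gradients from the image domain regularize the 3D arrangement and vice versa.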