Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks, outperforms alternative methods of extracting knowledge from diffusion models, and is ranked #1 in image classification on ObjectNet (ImageNet classes).
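To make the mechanism concrete, here is a minimal sketch of the Diffusion Classifier idea: score each candidate class by the diffusion model's noise-prediction error when conditioned on that class, and predict the class with the lowest expected error. The `unet`, `encode_text`, and `alphas_cumprod` names are hypothetical stand-ins for a text-conditional diffusion model's components, not the paper's actual API.

```python
import torch

@torch.no_grad()
def diffusion_classify(x, class_prompts, unet, encode_text, alphas_cumprod, n_trials=64):
    """Zero-shot classification with a text-conditional diffusion model (sketch).

    For each class prompt, noise the input image to random timesteps, ask the
    model to predict the noise under that class conditioning, and accumulate
    the squared prediction error. The class with the lowest average error wins.
    """
    device = x.device
    errors = []
    for prompt in class_prompts:
        cond = encode_text(prompt)                        # class-conditioning embedding
        err = 0.0
        for _ in range(n_trials):
            t = torch.randint(0, len(alphas_cumprod), (1,), device=device)
            a = alphas_cumprod[t].view(-1, 1, 1, 1)
            noise = torch.randn_like(x)
            x_t = a.sqrt() * x + (1 - a).sqrt() * noise   # forward-diffuse to step t
            pred = unet(x_t, t, cond)                     # predicted noise eps_theta
            err += torch.mean((pred - noise) ** 2).item()
        errors.append(err / n_trials)
    return int(torch.tensor(errors).argmin())             # lowest denoising error wins
```

In practice, reusing the same (timestep, noise) draws across all classes lowers the variance of the comparison relative to sampling them independently per class.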
In our work, we find evidence that these losses alone are insufficient for scene decomposition unless architectural inductive biases are also taken into account.
This paper explores self-supervised learning of amodal 3D feature representations from posed RGB and RGB-D images and videos, agnostic to object and scene semantics, and evaluates the resulting scene representations on the downstream tasks of visual correspondence, object tracking, and object detection.
Object motion predictions are computed by a graph neural network that operates over the object features extracted from the 3D neural scene representation.
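As a rough illustration of this design, the sketch below runs one round of message passing over a fully connected object graph and regresses a 3D motion vector per object. The `ObjectMotionGNN` name, the layer sizes, and the fully connected topology are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ObjectMotionGNN(nn.Module):
    """Sketch of a graph network predicting per-object motion.

    Nodes are objects (features extracted from a 3D neural scene
    representation); every object pair is connected by an edge.
    """
    def __init__(self, feat_dim=128, hidden=256, motion_dim=3):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.node_mlp = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, motion_dim))

    def forward(self, obj_feats):                        # (N, feat_dim) object features
        n = obj_feats.size(0)
        src = obj_feats.unsqueeze(1).expand(n, n, -1)    # sender features per edge
        dst = obj_feats.unsqueeze(0).expand(n, n, -1)    # receiver features per edge
        messages = self.edge_mlp(torch.cat([src, dst], dim=-1))
        agg = messages.sum(dim=0)                        # aggregate incoming messages
        return self.node_mlp(torch.cat([obj_feats, agg], dim=-1))  # (N, 3) motion
```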
We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification.
We can compare the 3D feature maps of two objects by searching for an alignment across scales and 3D rotations; as a result, we can estimate pose and scale changes without the need for 3D pose annotations.
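A minimal sketch of such an alignment search, assuming the feature maps are dense 3D volumes of shape (1, C, D, H, W): exhaustively resample one map over a small grid of candidate rotations and scales, and keep the transform whose resampled map best correlates with the other. Restricting rotations to yaw and using a fixed scale set are simplifying assumptions for illustration, not the paper's full search.

```python
import math
import torch
import torch.nn.functional as F

def align_feature_maps(f_query, f_ref, n_angles=36, scales=(0.8, 1.0, 1.25)):
    """Brute-force alignment of two 3D feature volumes (sketch).

    Warps `f_ref` over candidate yaw rotations and uniform scales and scores
    each candidate by inner-product similarity with `f_query`. Returns the
    best (similarity, yaw_radians, scale) triple.
    """
    best = (-float("inf"), 0.0, 1.0)
    for s in scales:
        for k in range(n_angles):
            ang = 2 * math.pi * k / n_angles
            c, si = math.cos(ang), math.sin(ang)
            # 3x4 affine: yaw rotation about the vertical axis plus uniform scale
            theta = torch.tensor([[[ c / s, 0.0, si / s, 0.0],
                                   [ 0.0, 1.0 / s, 0.0,  0.0],
                                   [-si / s, 0.0, c / s, 0.0]]])
            grid = F.affine_grid(theta, f_ref.shape, align_corners=False)
            warped = F.grid_sample(f_ref, grid, align_corners=False)
            score = (f_query * warped).sum().item()      # inner-product similarity
            if score > best[0]:
                best = (score, ang, s)
    return best
```

The argmax over this search yields the relative pose and scale estimate directly, which is why no 3D pose annotations are needed.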
We propose associating language utterances to 3D visual abstractions of the scene they describe.