In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree.
We show how to leverage recurrence by conditioning the latent tokens at each forward pass of the reverse diffusion process with those from prior computation, i. e. latent self-conditioning.
Ranked #1 on Image Generation on ImageNet 64x64
This paper proposes a self-supervised objective for learning representations that localize objects under occlusion - a property known as object permanence.
Our experiments demonstrate that, despite only capturing a small subset of the objects that move, this signal is enough to generalize to segment both moving and static instances of dynamic objects.
A range of video modeling tasks, from optical flow to multiple object tracking, share the same fundamental challenge: establishing space-time correspondence.
Learning robotic manipulation tasks using reinforcement learning with sparse rewards is currently impractical due to the outrageous data requirements.
In experiments on vision-based navigation and manipulation domains, we show that the algorithm allows for unsupervised meta-learning that transfers to downstream tasks specified by hand-crafted reward functions and serves as pre-training for more efficient supervised meta-learning of test task distributions.
We introduce a self-supervised method for learning visual correspondence from unlabeled video.
A key challenge in complex visuomotor control is learning abstract representations that are effective for specifying goals, planning, and generalization.
We find that the representations learned are not only effective for goal-directed visual imitation via gradient-based trajectory optimization, but can also provide a metric for specifying goals using images.
We introduce a variety of models, trained on a supervised image captioning corpus to predict the image features for a given caption, to perform sentence representation grounding.
With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal.
Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts.
Ranked #2 on Zero-Shot Transfer Image Classification on SUN
Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding.
We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems.