Self-supervised methods for learning object-centric representations have recently been applied successfully to various datasets.
We discuss the results and limitations of our approach in detail, and outline potential ways to overcome these limitations along with directions for future work.
2 code implementations • 6 Mar 2023 • Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence
Large language models excel at a wide range of complex tasks.
Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis.
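As a toy illustration of this design, the following sketch shows a pose encoder that "peeks" at the target image and produces a low-dimensional latent pose, which the decoder then consumes in place of an explicit camera pose. All shapes, weights, and function names here are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for learned weights (assumed shapes, not the real model).
W_pose = rng.normal(size=(16 * 16 * 3, 8)) * 0.01   # image -> 8-dim latent pose
W_dec = rng.normal(size=(8 + 32, 16 * 16 * 3)) * 0.01  # (pose + scene) -> pixels

def pose_encoder(target_img):
    """'Peek' at the target image and produce a latent pose embedding."""
    return np.tanh(target_img.reshape(-1) @ W_pose)   # (8,)

def decode(scene_latent, pose_latent):
    """Render a view conditioned on the latent pose, not a camera matrix."""
    h = np.concatenate([scene_latent, pose_latent])   # (40,)
    return (h @ W_dec).reshape(16, 16, 3)

# Usage: embed the target view's pose, then decode a view from it.
pose = pose_encoder(rng.random((16, 16, 3)))
img = decode(rng.normal(size=32), pose)
```

The key design choice sketched here is that pose information enters only through the learned embedding, so no ground-truth camera poses are needed at training time.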
While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge.
The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions.
A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition.
1 code implementation • Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details.
1 code implementation • Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, Andrea Tagliasacchi
In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass.
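The single feed-forward structure described above can be illustrated with a minimal sketch: input views are encoded into an unordered set of latent tokens, and a target ray queries that set via attention to produce a color. Everything below (shapes, weight names, patch size, the attention form) is an assumed toy stand-in, not the actual SRT architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32                                     # latent token width (assumed)
W_enc = rng.normal(size=(48, D)) * 0.1     # maps flattened 4x4x3 patches -> tokens
W_q = rng.normal(size=(6, D)) * 0.1        # maps a ray (origin + direction) -> query
W_out = rng.normal(size=(D, 3)) * 0.1      # maps attended latent -> RGB

def encode(images):
    """Input views -> unordered 'set-latent scene representation'."""
    tokens = []
    for img in images:                      # img: (H, W, 3)
        H, W, _ = img.shape
        patches = img.reshape(H // 4, 4, W // 4, 4, 3)
        patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, 48)
        tokens.append(patches @ W_enc)
    return np.concatenate(tokens, axis=0)   # (num_tokens, D)

def render(scene_set, ray):
    """Query the set latent with one target ray via attention; output RGB."""
    q = ray @ W_q                           # (D,)
    att = scene_set @ q / np.sqrt(D)
    att = np.exp(att - att.max())
    att /= att.sum()                        # softmax over latent tokens
    return (att @ scene_set) @ W_out        # (3,)

# Usage: one feed-forward pass over all input views, then render a ray.
imgs = [rng.random((16, 16, 3)) for _ in range(3)]
scene = encode(imgs)
rgb = render(scene, rng.random(6))
```

The point of the sketch is the interface: a set of tokens summarizes the scene once, after which arbitrarily many novel rays can be decoded without re-running the encoder.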
We present NeSF, a method for producing 3D semantic fields from posed RGB images alone.
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built.
We address this problem by introducing a global, set-based contrastive loss: instead of contrasting individual slot representations against one another, we aggregate the representations and contrast the joined sets against one another.
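The set-based loss described above can be sketched as follows. This is a hedged toy version, assuming slot tensors of shape (batch, num_slots, dim), mean-pooling as one plausible aggregation, and a standard InfoNCE-style cross-entropy over the batch; the function name, temperature, and pooling choice are illustrative assumptions:

```python
import numpy as np

def set_contrastive_loss(slots_a, slots_b, temperature=0.1):
    """Global, set-based contrastive loss (toy sketch).

    slots_a, slots_b: (batch, num_slots, dim) slot representations of two
    views of the same scenes. Instead of contrasting individual slots,
    each scene's slot set is aggregated (here: mean-pooled) and the
    aggregated sets are contrasted against one another across the batch.
    """
    za = slots_a.mean(axis=1)                         # aggregate the set
    zb = slots_b.mean(axis=1)
    za = za / np.linalg.norm(za, axis=-1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=-1, keepdims=True)
    logits = za @ zb.T / temperature                  # (batch, batch)
    # Cross-entropy with matching sets on the diagonal as positives.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Usage: loss over two augmented views of the same batch of scenes.
rng = np.random.default_rng(0)
loss = set_contrastive_loss(rng.random((4, 7, 16)), rng.random((4, 7, 16)))
```

Aggregating before contrasting avoids having to match individual slots across views, which sidesteps the permutation ambiguity of slot ordering.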
In order to meet the diverse challenges in solving many real-world problems, an intelligent agent has to be able to dynamically construct a model of its environment.
Human perception is structured around objects which form the basis for our higher-level cognition and impressive systematic generalization abilities.
Common-sense physical reasoning is an essential ingredient for any intelligent agent operating in the real world.
We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features.
Hyperparameter selection generally relies on running multiple full training trials, with selection based on validation set performance.
Disentangled distributed representations of data are desirable for machine learning, since they are more expressive and can generalize from fewer examples.
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success.
Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995.
Sequence prediction and classification are ubiquitous and challenging problems in machine learning that can require identifying complex dependencies between temporally distant inputs.