The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions.
A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition.
2 code implementations • 12 May 2022 • Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification.
Ranked #1 on One-Shot Object Detection on COCO
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built.
Neural Networks require large amounts of memory and compute to process high resolution images, even when only a small part of the image is actually informative for the task at hand.
We propose a method to learn image representations from uncurated videos.
Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features.
We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI).
Ranked #14 on Image Classification on VTAB-1k (using extra training data)
We propose a novel method for learning convolutional neural image representations without manual supervision.
Image representations, from SIFT and bag of visual words to Convolutional Neural Networks (CNNs) are a crucial component of almost all computer vision systems.