Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models.
Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of the learned representations.
Self-supervised learning, dubbed the dark matter of intelligence, is a promising path to advance machine learning.
Our approach applies the formalism of Lie groups to capture continuous transformations and thereby improve models' robustness to distributional shifts.
We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in factors that are present during training.
As datasets and models become increasingly large, distributed training has become a necessary component to allow deep neural networks to train in reasonable amounts of time.
Finally, we experiment with initializing the T-CNN from a partially trained CNN, and find that it reaches better performance than the corresponding hybrid model trained from scratch, while reducing training time.
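A minimal sketch of what initializing one model from a partially trained CNN can look like in PyTorch: copy whichever parameters share names and shapes, and leave the rest at their fresh initialization. The helper name `init_from_partial` and the models `hybrid_model` / `pretrained_cnn` are illustrative stand-ins, not the paper's actual architectures.

```python
import torch

def init_from_partial(hybrid_model, pretrained_cnn):
    """Copy compatible weights from a partially trained CNN into a new model."""
    source = pretrained_cnn.state_dict()
    target = hybrid_model.state_dict()
    # Keep only tensors that exist in both models with matching shapes.
    transferable = {k: v for k, v in source.items()
                    if k in target and v.shape == target[k].shape}
    target.update(transferable)
    hybrid_model.load_state_dict(target)
    return hybrid_model
```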
We initialise the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information.
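A hedged sketch (not the authors' code) of a gated positional self-attention head: a single learned gate blends content-based attention with a learned positional attention map, so a head can start local and "escape locality" as the gate moves toward content. The class name `GatedPSAHead`, the parameter `lambda_gate`, and the random positional initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPSAHead(nn.Module):
    def __init__(self, dim, num_patches):
        super().__init__()
        self.qk = nn.Linear(dim, 2 * dim, bias=False)   # content query/key projections
        self.v = nn.Linear(dim, dim, bias=False)
        # Learned scores over patch positions; the paper initialises these to
        # mimic a local convolutional pattern, random here for brevity.
        self.pos_scores = nn.Parameter(torch.randn(num_patches, num_patches))
        # Gating parameter: sigmoid(lambda_gate) is the weight on positional
        # attention, so a positive start keeps the head local at first.
        self.lambda_gate = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):                               # x: (batch, patches, dim)
        q, k = self.qk(x).chunk(2, dim=-1)
        content_attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        pos_attn = F.softmax(self.pos_scores, dim=-1)
        gate = torch.sigmoid(self.lambda_gate)
        attn = (1.0 - gate) * content_attn + gate * pos_attn
        return attn @ self.v(x)

# Example: head = GatedPSAHead(dim=64, num_patches=196); out = head(torch.randn(2, 196, 64))
```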
Methods for understanding the decisions of and mechanisms underlying deep neural networks (DNNs) typically rely on building intuition by emphasizing sensory or semantic features of individual examples.
Furthermore, the input-unit gradient is more variable across samples and units in high-selectivity networks compared to low-selectivity networks.
Humans can learn and reason under substantial uncertainty in a space of infinitely many concepts, including structured relational concepts ("a scene with objects that have the same color") and ad-hoc categories defined through goals ("objects that could fall on one's head").
Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over-specialized to the target task.
For ResNet20 trained on CIFAR10 we could reduce class selectivity by a factor of 2.5 with no impact on test accuracy, and reduce it nearly to zero with only a small ($\sim$2%) drop in test accuracy.
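For concreteness, one common way to quantify a unit's class selectivity is the index (mu_max - mu_rest) / (mu_max + mu_rest), computed from class-conditional mean activations; the sketch below assumes this is the notion of selectivity being reduced, and the tensor shapes and epsilon are illustrative.

```python
import torch

def class_selectivity(activations, labels, num_classes, eps=1e-7):
    """activations: (num_samples, num_units); labels: (num_samples,) integer class ids."""
    # Mean activation of every unit for each class.
    class_means = torch.stack([
        activations[labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                    # (num_classes, num_units)
    mu_max, _ = class_means.max(dim=0)                    # most-preferred class per unit
    # Mean over the remaining (non-preferred) classes for each unit.
    mu_rest = (class_means.sum(dim=0) - mu_max) / (num_classes - 1)
    return (mu_max - mu_rest) / (mu_max + mu_rest + eps)  # one index per unit
```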
We seek to learn a representation on a large annotated data source that generalizes to a target domain using limited new supervision.
In this work, we investigate the use of standard pruning methods, developed primarily for supervised learning, for networks trained without labels (i.e., on self-supervised tasks).
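As an illustration of the kind of standard pruning method meant here, the snippet below applies magnitude-based unstructured pruning from `torch.nn.utils.prune` to a label-free encoder; the `encoder` and the 30% pruning ratio are assumed stand-ins, not the paper's setup.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

encoder = nn.Sequential(                 # stand-in for a self-supervised backbone
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1),
)

# Remove the 30% smallest-magnitude weights in every conv layer.
for module in encoder.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning permanent
```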
Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures.
We leverage this scaling to train an agent for 2.5 billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.
The lottery ticket hypothesis argues that neural networks contain sparse subnetworks, which, if appropriately initialized (the winning tickets), are capable of matching the accuracy of the full network when trained in isolation.
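A minimal sketch of the winning-ticket procedure the hypothesis refers to: train, prune the smallest-magnitude weights, rewind the survivors to their initial values, and retrain. This is not the paper's implementation; `model_fn`, `train`, and the pruning schedule are assumed stand-ins for a real model constructor and training loop.

```python
import copy
import torch

def find_winning_ticket(model_fn, train, prune_fraction=0.2, rounds=3):
    model = model_fn()
    init_state = copy.deepcopy(model.state_dict())         # weights at initialisation
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train(model, masks)                                 # train with the current mask applied
        with torch.no_grad():
            for name, param in model.named_parameters():
                # Prune the smallest-magnitude weights that are still alive.
                alive = param[masks[name].bool()].abs()
                if alive.numel() == 0:
                    continue
                threshold = torch.quantile(alive, prune_fraction)
                masks[name] *= (param.abs() > threshold).float()
            # Rewind the surviving weights to their original initialisation.
            model.load_state_dict(init_state)
            for name, param in model.named_parameters():
                param.mul_(masks[name])
    return model, masks
```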