This assumption is mostly satisfied in datasets such as ImageNet where there is a large, centered object, which is highly likely to be present in random crops of the full image.
In this work, we present Pyramid Adversarial Training, a simple and effective technique to improve ViT's overall performance.
Ranked #3 on Domain Generalization on ImageNet-C (using extra training data)
Disentangled visual representations have largely been studied with generative models such as Variational AutoEncoders (VAEs).
We use our reconstruction model as a tool for exploring the nature of representations, including: the influence of model architecture and training objectives (specifically robust losses), the forms of invariance that networks achieve, representational differences between correctly and incorrectly classified images, and the effects of manipulating logits and images.
Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning.
Ranked #50 on Self-Supervised Image Classification on ImageNet
Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state-of-the-art performance in the unsupervised training of deep image models.
Ranked #349 on Image Classification on ImageNet
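The contrastive objective behind these results is commonly formulated as an InfoNCE loss: each embedding of one view must identify its paired embedding of the other view among all other samples in the batch. The NumPy sketch below is a minimal illustration, not any specific paper's implementation; the batch size, embedding dimension, and temperature are hypothetical choices.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """One direction of the InfoNCE contrastive loss: embedding i of view 1
    must match embedding i of view 2 against all other batch entries."""
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal: view 1 of image i with view 2 of image i.
    return -float(np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))                 # toy embeddings of view 1
z2 = z1 + 0.01 * rng.normal(size=(8, 16))     # view 2: a slight perturbation
loss = info_nce_loss(z1, z2)
```

Because the two views here are nearly identical, the loss is far below the log(8) that a uniform guess over the batch would give; in practice the views come from heavy, independent augmentations.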
The focus of recent meta-learning research has been on the development of learning algorithms that can quickly adapt to test time tasks with limited data and low computational cost.
We demonstrate that this objective ignores important structural knowledge of the teacher network.
Ranked #14 on Knowledge Distillation on ImageNet
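For context, the objective being criticized here is classic logit matching between teacher and student (Hinton et al.'s distillation loss), which compares each sample's output distribution in isolation. A minimal NumPy sketch follows; the toy logits and temperature are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Logit-matching distillation: KL divergence between temperature-softened
    teacher and student distributions, scaled by T^2. Each sample is compared
    independently, so relations *between* samples in the teacher's
    representation space go unused."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1)
    return float(np.mean(kl) * T * T)

teacher = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])
student = np.array([[1.5, 0.8, -0.5], [0.0, 2.5, 0.4]])
mismatch = kd_loss(student, teacher)   # positive: distributions differ
perfect = kd_loss(teacher, teacher)    # zero: identical distributions
```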
Image extension models have broad applications in image editing, computational photography and computer graphics.
Ranked #2 on Uncropping on Places2 val
Using this regularizer, we exceed the current state of the art and achieve 47% adversarial accuracy for ImageNet with l-infinity adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack.
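The threat model above can be illustrated with projected gradient descent (PGD), a standard untargeted white-box attack under an l-infinity ball. The NumPy sketch below uses a toy linear loss as a stand-in for a real model's gradient; apart from the 4/255 radius quoted above, the step size, iteration count, and toy model are hypothetical choices.

```python
import numpy as np

def pgd_linf(x, grad_fn, epsilon=4/255, step_size=1/255, steps=10):
    """Untargeted PGD attack within an l-infinity ball of radius epsilon.
    grad_fn returns dLoss/dx at the current adversarial point."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + step_size * np.sign(g)            # gradient-ascent step
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # project into the ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # stay a valid image
    return x_adv

# Toy stand-in for a model: a linear loss L(x) = sum(w * x), so dL/dx = w.
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=(3, 4))   # "image" pixels in [0, 1]
w = rng.normal(size=(3, 4))
x_adv = pgd_linf(x, lambda xa: w)
```

After the projection step, every pixel of the adversarial example stays within 4/255 of the original, which is exactly the perturbation budget of the attack described above.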
We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics.
Ranked #40 on Self-Supervised Action Recognition on UCF101
This operator can learn a strict superset of what can be learned by average pooling or convolutions.
In this paper, we propose such a measure, and conduct extensive empirical studies on how well it can predict the generalization gap.
We study the problem of reconstructing an image from information stored at contour locations.
We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph.
Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks.
However, by focusing only on creating a mapping or shared representation between the two domains, they ignore the individual characteristics of each domain.
Ranked #1 on Domain Adaptation on Synth Digits-to-SVHN
We demonstrate that this framework works well on two important mid-level vision tasks: intrinsic image decomposition and depth from an RGB image.
We propose a self-supervised framework that learns to group visual entities based on their rate of co-occurrence in space and time.