Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, i.e., the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
On K-Nearest-Neighbor image retrieval evaluation with ImageNet-1k, the same model outperforms the baseline by 1.32%.
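For concreteness, a kNN retrieval evaluation of this form can be sketched as follows (a minimal illustration in PyTorch; the cosine-similarity metric and the value of k are assumptions, not the paper's exact protocol):

```python
import torch
import torch.nn.functional as F

def knn_retrieval_accuracy(query_emb, gallery_emb, query_labels, gallery_labels, k=5):
    """Fraction of queries whose k nearest gallery neighbors (by cosine
    similarity) include at least one image of the same class."""
    q = F.normalize(query_emb, dim=1)        # (Nq, D)
    g = F.normalize(gallery_emb, dim=1)      # (Ng, D)
    topk = (q @ g.t()).topk(k, dim=1).indices                 # (Nq, k)
    hits = (gallery_labels[topk] == query_labels[:, None]).any(dim=1)
    return hits.float().mean().item()
```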
We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models using orders of magnitude fewer paired image-text examples (only 3M).
We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE).
2 code implementations • 1 Jun 2023 • Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan
Pre-trained large text-to-image models synthesize impressive images with appropriate use of text prompts.
We investigate the potential of learning visual representations using synthetic images generated by text-to-image models.
In this paper, we propose a method of learning representations that are equivariant, rather than invariant, to data augmentations.
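A minimal sketch of one way to encode such an objective (the feature-space predictor is an assumption for illustration, not the paper's exact architecture): instead of forcing representations of augmented views to collapse to the same point, the augmented view's representation must be recoverable from the clean one plus the augmentation parameters.

```python
import torch
import torch.nn as nn

class EquivarianceLoss(nn.Module):
    """f(t(x)) should be predictable from f(x) and the parameters of t,
    so augmentation information is preserved rather than discarded."""
    def __init__(self, feat_dim, aug_dim):
        super().__init__()
        self.predictor = nn.Sequential(        # "applies" t in feature space
            nn.Linear(feat_dim + aug_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, z_clean, z_aug, aug_params):
        # aug_params encodes the sampled augmentation, e.g. crop box or rotation angle
        z_pred = self.predictor(torch.cat([z_clean, aug_params], dim=1))
        return (z_pred - z_aug).pow(2).sum(dim=1).mean()
```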
3 code implementations • 2 Jan 2023 • Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan
Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to its use of discrete tokens and fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to its use of parallel decoding (sketched below).
Ranked #20 on Text-to-Image Generation on COCO
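A rough sketch of the confidence-based parallel decoding referred to above (MaskGIT-style; the `model` signature, the mask schedule, and the step count are placeholders, not Muse's exact implementation):

```python
import math
import torch

@torch.no_grad()
def parallel_decode(model, text_emb, seq_len, mask_id, steps=12):
    """Iteratively unmask discrete image tokens. `model(tokens, text_emb)` is
    assumed to return logits of shape (B, seq_len, vocab_size)."""
    B = text_emb.shape[0]
    tokens = torch.full((B, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        probs = model(tokens, text_emb).softmax(dim=-1)
        pred = probs.argmax(dim=-1)                  # predict all positions at once
        conf = probs.max(dim=-1).values
        conf = conf.masked_fill(tokens != mask_id, float("inf"))  # keep decided tokens
        tokens = torch.where(tokens == mask_id, pred, tokens)
        # cosine schedule: how many tokens stay masked after this step
        n_mask = math.floor(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_mask > 0:
            idx = conf.argsort(dim=1)[:, :n_mask]    # least-confident positions
            tokens.scatter_(1, idx, mask_id)
    return tokens
```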
In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning.
Ranked #2 on Unconditional Image Generation on ImageNet 256x256
Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models.
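A minimal sketch of how the three ingredients combine into a single objective (the loss weights and temperature are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def can_objective(z1, z2, target_patches, recon_patches, noise, noise_pred,
                  temperature=0.1, w_recon=1.0, w_noise=1.0):
    """(C) InfoNCE between embeddings of two augmented views,
    (A) reconstruction of masked patches,
    (N) prediction of the noise added to visible patches."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    contrastive = F.cross_entropy(logits, torch.arange(z1.shape[0]))
    recon = F.mse_loss(recon_patches, target_patches)  # masked-autoencoder term
    denoise = F.mse_loss(noise_pred, noise)            # diffusion-style term
    return contrastive + w_recon * recon + w_noise * denoise
```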
1 code implementation • Andrew B. Sellergren, Christina Chen, Zaid Nabulsi, Yuanzhen Li, Aaron Maschinot, Aaron Sarna, Jenny Huang, Charles Lau, Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia-Vicente, David Melnick, Yun Liu, Krish Eswaran, Daniel Tse, Neeral Beladia, Dilip Krishnan, Shravya Shetty
Supervised contrastive learning enabled performance comparable to state-of-the-art deep learning models in multiple clinical tasks by using as few as 45 images and is a promising method for predictive modeling with use of small data sets and for predicting outcomes in shifting patient populations.
This assumption is mostly satisfied in datasets such as ImageNet where there is a large, centered object, which is highly likely to be present in random crops of the full image.
In this work, we present pyramid adversarial training (PyramidAT), a simple and effective technique to improve ViT's overall performance.
Ranked #9 on Domain Generalization on ImageNet-C (using extra training data)
Accurately predicting the likelihood of interaction between two objects (compound-protein sequence, user-item, author-paper, etc.)
Disentangled visual representations have largely been studied with generative models such as Variational AutoEncoders (VAEs).
We use our reconstruction model as a tool for exploring the nature of representations, including: the influence of model architecture and training objectives (specifically robust losses), the forms of invariance that networks achieve, representational differences between correctly and incorrectly classified images, and the effects of manipulating logits and images.
Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.
Ranked #2 on Contrastive Learning on imagenet-1k
Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state-of-the-art performance in the unsupervised training of deep image models; the supervised variant of this loss is sketched below.
Ranked #2 on Class Incremental Learning on cifar100
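A minimal sketch of the batch contrastive loss in its supervised form, where every same-label sample in the batch counts as a positive (assumes L2-normalized features; the temperature is illustrative):

```python
import torch

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """features: (N, D) L2-normalized embeddings; labels: (N,) class ids."""
    sim = features @ features.t() / temperature
    n = sim.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-similarity
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # average log-likelihood of each anchor's positives (zero out the -inf diagonal)
    mean_log_prob_pos = (log_prob.masked_fill(self_mask, 0.0) * pos_mask).sum(1) \
                        / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```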
The focus of recent meta-learning research has been on the development of learning algorithms that can quickly adapt to test time tasks with limited data and low computational cost.
We demonstrate that this objective ignores important structural knowledge of the teacher network (the standard objective is sketched below).
Ranked #10 on Knowledge Distillation on CIFAR-100
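For reference, the objective being critiqued is the standard soft-target distillation loss, which matches teacher and student per example and carries no cross-example structure (a minimal sketch; the temperature T is illustrative):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```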
Image extension models have broad applications in image editing, computational photography and computer graphics.
Ranked #2 on Uncropping on Places2 val
Using this regularizer, we exceed the current state of the art and achieve 47% adversarial accuracy for ImageNet with l-infinity adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack.
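The evaluation behind numbers like these is typically a projected-gradient-descent (PGD) attack; a minimal sketch under an l-infinity budget (the step size and iteration count are illustrative, and inputs are assumed to lie in [0, 1]):

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=4/255, alpha=1/255, steps=20):
    """Untargeted white-box attack: maximize the loss within an l-inf ball."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent step
            delta.clamp_(-eps, eps)              # project back into the ball
            delta.grad.zero_()
    return (x + delta.detach()).clamp(0, 1)      # assumes pixel range [0, 1]
```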
We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics (both objectives are sketched below).
Ranked #47 on Self-Supervised Action Recognition on UCF101
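A minimal sketch of the two objectives being compared (the temperature and predictor are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def cross_view_prediction_loss(z1, z2, predictor):
    """The alternative: directly regress one view's features from the other."""
    return F.mse_loss(predictor(z1), z2.detach())

def multiview_contrastive_loss(z1, z2, temperature=0.07):
    """InfoNCE: each z1[i] must pick out its paired z2[i] among the batch,
    discriminating rather than reconstructing the other view."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.shape[0]))
```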
This operator can learn a strict superset of what can be learned by average pooling or convolutions.
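To illustrate the claim (this is a generic construction, not the paper's operator): a pooling layer whose window weights are predicted from the input reduces exactly to average pooling when the weight head outputs a constant, so it strictly subsumes that baseline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedPool2d(nn.Module):
    """Softmax-weighted window pooling; uniform logits recover average pooling."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.k = kernel_size
        self.weight_head = nn.Conv2d(channels, channels, 1)  # per-pixel logits

    def forward(self, x):
        B, C, H, W = x.shape
        logits = self.weight_head(x)
        # unfold into non-overlapping k*k windows: (B, C*k*k, L) -> (B, C, k*k, L)
        lw = F.unfold(logits, self.k, stride=self.k).view(B, C, self.k * self.k, -1)
        xw = F.unfold(x, self.k, stride=self.k).view(B, C, self.k * self.k, -1)
        out = (lw.softmax(dim=2) * xw).sum(dim=2)   # weighted average per window
        return out.view(B, C, H // self.k, W // self.k)
```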
In this paper, we propose such a measure, and conduct extensive empirical studies on how well it can predict the generalization gap.
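A hedged sketch in this spirit (the normalization and summary statistics are illustrative, not the paper's exact measure): summarize the distribution of classification margins, whose statistics can then be fed to a simple regressor to predict the generalization gap.

```python
import torch

def margin_signature(logits, labels, quantiles=(0.25, 0.5, 0.75)):
    """Margin = true-class logit minus best other logit, scale-normalized."""
    true = logits.gather(1, labels[:, None]).squeeze(1)
    others = logits.scatter(1, labels[:, None], float("-inf"))
    margins = true - others.max(dim=1).values
    margins = margins / margins.abs().median().clamp(min=1e-8)
    return torch.quantile(margins, torch.tensor(quantiles))
```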
We study the problem of reconstructing an image from information stored at contour locations.
We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph.
Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks.
However, by focusing only on creating a mapping or shared representation between the two domains, they ignore the individual characteristics of each domain.
Ranked #1 on Domain Adaptation on Synth Objects-to-LINEMOD
We demonstrate that this framework works well on two important mid-level vision tasks: intrinsic image decomposition and depth from an RGB image.
We propose a self-supervised framework that learns to group visual entities based on their rate of co-occurrence in space and time.