Generative Pretraining from Pixels
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
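The core idea above — treating an image as a 1D token sequence and training with a next-token objective — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 512-entry bit-shift palette below is an assumed stand-in for the clustered color palette used in practice, and the raster-scan order is likewise an illustrative choice.

```python
import numpy as np

# Toy 32x32 RGB image standing in for a low-resolution ImageNet sample.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Quantize each pixel to one discrete token by keeping the top 3 bits of
# each channel (a 512-entry palette; an illustrative simplification).
tokens = ((image[..., 0] >> 5).astype(np.int64) << 6
          | (image[..., 1] >> 5).astype(np.int64) << 3
          | (image[..., 2] >> 5).astype(np.int64))

# Flatten in raster order: the Transformer sees only a 1D sequence and is
# given no knowledge of the 2D structure.
seq = tokens.reshape(-1)            # shape (1024,)

# Autoregressive objective: predict token t from all tokens before it,
# i.e. inputs are seq[:-1] and targets are the sequence shifted by one.
inputs, targets = seq[:-1], seq[1:]
assert seq.shape == (1024,) and int(seq.max()) < 512
assert inputs.shape == targets.shape == (1023,)
```

A sequence Transformer with a causal attention mask would then be trained with cross-entropy between its predictions and `targets`, exactly as in language modeling.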
Published at ICML 2020.
Results from the Paper
Ranked #14 on Image Classification on STL-10 (using extra training data).
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Self-Supervised Image Classification | ImageNet | iGPT-XL (64x64, 15360 features) | Top 1 Accuracy | 72.0% | #67 | |
| Self-Supervised Image Classification | ImageNet | iGPT-XL (64x64, 15360 features) | Number of Params | 6801M | #1 | |
| Self-Supervised Image Classification | ImageNet | iGPT-XL (64x64, 3072 features) | Top 1 Accuracy | 68.7% | #74 | |
| Self-Supervised Image Classification | ImageNet | iGPT-L (48x48) | Top 1 Accuracy | 65.2% | #82 | |
| Self-Supervised Image Classification | ImageNet | iGPT-L (32x32) | Top 1 Accuracy | 60.3% | #92 | |
| Image Classification | STL-10 | iGPT-L | Percentage correct | 97.1 | #14 | Yes |
| Image Classification | STL-10 | AMDIM-L | Percentage correct | 94.2 | #22 | |
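The linear-probe numbers in the table come from fitting a linear classifier on frozen features extracted from the pretrained model. Below is a hedged sketch of that evaluation protocol on synthetic data: the random features, toy dimensions, and the closed-form ridge-regression fit are all illustrative stand-ins (the paper uses logistic regression on the Transformer's hidden states, e.g. the 3072- or 15360-dimensional features named above).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 64, 10                      # samples, feature dim, classes (toy values)
W_true = rng.normal(size=(d, k))
feats = rng.normal(size=(n, d))            # stand-in for frozen pretrained features
labels = (feats @ W_true).argmax(axis=1)   # toy labels that are linearly decodable

# Linear probe: one-vs-rest ridge regression onto one-hot targets, solved in
# closed form. The backbone is never updated; only this linear map is fit.
Y = np.eye(k)[labels]
W = np.linalg.solve(feats.T @ feats + 1e-3 * np.eye(d), feats.T @ Y)

# Probe accuracy measures how linearly accessible the label is in the features.
acc = float((np.argmax(feats @ W, axis=1) == labels).mean())
assert W.shape == (d, k) and 0.0 <= acc <= 1.0
```

Because only a linear map is trained, probe accuracy is a direct measure of representation quality: the 72.0% ImageNet top-1 entry above says the label is that linearly decodable from iGPT-XL's frozen features.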