Generative Pretraining from Pixels

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
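
The pixel-prediction objective described in the abstract can be illustrated with a minimal sketch. It flattens an image into a 1D sequence in raster order, so the 2D structure is not exposed to the predictor, and scores each pixel given only the pixels before it. This is not the paper's implementation: the `predict` callable, the `uniform_predict` baseline, and the four-value palette `VOCAB_SIZE` are hypothetical stand-ins for a trained Transformer and its real vocabulary.

```python
import math


def raster_flatten(image):
    """Flatten a 2D grid of discrete pixel values into a 1D sequence, row by row."""
    return [pixel for row in image for pixel in row]


def autoregressive_nll(sequence, predict):
    """Average negative log-likelihood of each pixel given all earlier pixels.

    `predict(prefix)` is a hypothetical stand-in for the model: it returns a
    list of probabilities over the pixel vocabulary for the next position.
    """
    total = 0.0
    for i, target in enumerate(sequence):
        probs = predict(sequence[:i])  # the model sees only the prefix
        total -= math.log(probs[target])
    return total / len(sequence)


VOCAB_SIZE = 4  # assumption: a tiny palette chosen only for illustration


def uniform_predict(prefix):
    """Toy predictor that ignores the prefix and assigns equal probability."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE


image = [[0, 1],
         [2, 3]]
sequence = raster_flatten(image)
loss = autoregressive_nll(sequence, uniform_predict)
```

A predictor that has learned nothing scores `log(VOCAB_SIZE)` per pixel under this loss, so any trained model should come in below that baseline.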

ICML 2020

Results from the Paper


Ranked #15 on Image Classification on STL-10 (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Self-Supervised Image Classification | ImageNet | iGPT-XL (64x64, 15360 features) | Top 1 Accuracy | 72.0% | #91 |
| Self-Supervised Image Classification | ImageNet | iGPT-XL (64x64, 15360 features) | Number of Params | 6801M | #1 |
| Self-Supervised Image Classification | ImageNet | iGPT-XL (64x64, 3072 features) | Top 1 Accuracy | 68.7% | #98 |
| Self-Supervised Image Classification | ImageNet | iGPT-L (48x48) | Top 1 Accuracy | 65.2% | #106 |
| Self-Supervised Image Classification | ImageNet | iGPT-L (32x32) | Top 1 Accuracy | 60.3% | #116 |
| Image Classification | STL-10 | iGPT-L | Percentage correct | 97.1 | #15 |
| Image Classification | STL-10 | AMDIM-L | Percentage correct | 94.2 | #24 |

Methods