Generative Pretraining from Pixels

ICML 2020 · Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure... Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
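
The phrase "auto-regressively predict pixels, without incorporating knowledge of the 2D input structure" can be pictured with a small sketch. This is an illustration, not the authors' code: it assumes pixels have already been mapped to integer tokens and are read row by row, and the helper names `raster_flatten` and `next_pixel_pairs` are hypothetical.

```python
def raster_flatten(image):
    """Flatten a 2D grid of pixel tokens into a 1D sequence, row by row."""
    return [token for row in image for token in row]


def next_pixel_pairs(sequence):
    """Split a sequence into next-token prediction inputs and targets.

    inputs holds tokens 0..n-2 and targets holds tokens 1..n-1, so a model
    that has read inputs[:i + 1] is asked to predict targets[i].
    """
    inputs = sequence[:-1]
    targets = sequence[1:]
    return inputs, targets


# Hypothetical 2x3 image whose pixels are already integer tokens (assumption).
image = [
    [0, 1, 2],
    [3, 4, 5],
]
sequence = raster_flatten(image)
inputs, targets = next_pixel_pairs(sequence)
```

After flattening, the model sees only a 1D token stream; any notion of rows and columns has to be learned from the data rather than built into the architecture.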


Results from the Paper

Ranked #11 on Image Classification on STL-10 (using extra training data)

Task                                  Dataset   Model                            Metric Name         Metric Value  Global Rank  Uses Extra Training Data
Self-Supervised Image Classification  ImageNet  iGPT-L (32x32)                   Top 1 Accuracy      60.3%         #68
Self-Supervised Image Classification  ImageNet  iGPT-L (48x48)                   Top 1 Accuracy      65.2%         #59
Self-Supervised Image Classification  ImageNet  iGPT-XL (64x64, 3072 features)   Top 1 Accuracy      68.7%         #52
Self-Supervised Image Classification  ImageNet  iGPT-XL (64x64, 15360 features)  Top 1 Accuracy      72.0%         #45
Image Classification                  STL-10    AMDIM-L                          Percentage correct  94.2          #19
Image Classification                  STL-10    iGPT-L                           Percentage correct  97.1          #11