Exploring the Limits of Weakly Supervised Pretraining

State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards "small". Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.

ECCV 2018
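The recipe the abstract describes, pretraining a very large ConvNet to predict hashtags on social media images and then transferring it to target tasks, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration only: it assumes torchvision's ResNeXt-101 32x8d as the backbone and a plain cross-entropy head over a roughly 17k hashtag vocabulary (matching the IG-3.5B-17k dataset listed below); the actual loss formulation, data pipeline, and optimization schedule used in the paper are not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnext101_32x8d

# Illustrative sketch of the two-stage recipe in the abstract. The hashtag
# vocabulary size (~17k) matches IG-3.5B-17k; the loss and optimizer
# settings below are assumptions, not the paper's exact configuration.
NUM_HASHTAGS = 17_000

# Stage 1: pretrain on hashtag prediction over a large hashtag vocabulary.
model = resnext101_32x8d()
model.fc = nn.Linear(model.fc.in_features, NUM_HASHTAGS)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

def pretrain_step(images, hashtag_targets):
    """One optimization step of weakly supervised hashtag prediction."""
    loss = criterion(model(images), hashtag_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2: transfer to ImageNet-1k by replacing the classifier head and
# fine-tuning (the paper also evaluates the pretrained trunk as a fixed
# feature extractor).
model.fc = nn.Linear(model.fc.in_features, 1000)
```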

Datasets


Introduced in the Paper:

IG-3.5B-17k

Used in the Paper:

ImageNet, COCO, Places, JFT-300M

Results from the Paper


Task: Image Classification    Dataset: ImageNet

Model                 Metric              Value    Global Rank
ResNeXt-101 32x32d    Top 1 Accuracy      85.1%    # 226
ResNeXt-101 32x32d    Top 5 Accuracy      97.5%    # 27
ResNeXt-101 32x32d    Number of params    466M     # 886
ResNeXt-101 32x32d    GFLOPs              174      # 452
ResNeXt-101 32x48d    Top 1 Accuracy      85.4%    # 203
ResNeXt-101 32x48d    Top 5 Accuracy      97.6%    # 24
ResNeXt-101 32x48d    Number of params    829M     # 905
ResNeXt-101 32x48d    GFLOPs              306      # 463
ResNeXt-101 32x16d    Top 1 Accuracy      84.2%    # 286
ResNeXt-101 32x16d    Top 5 Accuracy      97.2%    # 39
ResNeXt-101 32x16d    Number of params    194M     # 849
ResNeXt-101 32x16d    GFLOPs              72       # 426
ResNeXt-101 32x8d     Top 1 Accuracy      82.2%    # 473
ResNeXt-101 32x8d     Top 5 Accuracy      96.4%    # 78
ResNeXt-101 32x8d     Number of params    88M      # 788

Methods