Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) the availability of large-scale labeled data. Since 2012, there have been significant advances in the representation capabilities of models and in the computational power of GPUs. But the size of the biggest dataset has, surprisingly, remained constant. What will happen if we increase the dataset size by 10x or 100x? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between `enormous data' and visual deep learning. By exploiting the JFT-300M dataset, which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data were used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that performance on vision tasks increases logarithmically with the volume of training data. Second, we show that representation learning (or pre-training) still holds a lot of promise: one can improve performance on many vision tasks just by training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation, and human pose estimation. Our sincere hope is that this inspires the vision community not to undervalue data and to develop collective efforts in building larger datasets.
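The logarithmic scaling claim can be made concrete with a small curve-fitting sketch. The data points below are purely illustrative placeholders (they are not the paper's reported numbers, except that 79.2% at 300M matches the ImageNet result listed further down); the sketch just shows what "performance grows logarithmically with data volume" means operationally: fit accuracy against log10 of the dataset size and read off the gain per decade of data.

```python
import numpy as np

# Illustrative (assumed) top-1 accuracies at increasing pre-training set
# sizes -- placeholders for the paper's trend, not its exact measurements.
sizes = np.array([10e6, 30e6, 100e6, 300e6])   # number of training images
acc = np.array([70.0, 73.1, 76.4, 79.2])       # top-1 accuracy in %

# Fit acc ~= a * log10(N) + b, the logarithmic trend the paper observes.
a, b = np.polyfit(np.log10(sizes), acc, 1)

# Extrapolate (cautiously) one decade further, to a hypothetical 3B images.
pred_3b = a * np.log10(3e9) + b
print(f"gain per decade of data: {a:.2f} points; 3B-image extrapolation: {pred_3b:.1f}%")
```

Under this fit, each 10x increase in data buys a roughly constant number of accuracy points, which is exactly why the abstract argues that 10x or 100x more data should still help rather than saturate.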

ICCV 2017

Datasets


Introduced in the Paper:

JFT-300M

Used in the Paper:

ImageNet, MS COCO, PASCAL VOC 2007
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Object Detection | COCO test-dev | Faster R-CNN (ImageNet+300M) | box mAP | 37.4 | # 201 |
| | | | AP50 | 58.0 | # 138 |
| | | | AP75 | 40.1 | # 140 |
| | | | APS | 17.5 | # 133 |
| | | | APM | 41.1 | # 128 |
| | | | APL | 51.2 | # 122 |
| Pose Estimation | COCO test-dev | Faster R-CNN (ImageNet+300M) | AP | 64.4 | # 38 |
| | | | AP50 | 85.7 | # 38 |
| | | | AP75 | 70.7 | # 34 |
| | | | APL | 69.8 | # 37 |
| | | | APM | 61.8 | # 31 |
| Image Classification | ImageNet | ResNet-101 (JFT-300M Finetuning) | Top 1 Accuracy | 79.2% | # 710 |
| Semantic Segmentation | PASCAL VOC 2007 | DeepLabv3 (ImageNet+300M) | Mean IoU | 81.3 | # 2 |
| Semantic Segmentation | PASCAL VOC 2012 val | DeepLabv3 (ImageNet+300M) | mIoU | 76.5% | # 19 |