The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
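
Since the results below are reported for frozen backbones, a minimal sketch of how such features could be extracted may help. It assumes the torch.hub entry points published in the facebookresearch/dinov2 repository; "example.jpg" is a placeholder path, not an artifact of the paper.

```python
# Sketch (not the paper's code): extract frozen DINOv2 features via torch.hub.
import torch
from PIL import Image
from torchvision import transforms

# Distilled ViT-S/14 backbone; "dinov2_vitb14", "dinov2_vitl14" and
# "dinov2_vitg14" follow the same pattern.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Standard ImageNet-style preprocessing; the input side should be a
# multiple of the 14-pixel patch size (224 = 16 * 14).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(img)   # (1, 384) global embedding for ViT-S/14
print(features.shape)
```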

Results from the Paper


Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Semantic Segmentation | ADE20K | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2Former) | Validation mIoU | 60.2 | #11
Semantic Segmentation | ADE20K | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2Former) | Params (M) | 1100 | #6
Image Retrieval | AmsterTime | DINOv2 distilled (ViT-S/14 frozen) | mAP | 43.5 | #4
Image Retrieval | AmsterTime | DINOv2 distilled (ViT-B/14 frozen) | mAP | 45.6 | #3
Image Retrieval | AmsterTime | DINOv2 (ViT-g/14 frozen) | mAP | 46.7 | #2
Image Retrieval | AmsterTime | DINOv2 distilled (ViT-L/14 frozen) | mAP | 50.0 | #1
Image Classification | CIFAR-10 | DINOv2 (ViT-g/14, frozen model, linear eval) | Percentage correct | 99.5 | #1
Image Classification | CIFAR-10 | DINOv2 (ViT-g/14, frozen model, linear eval) | Params | 1100M | #237
Self-Supervised Image Classification | ImageNet | DINOv2 (ViT-g/14 @448) | Top 1 Accuracy | 86.7% | #1
Self-Supervised Image Classification | ImageNet | DINOv2 (ViT-g/14 @448) | Number of Params | 1100M | #3
Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-B/14) | Top 1 Accuracy | 84.5% | #5
Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-B/14) | Number of Params | 85M | #38
Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-L/14) | Top 1 Accuracy | 86.3% | #3
Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-L/14) | Number of Params | 307M | #16
Self-Supervised Image Classification | ImageNet | DINOv2 (ViT-g/14) | Top 1 Accuracy | 86.5% | #2
Self-Supervised Image Classification | ImageNet | DINOv2 (ViT-g/14) | Number of Params | 1100M | #3
Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-S/14) | Top 1 Accuracy | 81.1% | #18
Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-S/14) | Number of Params | 21M | #77
Domain Generalization | ImageNet-C | DINOv2 (ViT-g/14, frozen model, linear eval) | mean Corruption Error (mCE) | 28.2 | #1
Domain Generalization | ImageNet-C | DINOv2 (ViT-g/14, frozen model, linear eval) | Number of Params | 1100M | #43
Domain Generalization | ImageNet-C | DINOv2 (ViT-S/14, frozen model, linear eval) | mean Corruption Error (mCE) | 54.4 | #30
Domain Generalization | ImageNet-C | DINOv2 (ViT-S/14, frozen model, linear eval) | Number of Params | 21M | #29
Domain Generalization | ImageNet-C | DINOv2 (ViT-B/14, frozen model, linear eval) | mean Corruption Error (mCE) | 42.7 | #19
Domain Generalization | ImageNet-C | DINOv2 (ViT-B/14, frozen model, linear eval) | Number of Params | 85M | #34
Domain Generalization | ImageNet-C | DINOv2 (ViT-L/14, frozen model, linear eval) | mean Corruption Error (mCE) | 31.5 | #4
Domain Generalization | ImageNet-C | DINOv2 (ViT-L/14, frozen model, linear eval) | Number of Params | 307M | #39
Self-Supervised Image Classification | ImageNet (finetuned) | DINOv2 (ViT-g/14, 448) | Number of Params | 1100M | #2
Self-Supervised Image Classification | ImageNet (finetuned) | DINOv2 (ViT-g/14, 448) | Top 1 Accuracy | 88.9% | #1
Self-Supervised Image Classification | ImageNet (finetuned) | DINOv2 (ViT-g/14) | Number of Params | 1100M | #2
Self-Supervised Image Classification | ImageNet (finetuned) | DINOv2 (ViT-g/14) | Top 1 Accuracy | 88.5% | #3
Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | absolute relative error | 0.0652 | #32
Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMSE | 2.1128 | #20
Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Sq Rel | 0.1797 | #4
Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMSE log | 0.0882 | #26
Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25 | 0.968 | #22
Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25^2 | 0.997 | #13
Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25^3 | 0.9993 | #7
Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMS | 0.279 | #2
Monocular Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMSE | 0.279 | #12
Monocular Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | absolute relative error | 0.0907 | #25
Monocular Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25 | 0.9497 | #12
Monocular Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25^2 | 0.996 | #5
Monocular Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25^3 | 0.9994 | #2
Monocular Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | log 10 | 0.0371 | #19
Fine-Grained Image Classification | Oxford-IIIT Pet Dataset | DINOv2 (ViT-g/14, frozen model, linear eval) | Accuracy | 96.7 | #2
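
Several rows above use a "frozen model, linear eval" protocol: the pretrained backbone is never updated and only a linear classifier is trained on its features. The sketch below illustrates that protocol in simplified form; it is not the paper's evaluation code (which explores richer feature combinations), and the dataset and loader details are placeholders.

```python
# Sketch of linear evaluation on frozen DINOv2 features (simplified).
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False          # backbone stays frozen

num_classes = 10                     # e.g. CIFAR-10
head = nn.Linear(384, num_classes)   # 384 = ViT-S/14 embedding dim
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One linear-probe update: features are computed without gradients."""
    with torch.no_grad():
        feats = backbone(images)     # (B, 384) frozen features
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```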

Results from Other Papers


Task | Dataset | Model | Metric Name | Metric Value | Rank
Visual Place Recognition | 17 Places | DINOv2 | Recall@1 | 61.82 | #3
Visual Place Recognition | Baidu Mall | DINOv2 | Recall@1 | 49.21 | #5
Visual Place Recognition | Gardens Point | DINOv2 | Recall@1 | 71.50 | #5
Visual Place Recognition | Hawkins | DINOv2 | Recall@1 | 27.97 | #6
Visual Place Recognition | Laurel Caverns | DINOv2 | Recall@1 | 40.18 | #3
Visual Place Recognition | Mid-Atlantic Ridge | DINOv2 | Recall@1 | 24.75 | #6
Visual Place Recognition | Nardo-Air | DINOv2 | Recall@1 | 73.24 | #2
Visual Place Recognition | Nardo-Air R | DINOv2 | Recall@1 | 71.83 | #6
Visual Place Recognition | Oxford RobotCar Dataset | DINOv2 | Recall@1 | 39.79 | #5
Visual Place Recognition | Pittsburgh-30k-test | DINOv2 | Recall@1 | 78.32 | #10
Visual Place Recognition | St Lucia | DINOv2 | Recall@1 | 78.62 | #5
Visual Place Recognition | VP-Air | DINOv2 | Recall@1 | 45.23 | #2
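
The visual place recognition rows above report Recall@1: the fraction of query images whose single nearest database image, by descriptor similarity, is a true match. The sketch below shows how that metric could be computed from global descriptors; the arrays and ground-truth matrix are placeholders, and real benchmarks define matches by geographic distance rather than a random mask.

```python
# Sketch of Recall@1 for retrieval-style place recognition.
import numpy as np

def recall_at_1(query_feats, db_feats, is_match):
    """query_feats: (Q, D), db_feats: (N, D), is_match: (Q, N) boolean."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    d = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = q @ d.T                    # cosine similarity
    nearest = sims.argmax(axis=1)     # index of top-1 database image per query
    hits = is_match[np.arange(q.shape[0]), nearest]
    return hits.mean()

# Toy usage with random placeholder data (384-dim descriptors).
rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 384))
database = rng.normal(size=(20, 384))
ground_truth = rng.random((5, 20)) > 0.9   # placeholder match matrix
print(recall_at_1(queries, database, ground_truth))
```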

Methods