DINOv2: Learning Robust Visual Features without Supervision
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of relying on uncurated data, as is typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at both image and pixel levels.
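As a quick illustration of how these frozen backbones are consumed in practice, the sketch below loads one of the distilled checkpoints through PyTorch Hub and extracts a global image embedding. The hub entry points (`dinov2_vits14`, `dinov2_vitb14`, `dinov2_vitl14`, `dinov2_vitg14`) follow the `facebookresearch/dinov2` release; the image path and preprocessing constants are illustrative assumptions, not prescribed by the paper.

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a distilled DINOv2 backbone from PyTorch Hub
# (entry points from the facebookresearch/dinov2 release:
#  dinov2_vits14, dinov2_vitb14, dinov2_vitl14, dinov2_vitg14).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Standard ImageNet-style preprocessing (assumed here); input side
# lengths must be multiples of the 14-pixel patch size (224 = 16 * 14).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

# "img.jpg" is a placeholder path.
img = preprocess(Image.open("img.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(img)  # (1, 384) CLS embedding for ViT-S/14
print(features.shape)
```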
Results from the Paper
Ranked #1 on Image Retrieval on AmsterTime (using extra training data).
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2Former) | Validation mIoU | 60.2 | #11 |
| | | | Params (M) | 1080 | #8 |
| Image Retrieval | AmsterTime | DINOv2 distilled (ViT-S/14 frozen) | mAP | 43.5 | #4 |
| Image Retrieval | AmsterTime | DINOv2 distilled (ViT-L/14 frozen) | mAP | 50.0 | #1 |
| Image Retrieval | AmsterTime | DINOv2 distilled (ViT-B/14 frozen) | mAP | 45.6 | #3 |
| Image Retrieval | AmsterTime | DINOv2 (ViT-g/14 frozen) | mAP | 46.7 | #2 |
| Image Classification | CIFAR-10 | DINOv2 (ViT-g/14, frozen model, linear eval) | Percentage correct | 99.5 | #1 |
| Self-Supervised Image Classification | ImageNet | DINOv2 (ViT-g/14 @448) | Top 1 Accuracy | 86.7% | #2 |
| | | | Number of Params | 1100M | #8 |
| Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-S/14) | Top 1 Accuracy | 81.1% | #17 |
| | | | Number of Params | 21M | #135 |
| Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-B/14) | Top 1 Accuracy | 84.5% | #6 |
| | | | Number of Params | 85M | #62 |
| Self-Supervised Image Classification | ImageNet | DINOv2 distilled (ViT-L/14) | Top 1 Accuracy | 86.3% | #4 |
| | | | Number of Params | 307M | #27 |
| Self-Supervised Image Classification | ImageNet | DINOv2 (ViT-g/14) | Top 1 Accuracy | 86.5% | #3 |
| | | | Number of Params | 1100M | #8 |
| Domain Generalization | ImageNet-C | DINOv2 (ViT-g/14, frozen model, linear eval) | mean Corruption Error (mCE) | 28.2 | #1 |
| | | | Number of params | 1100M | #47 |
| Domain Generalization | ImageNet-C | DINOv2 (ViT-S/14, frozen model, linear eval) | mean Corruption Error (mCE) | 54.4 | #31 |
| | | | Number of params | 21M | #32 |
| Domain Generalization | ImageNet-C | DINOv2 (ViT-B/14, frozen model, linear eval) | mean Corruption Error (mCE) | 42.7 | #19 |
| | | | Number of params | 85M | #37 |
| Domain Generalization | ImageNet-C | DINOv2 (ViT-L/14, frozen model, linear eval) | mean Corruption Error (mCE) | 31.5 | #4 |
| | | | Number of params | 307M | #43 |
| Self-Supervised Image Classification | ImageNet (finetuned) | DINOv2 (ViT-g/14) | Top 1 Accuracy | 88.5% | #3 |
| | | | Number of Params | 1100M | #2 |
| Self-Supervised Image Classification | ImageNet (finetuned) | DINOv2 (ViT-g/14, 448) | Top 1 Accuracy | 88.9% | #1 |
| | | | Number of Params | 1100M | #2 |
| Monocular Depth Estimation | KITTI Eigen split | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | absolute relative error | 0.0652 | #40 |
| | | | RMSE | 2.1128 | #28 |
| | | | Sq Rel | 0.1797 | #27 |
| | | | RMSE log | 0.0882 | #34 |
| | | | Delta < 1.25 | 0.968 | #30 |
| | | | Delta < 1.25^2 | 0.997 | #21 |
| | | | Delta < 1.25^3 | 0.9993 | #12 |
| Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMSE | 0.279 | #2 |
| Monocular Depth Estimation | NYU-Depth V2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMSE | 0.279 | #20 |
| | | | absolute relative error | 0.0907 | #40 |
| | | | Delta < 1.25 | 0.9497 | #25 |
| | | | Delta < 1.25^2 | 0.996 | #11 |
| | | | Delta < 1.25^3 | 0.9994 | #6 |
| | | | log10 | 0.0371 | #26 |
| Fine-Grained Image Classification | Oxford-IIIT Pet Dataset | DINOv2 (ViT-g/14, frozen model, linear eval) | Accuracy | 96.7 | #3 |
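Most classification entries above (CIFAR-10, ImageNet, Oxford-IIIT Pet) report linear evaluation: the backbone stays frozen and only a single linear classifier is trained on its features. The snippet below is a minimal sketch of that protocol, with placeholder tensors standing in for features precomputed with the frozen backbone; the names, dimensions, and hyperparameters are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Linear evaluation sketch: train only a linear head on frozen features.
# `train_feats` / `train_labels` stand in for features precomputed with
# the frozen DINOv2 backbone (hypothetical: 1000 samples, 384-dim
# ViT-S/14 embeddings, 10 classes).
train_feats = torch.randn(1000, 384)          # placeholder features
train_labels = torch.randint(0, 10, (1000,))  # placeholder labels

classifier = nn.Linear(384, 10)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    # No gradient ever reaches the backbone: it only sees the
    # precomputed feature tensors.
    logits = classifier(train_feats)
    loss = criterion(logits, train_labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```

Because the backbone is never updated, features can be extracted once and reused across all downstream benchmarks, which is what makes the frozen-feature comparisons in the table meaningful.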