Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/copy-detection-on-copydays-strong-subset)](https://paperswithcode.com/sota/copy-detection-on-copydays-strong-subset?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/video-object-segmentation-on-davis-2017)](https://paperswithcode.com/sota/video-object-segmentation-on-davis-2017?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-hawkins)](https://paperswithcode.com/sota/visual-place-recognition-on-hawkins?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-laurel-caverns)](https://paperswithcode.com/sota/visual-place-recognition-on-laurel-caverns?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-mid-atlantic)](https://paperswithcode.com/sota/visual-place-recognition-on-mid-atlantic?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-17-places)](https://paperswithcode.com/sota/visual-place-recognition-on-17-places?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-gardens-point)](https://paperswithcode.com/sota/visual-place-recognition-on-gardens-point?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-nardo-air)](https://paperswithcode.com/sota/visual-place-recognition-on-nardo-air?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-nardo-air-r)](https://paperswithcode.com/sota/visual-place-recognition-on-nardo-air-r?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-vp-air)](https://paperswithcode.com/sota/visual-place-recognition-on-vp-air?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-baidu-mall)](https://paperswithcode.com/sota/visual-place-recognition-on-baidu-mall?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-oxford-robotcar-4)](https://paperswithcode.com/sota/visual-place-recognition-on-oxford-robotcar-4?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/image-classification-on-omnibenchmark)](https://paperswithcode.com/sota/image-classification-on-omnibenchmark?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-st-lucia)](https://paperswithcode.com/sota/visual-place-recognition-on-st-lucia?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/visual-place-recognition-on-pittsburgh-30k)](https://paperswithcode.com/sota/visual-place-recognition-on-pittsburgh-30k?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/image-retrieval-on-rparis-hard)](https://paperswithcode.com/sota/image-retrieval-on-rparis-hard?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/image-retrieval-on-rparis-medium)](https://paperswithcode.com/sota/image-retrieval-on-rparis-medium?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/image-retrieval-on-roxford-hard)](https://paperswithcode.com/sota/image-retrieval-on-roxford-hard?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/image-retrieval-on-roxford-medium)](https://paperswithcode.com/sota/image-retrieval-on-roxford-medium?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/self-supervised-image-classification-on)](https://paperswithcode.com/sota/self-supervised-image-classification-on?p=emerging-properties-in-self-supervised-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/emerging-properties-in-self-supervised-vision/self-supervised-image-classification-on-1)](https://paperswithcode.com/sota/self-supervised-image-classification-on-1?p=emerging-properties-in-self-supervised-vision)`

Emerging Properties in Self-Supervised Vision Transformers

ICCV 2021 · Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
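The abstract names two ingredients, self-distillation with no labels and a momentum encoder, without giving formulas. The sketch below is one minimal way to read those two ideas, not the authors' implementation: a student network is trained to match a teacher's softmax output through a cross-entropy loss, and the teacher's weights are kept as an exponential moving average of the student's. The function names, the temperature values, and the momentum value are all illustrative assumptions, and components such as multi-crop training are omitted.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, computed without taking log(0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy of the student's distribution against the teacher's.

    No labels are involved: the teacher output is the only target.
    Both temperatures are assumed values chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Momentum-encoder step: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Hypothetical toy values: the loss is small when student and teacher agree
# and large when they disagree.
loss_agree = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
loss_disagree = distillation_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0])

# With momentum 0.9 the teacher moves one tenth of the way to the student.
new_teacher = ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.9)
```

In a real training loop only the student would receive gradients from the loss; the teacher would change solely through `ema_update`, which is what makes it a momentum encoder in this reading.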

Code

facebookresearch/dino (official) · 5,836 stars

facebookresearch/vissl · 3,227 stars

lightly-ai/lightly · 2,741 stars

alibaba/EasyCV · 1,676 stars

vturrisi/solo-learn · 1,355 stars

See all 26 implementations

Tasks

Copy Detection

Image Classification

Image Retrieval

Self-Supervised Image Classification

Self-Supervised Learning

Semantic Segmentation

Single-object discovery

Video Object Detection

Video Object Segmentation

Visual Place Recognition

Datasets

ImageNet

DAVIS

DAVIS 2017

YFCC100M

Oxford RobotCar Dataset

OmniBenchmark

Results from the Paper

Ranked #2 on Copy Detection on Copydays strong subset

Get a GitHub badge

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Visual Place Recognition	17 Places	DINO	Recall@1	61.82	# 3
Copy Detection	Copydays strong subset	DINO (ViT-B/8)	mAP	85.5	# 2
Video Object Segmentation	DAVIS 2017	DINO (ViT-B/8, ImageNet retrain)	J&F	71.4	# 2
Visual Place Recognition	Gardens Point	DINO	Recall@1	78.50	# 3
Self-Supervised Image Classification	ImageNet	DINO (ResNet-50)	Top 1 Accuracy	75.3%	# 71
Self-Supervised Image Classification	ImageNet	DINO (ResNet-50)	Number of Params	24M	# 48
Self-Supervised Image Classification	ImageNet	DINO (ViT-B/16)	Top 1 Accuracy	78.2%	# 45
Self-Supervised Image Classification	ImageNet	DINO (ViT-B/16)	Number of Params	85M	# 38
Self-Supervised Image Classification	ImageNet	DINO (ViT-B/8)	Top 1 Accuracy	80.1%	# 27
Self-Supervised Image Classification	ImageNet	DINO (ViT-S/16)	Top 1 Accuracy	77.0%	# 53
Self-Supervised Image Classification	ImageNet	DINO (ViT-S/16)	Number of Params	21M	# 77
Self-Supervised Image Classification	ImageNet	DINO (ViT-S/8)	Top 1 Accuracy	79.7%	# 32
Self-Supervised Image Classification	ImageNet	DINO (ViT-S/8)	Number of Params	21M	# 77
Self-Supervised Image Classification	ImageNet	DINO (xcit_medium_24_p8)	Top 1 Accuracy	80.3%	# 24
Self-Supervised Image Classification	ImageNet	DINO (xcit_medium_24_p8)	Number of Params	84M	# 42
Self-Supervised Image Classification	ImageNet (finetuned)	DINO (ViT-B/16)	Number of Params	85M	# 39
Self-Supervised Image Classification	ImageNet (finetuned)	DINO (ViT-B/16)	Top 1 Accuracy	82.8%	# 48
Image Classification	OmniBenchmark	DINO	Average Top-1 Accuracy	38.9	# 8
Image Retrieval	ROxford (Hard)	DINO	mAP	24.3	# 18
Image Retrieval	ROxford (Medium)	DINO	mAP	51.5	# 18
Image Retrieval	RParis (Hard)	DINO	mAP	51.6	# 12
Image Retrieval	RParis (Medium)	DINO	mAP	75.3	# 12

Results from Other Papers

Task	Dataset	Model	Metric Name	Metric Value	Rank
Visual Place Recognition	Baidu Mall	DINO	Recall@1	48.30	# 6
Visual Place Recognition	Hawkins	DINO	Recall@1	46.61	# 2
Visual Place Recognition	Laurel Caverns	DINO	Recall@1	41.07	# 2
Visual Place Recognition	Mid-Atlantic Ridge	DINO	Recall@1	27.72	# 2
Visual Place Recognition	Nardo-Air	DINO	Recall@1	57.75	# 3
Visual Place Recognition	Nardo-Air R	DINO	Recall@1	84.51	# 4
Visual Place Recognition	Oxford RobotCar Dataset	DINO	Recall@1	15.71	# 7
Visual Place Recognition	Pittsburgh-30k-test	DINO	Recall@1	70.13	# 11
Visual Place Recognition	St Lucia	DINO	Recall@1	45.22	# 8
Visual Place Recognition	VP-Air	DINO	Recall@1	24.02	# 4

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • DINO • Dropout • k-NN • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
