self-DIstillation with NO labels

Introduced by Caron et al. in Emerging Properties in Self-Supervised Vision Transformers

DINO (self-distillation with no labels) is a self-supervised learning method in which a student network directly predicts the output of a teacher network, built with a momentum encoder, using a standard cross-entropy loss.

For simplicity, DINO is illustrated for a single pair of views $\left(x_{1}, x_{2}\right)$. The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. Each network outputs a $K$-dimensional feature that is normalized with a temperature softmax over the feature dimension. The similarity of the two outputs is then measured with a cross-entropy loss. A stop-gradient (sg) operator is applied to the teacher so that gradients propagate only through the student. The teacher parameters are updated with an exponential moving average (EMA) of the student parameters.
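The steps described above (centering, temperature softmax, cross-entropy, and the EMA teacher update) can be sketched in plain Python for a single pair of outputs. This is a minimal illustration, not the authors' implementation: the function names (`dino_pair_loss`, `ema_update`, `update_center`), the temperature and momentum values, and the choice to track the center as a running average of batch means are all assumptions made for the example. Outputs are plain lists of floats, so there is no automatic differentiation; the stop-gradient simply corresponds to treating the teacher distribution as a fixed target.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over the feature dimension of one vector."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Log of the temperature-scaled softmax, computed without underflow."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_norm for v in scaled]


def dino_pair_loss(student_out, teacher_out, center,
                   student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions for one pair.

    The teacher output is centered before its softmax. The teacher side is a
    fixed target here (stop-gradient); temperatures are assumed values.
    """
    centered = [t - c for t, c in zip(teacher_out, center)]
    p_teacher = softmax(centered, teacher_temp)
    log_p_student = log_softmax(student_out, student_temp)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter toward the matching student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_batch, center_momentum=0.9):
    """Blend the current center with the batch mean of teacher outputs."""
    n = len(teacher_batch)
    batch_mean = [sum(column) / n for column in zip(*teacher_batch)]
    return [center_momentum * c + (1.0 - center_momentum) * bm
            for c, bm in zip(center, batch_mean)]
```

In a training loop under these assumptions, one would compute `dino_pair_loss` on the two views, update the student by gradient descent on that loss, and then call `ema_update` and `update_center` once per step. A low teacher temperature sharpens the target distribution, while centering discourages any single output dimension from dominating.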

Source: Emerging Properties in Self-Supervised Vision Transformers

Tasks

Task	Papers	Share
Semantic Segmentation	26	13.90%
Self-Supervised Learning	16	8.56%
Object Detection	13	6.95%
Instance Segmentation	8	4.28%
Clustering	6	3.21%
Image Classification	6	3.21%
Image Segmentation	4	2.14%
Retrieval	4	2.14%
Unsupervised Semantic Segmentation	3	1.60%

Components

Component	Type
Vision Transformer	Image Models

Categories

Self-Supervised Learning

Vision Transformers