DINO (self-distillation with no labels) is a self-supervised learning method that directly predicts the output of a teacher network, built with a momentum encoder, using a standard cross-entropy loss.
For simplicity, DINO is described here for a single pair of views $\left(x_{1}, x_{2}\right)$. The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. Each network outputs a $K$-dimensional feature that is normalized with a temperature softmax over the feature dimension. The similarity of the two outputs is then measured with a cross-entropy loss. A stop-gradient (sg) operator is applied to the teacher so that gradients propagate only through the student. The teacher parameters are updated with an exponential moving average (ema) of the student parameters.
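The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names (`softmax`, `log_softmax`, `dino_loss`, `ema_update`), the temperature defaults, and the momentum values in the usage note are assumptions chosen for the example. NumPy has no automatic differentiation, so the stop-gradient is implicit here; in a training framework the teacher output would be detached from the graph before the loss is computed.

```python
import numpy as np


def softmax(logits, temperature):
    """Temperature softmax over the last (feature) dimension."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def log_softmax(logits, temperature):
    """Log of the temperature softmax over the last dimension."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered teacher and the student distributions.

    student_out, teacher_out: arrays of shape (batch, K).
    center: array of shape (K,), a running mean of teacher outputs.
    tau_s, tau_t: illustrative temperature values (assumed, not prescribed).
    """
    # Teacher branch: center, then sharpen with a temperature softmax.
    # Treated as a constant target (the stop-gradient in the description).
    p_teacher = softmax(teacher_out - center, tau_t)
    # Student branch: temperature softmax only, no centering.
    log_p_student = log_softmax(student_out, tau_s)
    # Cross-entropy per sample, averaged over the batch.
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


def ema_update(old, new, momentum):
    """Exponential moving average: keep `momentum` of old, blend in new."""
    return momentum * old + (1.0 - momentum) * new
```

In a training loop, the same `ema_update` helper could serve both moving averages, for example `center = ema_update(center, teacher_out.mean(axis=0), 0.9)` for the center and `teacher_w = ema_update(teacher_w, student_w, 0.996)` for each teacher parameter; the momentum values shown are illustrative.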
Source: Emerging Properties in Self-Supervised Vision Transformers

Task | Papers | Share |
---|---|---|
Semantic Segmentation | 34 | 9.63% |
Object Detection | 30 | 8.50% |
Object | 23 | 6.52% |
Self-Supervised Learning | 22 | 6.23% |
Image Classification | 10 | 2.83% |
Instance Segmentation | 9 | 2.55% |
Image Generation | 6 | 1.70% |
Clustering | 6 | 1.70% |
Decoder | 5 | 1.42% |