Self-Supervised Image Classification
81 papers with code • 2 benchmarks • 1 dataset
This is the task of image classification using representations learnt with self-supervised learning. Self-supervised methods generally involve a pretext task that is solved to learn a good representation and a loss function to learn with. One example of a loss function is an autoencoder-based loss, where the goal is to reconstruct an image pixel by pixel. A more popular recent example is a contrastive loss, which measures the similarity of sample pairs in a representation space, and where the target can vary instead of being a fixed target to reconstruct (as in the case of autoencoders).
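To make the contrastive idea concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It assumes cosine similarity as the similarity measure and a temperature of 0.1; the function names and the toy vectors are illustrative only and do not come from any particular paper or library.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of identifying the positive among positive + negatives."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive (index 0).
    return log_sum_exp - logits[0]

# Toy representations (assumed values, for illustration only).
anchor = [1.0, 0.0]
near = [0.9, 0.1]    # e.g. another augmented view of the same image
far = [-1.0, 0.2]    # e.g. a view of a different image

loss_good = info_nce_loss(anchor, near, [far])  # positive is the similar sample
loss_bad = info_nce_loss(anchor, far, [near])   # positive is the dissimilar sample
```

The loss is small when the positive pair is closer in representation space than the negatives and large otherwise, which is the sense in which the target "varies" with the other samples instead of being a fixed reconstruction.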
A common evaluation protocol is to train a linear classifier on top of (frozen) representations learnt by self-supervised methods. The leaderboards for the linear evaluation protocol can be found below. In practice, it is more common to fine-tune features on a downstream task. An alternative evaluation protocol therefore uses semi-supervised learning and fine-tunes on a percentage of the labels. The leaderboards for the fine-tuning protocol can be accessed here.
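The linear evaluation protocol described above can be sketched as follows. This is a toy illustration, not a benchmark setup: `frozen_encoder` is a hypothetical stand-in for a pretrained network, the data is synthetic, and the classifier is a binary logistic regression trained with plain gradient descent. The point it shows is that the features are computed once and only the linear weights are updated.

```python
import math
import random

def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained network; it is never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression on fixed features; only w and b are trained."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i, fi in enumerate(f):
                grad_w[i] += err * fi / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def accuracy(w, b, features, labels):
    preds = [1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
             for f in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

random.seed(0)
# Synthetic data: the label is 1 when x0 + x1 > 0.
inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]

# Features are extracted once from the frozen encoder.
features = [frozen_encoder(x) for x in inputs]

# Train the probe on 150 examples and evaluate on the 50 held out.
w, b = train_linear_probe(features[:150], labels[:150])
test_acc = accuracy(w, b, features[150:], labels[150:])
```

Under the fine-tuning protocol, the encoder's own weights would also receive gradient updates, using only a subset of the labelled examples.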
You may want to read some blog posts before reading the papers and checking the leaderboards:
- Contrastive Self-Supervised Learning - Ankesh Anand
- The Illustrated Self-Supervised Learning - Amit Chaudhary
- Self-supervised learning and computer vision - Jeremy Howard
- Self-Supervised Representation Learning - Lilian Weng
There is also Yann LeCun's talk at AAAI-20 which you can watch here (35:00+).
(Image credit: A Simple Framework for Contrastive Learning of Visual Representations)
From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
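A rough sketch of the two ingredients this kind of online/target setup typically relies on is given below. It assumes that the target network is an exponential moving average of the online network and that the prediction is compared to the target with a squared error between L2-normalized vectors; the weights, the decay `tau`, and all function names are hypothetical and chosen for illustration.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def regression_loss(online_prediction, target_projection):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))

def ema_update(target_params, online_params, tau=0.99):
    """Move each target weight a small step toward the matching online weight."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

# Hypothetical flattened weights, for illustration only.
online_params = [1.0, 2.0, 3.0]
target_params = [0.0, 0.0, 0.0]

# In training, the online weights are updated by gradient descent on the loss,
# while the target weights only change through this moving average.
target_params = ema_update(target_params, online_params, tau=0.9)

aligned = regression_loss([1.0, 1.0], [2.0, 2.0])    # same direction
opposed = regression_loss([1.0, 1.0], [-1.0, -1.0])  # opposite direction
```

Because both vectors are normalized, the loss depends only on the angle between the online prediction and the target representation, ranging from 0 (aligned) to 4 (opposed).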
The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models.
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets).
This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.
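One way to express this objective is as a loss on the cross-correlation matrix between the embeddings of two distorted views, pushing its diagonal toward 1 (similar embeddings) and its off-diagonal entries toward 0 (low redundancy). The sketch below assumes that formulation; the off-diagonal weight of 0.005, the function names, and the toy batches are illustrative assumptions, and every embedding dimension is assumed to have non-zero variance over the batch.

```python
import math

def standardize_columns(batch):
    """Normalize each embedding dimension to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / n)
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out

def cross_correlation(za, zb):
    """d x d matrix of correlations between dimensions of the two views."""
    n, d = len(za), len(za[0])
    a, b = standardize_columns(za), standardize_columns(zb)
    return [[sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
            for i in range(d)]

def redundancy_reduction_loss(za, zb, off_diag_weight=0.005):
    c = cross_correlation(za, zb)
    d = len(c)
    # Invariance term: matching dimensions should be perfectly correlated.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Redundancy term: different dimensions should be uncorrelated.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag

# Toy batch whose two dimensions are decorrelated; both views agree exactly.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_ideal = redundancy_reduction_loss(decorrelated, decorrelated)

# Toy batch whose two dimensions carry identical information.
redundant = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
loss_redundant = redundancy_reduction_loss(redundant, redundant)
```

In the second batch the views still agree, so the diagonal term vanishes, but the duplicated dimension is penalized by the off-diagonal term.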