This is the task of image classification using representations learnt with self-supervised learning. Self-supervised methods generally involve a pretext task that is solved to learn a good representation, together with a loss function to learn with. One example of a loss function is an autoencoder-based loss, where the goal is to reconstruct an image pixel by pixel. A more popular recent example is a contrastive loss, which measures the similarity of sample pairs in a representation space, and where the target can vary from pair to pair instead of being a fixed reconstruction target (as in the case of autoencoders).
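As a rough illustration of the contrastive idea, the sketch below implements an NT-Xent-style (InfoNCE) loss in PyTorch, the form used by SimCLR-like methods; the temperature value and batch handling are assumptions for the example, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss sketch.

    z1, z2: [N, D] representations of two augmented views of the same N images.
    Positive pairs are (z1[i], z2[i]); all other samples in the batch act as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # [2N, D]
    sim = z @ z.t() / temperature                  # cosine similarities (z is unit-norm)
    n = z1.size(0)
    sim.fill_diagonal_(float('-inf'))              # a sample is never its own negative
    # For index i in the first half, the positive is i + n, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Toy usage: 8 images, 128-d embeddings from some encoder.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2).item())
```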
A common evaluation protocol is to train a linear classifier on top of (frozen) representations learnt by self-supervised methods. The leaderboards for the linear evaluation protocol can be found below. In practice, it is more common to fine-tune features on a downstream task. An alternative evaluation protocol therefore uses semi-supervised learning and fine-tunes on a percentage of the labels. The leaderboards for the fine-tuning protocol can be accessed here.
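A minimal sketch of the linear evaluation protocol, assuming a torchvision ResNet-50 stands in for the pretrained self-supervised backbone: the encoder is frozen and only a linear classifier trained on top of its features receives gradients.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for a self-supervised encoder; in practice you would load
# pretrained SSL weights (e.g. a SimCLR or MoCo checkpoint) instead.
backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()          # expose the 2048-d pooled features

# Freeze the backbone: only the linear head is trained.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

linear_head = nn.Linear(2048, 1000)  # e.g. 1000 ImageNet classes
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    with torch.no_grad():            # frozen features
        feats = backbone(images)
    logits = linear_head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch just to show the shapes.
print(train_step(torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))))
```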
You may want to read some blog posts before reading the papers and checking the leaderboards.
There is also Yann LeCun's talk at AAAI-20, which you can watch here (from 35:00 onwards).
(Image credit: A Simple Framework for Contrastive Learning of Visual Representations)
In this paper, we question whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets).
Ranked #1 on Copy Detection on Copydays strong subset
In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT.
Ranked #1 on Self-Supervised Image Classification on ImageNet
This causes the representation vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.
Ranked #1 on Image Classification on Places205
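The redundancy-reduction objective described in the snippet above (Barlow Twins) can be sketched as follows: the cross-correlation matrix between the embeddings of two distorted views is pushed towards the identity, so paired components agree while different components are decorrelated. The lambda weight is an assumed hyperparameter for the illustration.

```python
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """Redundancy-reduction loss sketch (Barlow Twins style).

    z1, z2: [N, D] embeddings of two distorted views of the same N images.
    """
    n, d = z1.shape
    # Standardize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / n                                          # [D, D] cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                 # invariance: diagonal -> 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()    # decorrelation: off-diagonal -> 0
    return on_diag + lambd * off_diag

print(barlow_twins_loss(torch.randn(16, 32), torch.randn(16, 32)).item())
```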
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods.
Ranked #1 on Self-Supervised Image Classification on ImageNet (finetuned) (using extra training data)
With this in mind, we propose a teacher-student scheme to learn representations by training a convnet to reconstruct a bag-of-visual-words (BoW) representation of an image, given as input a perturbed version of that same image.
Ranked #13 on Semi-Supervised Image Classification on ImageNet - 1% labeled data (Top 5 Accuracy metric)
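A toy sketch of the bag-of-visual-words idea from the snippet above: local features from a (frozen) teacher are assigned to a small visual-word vocabulary to form a normalized histogram target, and the student is trained with a soft cross-entropy to predict that histogram from a perturbed view. The vocabulary size, hard-assignment rule, and tensor shapes here are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

K, D = 64, 128                                   # vocabulary size and feature dimension (assumed)
vocab = F.normalize(torch.randn(K, D), dim=1)    # stand-in visual-word vocabulary

def bow_target(teacher_feats):
    """teacher_feats: [N, D, H, W] local features from a frozen teacher network."""
    n, d, h, w = teacher_feats.shape
    f = F.normalize(teacher_feats.permute(0, 2, 3, 1).reshape(-1, d), dim=1)
    assign = (f @ vocab.t()).argmax(dim=1)                       # assign each location to a word
    hist = F.one_hot(assign, K).float().reshape(n, h * w, K).mean(1)
    return hist                                                  # [N, K] normalized BoW histogram

def bow_loss(student_logits, teacher_feats):
    """Cross-entropy between the student's predicted word distribution and the BoW target."""
    target = bow_target(teacher_feats)
    return -(target * F.log_softmax(student_logits, dim=1)).sum(1).mean()

# Toy usage: the student sees a perturbed view and outputs K logits per image.
print(bow_loss(torch.randn(4, K), torch.randn(4, D, 7, 7)).item())
```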
To the best of our knowledge, this is the first time a self-supervised AlexNet has outperformed a supervised one on ImageNet classification.
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images.
Ranked #11 on Image Classification on STL-10 (using extra training data)
The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2, supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge.
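The third step above (distillation with unlabeled examples) can be sketched as a soft-label cross-entropy between the fine-tuned teacher and a smaller student; the temperature is an assumed value and random logits stand in for the two networks.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Distillation on unlabeled images: the student matches the teacher's
    softened class distribution (cross-entropy with soft targets)."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()

# Toy usage with random logits standing in for the teacher and student networks.
print(distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000)).item())
```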
In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much.
Ranked #4 on Self-Supervised Image Classification on ImageNet (finetuned) (using extra training data)
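A sketch of the multi-crop augmentation described in the snippet above, using torchvision transforms: a couple of standard-resolution "global" crops are mixed with several low-resolution "local" crops of the same image. The crop sizes, counts, and scale ranges are assumed for illustration.

```python
from PIL import Image
from torchvision import transforms

# Assumed settings for illustration: 2 global 224px crops + 6 local 96px crops.
global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def multi_crop(img, n_global=2, n_local=6):
    """Return a mix of full-resolution and low-resolution views of one image."""
    return [global_crop(img) for _ in range(n_global)] + \
           [local_crop(img) for _ in range(n_local)]

views = multi_crop(Image.new("RGB", (256, 256)))
print([tuple(v.shape) for v in views])  # two 3x224x224 tensors followed by six 3x96x96 tensors
```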
From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
Ranked #5 on Self-Supervised Image Classification on ImageNet
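The prediction objective described in the snippet above (BYOL) reduces to a negative cosine similarity between the online network's prediction and the stop-gradient projection from the target network, whose weights follow an exponential moving average of the online weights; the momentum value below is an assumption for the sketch.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """online_pred: online network's prediction for one view;
    target_proj: target network's projection of the other view (no gradient)."""
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj.detach(), dim=1)   # stop-gradient on the target branch
    return (2 - 2 * (p * z).sum(dim=1)).mean()     # equivalent to 2 - 2 * cosine similarity

@torch.no_grad()
def ema_update(target_net, online_net, momentum=0.99):
    """Target network weights track an exponential moving average of the online weights."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(momentum).add_(o, alpha=1 - momentum)

# Toy usage with random vectors standing in for the two branches.
print(byol_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```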