37 papers with code • 2 benchmarks • 1 dataset
This is the task of image classification using representations learnt with self-supervised learning. Self-supervised methods generally involve a pretext task that is solved to learn a good representation and a loss function to learn with. One example of a loss function is an autoencoder-based loss, where the goal is to reconstruct an image pixel by pixel. A more popular recent example is a contrastive loss, which measures the similarity of sample pairs in a representation space, and where the target can vary instead of being fixed, as the reconstruction target is for autoencoders.
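As a rough illustration of the contrastive loss described above, here is a minimal NumPy sketch of an InfoNCE-style loss. The function name `info_nce_loss`, the `temperature` value, and the use of random vectors as embeddings are assumptions made for this example. It is a simplified one-directional variant, not the exact loss of any paper listed on this page.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss for two batches of paired embeddings.

    z_a[i] and z_b[i] come from two views of the same image (a positive pair);
    every z_b[j] with j != i acts as a negative for z_a[i].
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal.
    return -np.mean(np.diag(log_prob))

# Hypothetical embeddings: view_b is a slightly perturbed copy of view_a.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.01 * rng.normal(size=(8, 16))
loss_matched = info_nce_loss(view_a, view_b)
loss_mismatched = info_nce_loss(view_a, np.roll(view_b, 1, axis=0))
```

The loss is low when each embedding is most similar to its own pair and grows when the pairs are misaligned, which is why `loss_matched` comes out smaller than `loss_mismatched` here.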
A common evaluation protocol is to train a linear classifier on top of (frozen) representations learnt by self-supervised methods. The leaderboards for the linear evaluation protocol can be found below. In practice, it is more common to fine-tune features on a downstream task. An alternative evaluation protocol therefore uses semi-supervised learning and fine-tunes on a fixed percentage of the labels. The leaderboards for the fine-tuning protocol can be accessed here.
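The linear evaluation protocol can be sketched as follows. This is a toy version under stated assumptions: `frozen_encoder` is a hypothetical stand-in for a pretrained backbone (a fixed random projection with ReLU), the data are synthetic Gaussian blobs, and the classifier is trained with plain gradient descent. Only the linear layer is updated; the encoder weights are never touched.

```python
import numpy as np

def frozen_encoder(x, w_enc):
    """Stand-in for a pretrained backbone: a fixed projection followed by ReLU."""
    return np.maximum(x @ w_enc, 0.0)

def train_linear_probe(feats, labels, num_classes, lr=0.05, steps=200):
    """Fit a softmax classifier on fixed features with plain gradient descent."""
    n, d = feats.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n   # gradient of mean cross-entropy w.r.t. logits
        w -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs standing in for images.
x = np.concatenate([rng.normal(-2.0, 1.0, size=(100, 8)),
                    rng.normal(2.0, 1.0, size=(100, 8))])
y = np.array([0] * 100 + [1] * 100)
w_enc = rng.normal(size=(8, 32)) / np.sqrt(8)   # hypothetical "pretrained" weights
w_enc_before = w_enc.copy()

feats = frozen_encoder(x, w_enc)                # computed once, encoder stays frozen
w, b = train_linear_probe(feats, y, num_classes=2)
accuracy = np.mean(np.argmax(feats @ w + b, axis=1) == y)
```

Because the features are computed once and reused, the resulting accuracy reflects the quality of the representation rather than the capacity of the classifier. The fine-tuning protocol would instead also update `w_enc`.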
You may want to read some blog posts before reading the papers and checking the leaderboards:
There is also Yann LeCun's talk at AAAI-20 which you can watch here (35:00+).
(Image credit: A Simple Framework for Contrastive Learning of Visual Representations)
Many recent methods for unsupervised or self-supervised representation learning train feature extractors by maximizing an estimate of the mutual information (MI) between different views of the data.
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.
Ranked #3 on Self-Supervised Image Classification on ImageNet
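To make the "transposed" self-attention idea above concrete, the following single-head NumPy sketch builds the attention map over feature channels instead of tokens. It is an illustration under assumptions, not the authors' implementation: the learned query, key, and value projections are omitted, `temperature` is a fixed scalar, and the softmax axis is chosen so that each output channel is a weighted average of value channels.

```python
import numpy as np

def cross_covariance_attention(q, k, v, temperature=1.0):
    """Channel-wise attention for inputs of shape (num_tokens, dim).

    The attention map has shape (dim, dim), so its size does not depend
    on the number of tokens.
    """
    # Normalise each channel (column) across tokens.
    q_hat = q / np.linalg.norm(q, axis=0, keepdims=True)
    k_hat = k / np.linalg.norm(k, axis=0, keepdims=True)
    scores = k_hat.T @ q_hat / temperature          # (dim, dim) cross-covariance
    scores = scores - scores.max(axis=0, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=0, keepdims=True)   # each column sums to 1
    return v @ attn, attn                           # output: (num_tokens, dim)

rng = np.random.default_rng(0)
num_tokens, dim = 50, 8
q = rng.normal(size=(num_tokens, dim))
k = rng.normal(size=(num_tokens, dim))
v = rng.normal(size=(num_tokens, dim))
out, attn = cross_covariance_attention(q, k, v)
```

With 50 tokens and 8 channels, token self-attention would need a 50 x 50 map, while this variant needs only an 8 x 8 map.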
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification.
Ranked #5 on Image Classification on ImageNet V2 (using extra training data)
From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
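The online/target setup described above can be sketched in a few lines of NumPy. Everything here is a simplified assumption for illustration: each network is collapsed to a single hypothetical weight matrix, the two "views" are noisy copies of random vectors, the optimizer step is a placeholder, and the target is updated as an exponential moving average of the online weights with an assumed decay `tau`.

```python
import numpy as np

def byol_loss(online_prediction, target_projection):
    """Squared error between L2-normalised vectors (equals 2 - 2 * cosine similarity)."""
    p = online_prediction / np.linalg.norm(online_prediction, axis=1, keepdims=True)
    t = target_projection / np.linalg.norm(target_projection, axis=1, keepdims=True)
    return np.mean(np.sum((p - t) ** 2, axis=1))

def ema_update(target_w, online_w, tau=0.99):
    """Move the target weights a small step towards the online weights."""
    return tau * target_w + (1.0 - tau) * online_w

rng = np.random.default_rng(0)
view_a = rng.normal(size=(4, 8))                   # stand-in for one augmented view
view_b = view_a + 0.1 * rng.normal(size=(4, 8))    # stand-in for a second view

online_w = rng.normal(size=(8, 8))      # hypothetical online network
predictor_w = rng.normal(size=(8, 8))   # hypothetical predictor, online side only
target_w = online_w.copy()              # target starts as a copy of the online network

prediction = (view_a @ online_w) @ predictor_w
target = view_b @ target_w              # treated as a constant: no gradient reaches the target
loss = byol_loss(prediction, target)

online_w_after_step = online_w + 0.05   # placeholder for an optimizer step on the online weights
target_w = ema_update(target_w, online_w_after_step)
```

Only the online weights and the predictor would receive gradients from `loss`; the target changes solely through `ema_update`. A fuller version would also swap the two views and add the resulting loss to make the objective symmetric.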
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets).
Ranked #1 on Video Object Detection on DAVIS 2017
This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
Ranked #53 on Self-Supervised Image Classification on ImageNet
We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result.
Ranked #80 on Self-Supervised Image Classification on ImageNet