SimCLR

Introduced by Chen et al. in A Simple Framework for Contrastive Learning of Visual Representations

SimCLR is a framework for contrastive learning of visual representations. It learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. It consists of:

A stochastic data augmentation module that randomly transforms any given data example into two correlated views of the same example, denoted $\mathbf{\tilde{x}}_{i}$ and $\mathbf{\tilde{x}}_{j}$, which are treated as a positive pair. SimCLR sequentially applies three simple augmentations: random cropping followed by a resize back to the original size, random color distortion, and random Gaussian blur. The authors find that the combination of random cropping and color distortion is crucial for good performance.
A neural network base encoder $f\left(\cdot\right)$ that extracts representation vectors from augmented data examples. The framework places no constraints on the choice of network architecture. The authors opt for simplicity and adopt ResNet to obtain $h_{i} = f\left(\mathbf{\tilde{x}}_{i}\right) = \text{ResNet}\left(\mathbf{\tilde{x}}_{i}\right)$, where $h_{i} \in \mathbb{R}^{d}$ is the output after the average pooling layer.
A small neural network projection head $g\left(\cdot\right)$ that maps representations to the space where the contrastive loss is applied. The authors use an MLP with one hidden layer to obtain $z_{i} = g\left(h_{i}\right) = W^{(2)}\sigma\left(W^{(1)}h_{i}\right)$, where $\sigma$ is a ReLU nonlinearity. The authors find it beneficial to define the contrastive loss on the $z_{i}$ rather than the $h_{i}$.
A contrastive loss function defined for a contrastive prediction task. Given a set $\{\mathbf{\tilde{x}}_{k}\}$ that includes a positive pair of examples $\mathbf{\tilde{x}}_{i}$ and $\mathbf{\tilde{x}}_{j}$, the contrastive prediction task aims to identify $\mathbf{\tilde{x}}_{j}$ in $\{\mathbf{\tilde{x}}_{k}\}_{k \neq i}$ for a given $\mathbf{\tilde{x}}_{i}$.
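
The projection head formula $z_{i} = W^{(2)}\sigma\left(W^{(1)}h_{i}\right)$ can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the layer sizes, the random initialization, and the helper names (`matvec`, `random_matrix`, `projection_head`) are hypothetical, and the vector `h_i` is a random stand-in for the encoder output $f\left(\mathbf{\tilde{x}}_{i}\right)$ rather than a real ResNet feature.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def projection_head(h, W1, W2):
    """Compute z = W2 * ReLU(W1 * h), with no bias terms, as in the formula."""
    return matvec(W2, relu(matvec(W1, h)))


def random_matrix(rows, cols, rng):
    """Hypothetical initializer: uniform weights scaled by 1/sqrt(cols)."""
    scale = cols ** -0.5
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, hidden, out = 8, 8, 4                    # toy sizes, chosen for illustration only
W1 = random_matrix(hidden, d, rng)          # plays the role of W^(1)
W2 = random_matrix(out, hidden, rng)        # plays the role of W^(2)
h_i = [rng.gauss(0, 1) for _ in range(d)]   # stand-in for the encoder output h_i
z_i = projection_head(h_i, W1, W2)          # vector the contrastive loss would see
```

In this sketch the contrastive loss would be computed on `z_i`, while `h_i` is the representation kept for downstream tasks, mirroring the distinction drawn in the description of the projection head.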

A minibatch of $N$ examples is randomly sampled and the contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, resulting in $2N$ data points. Negative examples are not sampled explicitly. Instead, given a positive pair, the other $2(N - 1)$ augmented examples within the minibatch are treated as negative examples. The loss function is NT-Xent, the normalized temperature-scaled cross-entropy loss (see Components).
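
For a positive pair $(i, j)$, the SimCLR paper writes the NT-Xent loss as $\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}\left(z_{i}, z_{j}\right)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}\left(z_{i}, z_{k}\right)/\tau\right)}$, where $\mathrm{sim}$ is cosine similarity and $\tau$ is a temperature. The following is a minimal plain-Python sketch of that formula, not the authors' implementation. It assumes the $2N$ projections are ordered so that `z[2k]` and `z[2k + 1]` form a positive pair, and the default `temperature=0.5` is an arbitrary illustrative value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z, temperature=0.5):
    """Mean NT-Xent loss over all 2N anchors.

    Assumes z[2k] and z[2k + 1] are the two views of example k
    (an ordering convention chosen for this sketch).
    """
    n = len(z)  # n = 2N augmented examples
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner: 0<->1, 2<->3, ...
        # Scaled similarities to every other example: the positive
        # plus the 2(N - 1) in-batch negatives.
        logits = [cosine(z[i], z[k]) / temperature for k in range(n) if k != i]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - cosine(z[i], z[j]) / temperature
    return total / n


# Toy batch with N = 2: positives aligned versus positives orthogonal.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(aligned)
loss_misaligned = nt_xent(misaligned)
```

On this toy batch the loss is lower when each positive pair points in the same direction than when the partner is orthogonal and a negative is identical, which is the behavior the contrastive prediction task is meant to reward.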

Source: A Simple Framework for Contrastive Learning of Visual Representations

Tasks

Task	Papers	Share
Self-Supervised Learning	103	28.85%
Image Classification	18	5.04%
Semantic Segmentation	10	2.80%
Object Detection	9	2.52%
Classification	8	2.24%
Retrieval	8	2.24%
Activity Recognition	7	1.96%
Human Activity Recognition	7	1.96%
General Classification	7	1.96%

Components

Component	Type
ColorJitter	Image Data Augmentation
Feedforward Network	Feedforward Networks
NT-Xent	Loss Functions
Random Gaussian Blur	Image Data Augmentation
Random Resized Crop	Image Data Augmentation
ReLU	Activation Functions
ResNet	Convolutional Neural Networks

Categories

Self-Supervised Learning