Self-Supervised Learning

SimCLR is a framework for contrastive learning of visual representations. It learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. It consists of:

  • A stochastic data augmentation module that randomly transforms any given data example, producing two correlated views of the same example, denoted $\mathbf{\tilde{x}}_{i}$ and $\mathbf{\tilde{x}}_{j}$, which are considered a positive pair. SimCLR sequentially applies three simple augmentations: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur. The authors find that the combination of random cropping and color distortion is crucial for good performance.

  • A neural network base encoder $f\left(\cdot\right)$ that extracts representation vectors from augmented data examples. The framework allows various choices of the network architecture without any constraints. The authors opt for simplicity and adopt ResNet to obtain $h_{i} = f\left(\mathbf{\tilde{x}}_{i}\right) = \text{ResNet}\left(\mathbf{\tilde{x}}_{i}\right)$, where $h_{i} \in \mathbb{R}^{d}$ is the output after the average pooling layer.

  • A small neural network projection head $g\left(\cdot\right)$ that maps representations to the space where the contrastive loss is applied. The authors use an MLP with one hidden layer to obtain $z_{i} = g\left(h_{i}\right) = W^{(2)}\sigma\left(W^{(1)}h_{i}\right)$, where $\sigma$ is a ReLU nonlinearity. The authors find it beneficial to define the contrastive loss on the $z_{i}$’s rather than the $h_{i}$’s.

  • A contrastive loss function defined for a contrastive prediction task. Given a set $\{\mathbf{\tilde{x}}_{k}\}$ including a positive pair of examples $\mathbf{\tilde{x}}_{i}$ and $\mathbf{\tilde{x}}_{j}$, the contrastive prediction task aims to identify $\mathbf{\tilde{x}}_{j}$ in $\{\mathbf{\tilde{x}}_{k}\}_{k \neq i}$ for a given $\mathbf{\tilde{x}}_{i}$.
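
As a rough illustration of the projection head formula $z_{i} = W^{(2)}\sigma\left(W^{(1)}h_{i}\right)$, the following minimal sketch uses only the Python standard library. The helper names (`matvec`, `init_matrix`, `projection_head`), the layer sizes, and the Gaussian weight initialization are assumptions made for this example and are not taken from the source; a practical implementation would use a deep learning library and a real encoder output for $h_{i}$.

```python
import random


def relu(v):
    """Element-wise ReLU, the nonlinearity sigma in the formula."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_matrix(rows, cols, rng):
    """Small random weights; this initialization scheme is an assumption."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def projection_head(h, W1, W2):
    """Compute z = W2 * ReLU(W1 * h), with no bias terms, as in the formula."""
    return matvec(W2, relu(matvec(W1, h)))


rng = random.Random(0)
d, hidden, out = 8, 8, 4  # illustrative sizes only
W1 = init_matrix(hidden, d, rng)
W2 = init_matrix(out, hidden, rng)

# Stand-in for an encoder output h_i = f(x_i); a real h_i would come from ResNet.
h_i = [rng.gauss(0.0, 1.0) for _ in range(d)]
z_i = projection_head(h_i, W1, W2)
```

In this sketch the contrastive loss would be computed on `z_i`, while `h_i` is the representation kept for downstream use.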

A minibatch of $N$ examples is randomly sampled, and the contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, resulting in $2N$ data points. Negative examples are not sampled explicitly. Instead, given a positive pair, the other $2(N - 1)$ augmented examples within the minibatch are treated as negative examples. The loss function is NT-Xent, the normalized temperature-scaled cross-entropy loss (see components).
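
For a positive pair $(i, j)$, the NT-Xent loss can be written as $\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_{i}, z_{j})/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_{i}, z_{k})/\tau\right)}$, where $\mathrm{sim}$ is cosine similarity and $\tau$ is a temperature. The sketch below is one way to compute it in plain Python, averaged over all $2N$ anchors. It assumes that the vectors at positions `2k` and `2k + 1` of the input list form a positive pair; that ordering, the function names, and the default temperature of 0.5 are choices made for this example rather than details given in the text above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(z, temperature=0.5):
    """Mean NT-Xent loss over 2N projections.

    Assumes z[2k] and z[2k + 1] are the two views of example k.
    """
    n = len(z)
    if n < 2 or n % 2 != 0:
        raise ValueError("expected an even number of projections (2N)")
    total = 0.0
    for i in range(n):
        j = i ^ 1  # positive partner: 2k <-> 2k + 1
        # Scaled similarities to every other point in the batch (k != i).
        logits = [
            cosine_similarity(z[i], z[k]) / temperature
            for k in range(n)
            if k != i
        ]
        # Log of the denominator, with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        positive_logit = cosine_similarity(z[i], z[j]) / temperature
        total += log_denominator - positive_logit
    return total / n


# Two examples (N = 2), so 2N = 4 projections.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mismatched = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent_loss(aligned)
loss_mismatched = nt_xent_loss(mismatched)
```

Here `loss_aligned` is smaller than `loss_mismatched`, because in the first batch each projection is most similar to its own positive partner, whereas in the second it is most similar to a negative. Note that the denominator runs over $2N - 1$ terms: the positive plus the $2(N - 1)$ negatives.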

Source: A Simple Framework for Contrastive Learning of Visual Representations

