MoCo v3

Introduced by Chen et al. in An Empirical Study of Training Self-Supervised Vision Transformers

MoCo v3 aims to stabilize the training of self-supervised ViTs. It is an incremental improvement over MoCo v1/2. For each image, two crops are taken under random data augmentation. The crops are encoded by two encoders, $f_q$ and $f_k$, with output vectors $q$ and $k$. $q$ behaves like a "query", and the goal of learning is to retrieve the corresponding "key". The objective is to minimize a contrastive loss function of the following form, where $k^{+}$ is the key from the same image as $q$, $k^{-}$ ranges over keys from other images, and $\tau$ is a temperature:

$$ \mathcal{L}_q=-\log \frac{\exp \left(q \cdot k^{+} / \tau\right)}{\exp \left(q \cdot k^{+} / \tau\right)+\sum_{k^{-}} \exp \left(q \cdot k^{-} / \tau\right)} $$
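
The loss above can be evaluated for a single query with a few lines of plain Python. This is a minimal sketch for illustration, not the authors' implementation: the function name `contrastive_loss`, the list-of-floats representation, and the default `tau` are assumptions made here, and the inputs are used as given (no normalization is applied).

```python
import math


def contrastive_loss(q, k_pos, k_negs, tau=0.2):
    """Contrastive loss for one query q, one positive key, and negative keys.

    All vectors are plain lists of floats. The name, signature, and default
    tau are illustrative assumptions, not taken from the paper's code.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Temperature-scaled similarities: the positive first, then the negatives.
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]

    # log(sum(exp(logits))), computed with max subtraction for stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))

    # -log(exp(pos) / sum(exp(all))) == log_denominator - pos
    return log_denominator - logits[0]
```

The loss shrinks as $q \cdot k^{+}$ grows relative to the $q \cdot k^{-}$ terms, and it is exactly zero when there are no negatives.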

This approach aims to train the Transformer in the contrastive/Siamese paradigm. The encoder $f_q$ consists of a backbone (e.g., ResNet or ViT), a projection head, and an extra prediction head. The encoder $f_k$ has the backbone and projection head but not the prediction head. $f_k$ is updated by a moving average of $f_q$, excluding the prediction head.
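
The moving-average update of $f_k$ can be sketched as follows. This is a simplified illustration under stated assumptions: parameters are represented as a dictionary of scalar floats rather than tensors, and the function name `momentum_update`, the parameter names, and the momentum coefficient `m` are hypothetical choices made here.

```python
def momentum_update(key_params, query_params, m=0.99):
    """Return updated key-encoder parameters as a moving average.

    key_params and query_params map parameter names to floats (a stand-in
    for tensors). Only names present in key_params are updated, so any
    prediction-head entries that exist only in query_params are ignored.
    """
    return {
        name: m * key_params[name] + (1.0 - m) * query_params[name]
        for name in key_params
    }


# Hypothetical parameters: the query encoder has an extra prediction head.
f_k = {"backbone.w": 1.0, "projector.w": 2.0}
f_q = {"backbone.w": 0.0, "projector.w": 4.0, "predictor.w": 5.0}
f_k = momentum_update(f_k, f_q, m=0.9)
```

Iterating over the key encoder's parameter names is one simple way to exclude the prediction head, since $f_k$ has no such parameters to update.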

Source: An Empirical Study of Training Self-Supervised Vision Transformers

Tasks

Task	Papers	Share
Self-Supervised Learning	5	16.67%
Self-Supervised Image Classification	3	10.00%
Image Classification	2	6.67%
Semantic Segmentation	2	6.67%
Visual Prompt Tuning	1	3.33%
Classification	1	3.33%
Atari Games	1	3.33%
Reinforcement Learning (RL)	1	3.33%
Image Captioning	1	3.33%

Categories

Vision Transformers