Contrastive Language-Image Pre-training

Introduced by Radford et al. in Learning Transferable Visual Models From Natural Language Supervision

Contrastive Language-Image Pre-training (CLIP), a simplified version of ConVIRT trained from scratch, is an efficient method for learning image representations from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.
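The zero-shot step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `zero_shot_classify`, the fixed `temperature` value, and the toy embeddings are all hypothetical, and the arrays stand in for the outputs of trained image and text encoders, which are not shown here.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_emb, temperature=0.07):
    """Classify images by cosine similarity to class-description embeddings.

    image_emb:      (num_images, dim) array, assumed output of an image encoder.
    class_text_emb: (num_classes, dim) array, assumed output of a text encoder
                    applied to each class name or description.
    temperature:    illustrative fixed value (an assumption of this sketch).
    """
    # L2-normalize so that dot products equal cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    weights = class_text_emb / np.linalg.norm(class_text_emb, axis=-1, keepdims=True)

    # The normalized text embeddings act as the weights of a linear classifier.
    logits = image_emb @ weights.T / temperature

    # Numerically stable softmax over the classes.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

# Toy stand-ins for encoder outputs: three classes, two images.
class_text_emb = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
image_emb = np.array([[0.1, 0.2, 0.9],
                      [0.8, 0.1, 0.1]])
preds, probs = zero_shot_classify(image_emb, class_text_emb)
```

In this toy setup each image is assigned to the class whose text embedding points in the most similar direction; no classifier weights are trained on the target dataset.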

During pre-training, given a batch of $N$ (image, text) pairs, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across the batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross-entropy loss is optimized over these similarity scores.
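The training objective described above can be sketched as follows. This is a hedged NumPy illustration, not the reference code: the names `log_softmax` and `clip_loss` are hypothetical, the embeddings are assumed to be already produced by the two encoders, and the temperature is held at a fixed illustrative value, which the text above does not specify.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix.

    image_emb, text_emb: (N, dim) arrays where row i of each forms a real pair.
    temperature:         illustrative fixed value (an assumption of this sketch).
    """
    # L2-normalize so that dot products equal cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = scaled similarity between image i and text j.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should select its own column (the diagonal).
    loss_images = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should select its own row.
    loss_texts = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_images + loss_texts) / 2

# Toy check: correctly paired embeddings versus deliberately mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
loss_matched = clip_loss(emb, emb)
loss_mismatched = clip_loss(emb, np.roll(emb, 1, axis=0))
```

The diagonal of the similarity matrix holds the $N$ real pairs and the off-diagonal entries hold the $N^2 - N$ incorrect pairings, so lowering this loss pulls real pairs together and pushes the others apart.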

Source: Learning Transferable Visual Models From Natural Language Supervision

Tasks

Task	Papers	Share
Language Modelling	78	7.61%
Zero-Shot Learning	47	4.59%
Retrieval	46	4.49%
Semantic Segmentation	41	4.00%
Image Generation	40	3.90%
Image Classification	32	3.12%
Large Language Model	23	2.24%
Image Captioning	22	2.15%
Visual Question Answering	17	1.66%

Categories

Image Representations

Vision and Language Pre-Trained Models