Contrastive Language-Image Pre-training (CLIP), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.
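The zero-shot step can be illustrated with a minimal NumPy sketch. It assumes the image and class-name embeddings have already been produced by the two encoders (the encoders themselves are not shown), and the function name `zero_shot_classify` is hypothetical. The normalized class-text embeddings act as the weight rows of a bias-free linear classifier, and each image is assigned to the class with the highest cosine similarity.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar.

    image_emb:      [B, D] array of image embeddings (assumed precomputed).
    class_text_emb: [K, D] array, one text embedding per class name or description.
    Returns (predicted class indices [B], cosine-similarity scores [B, K]).
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    weights = class_text_emb / np.linalg.norm(class_text_emb, axis=-1, keepdims=True)
    # The text embeddings serve as the rows of a linear classifier's weight matrix.
    scores = image_emb @ weights.T
    return scores.argmax(axis=-1), scores
```

For a new dataset, only `class_text_emb` changes: the class names or descriptions are embedded once with the text encoder and reused for every image, with no additional training.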
For pre-training, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross-entropy loss is optimized over these similarity scores.
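The objective described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the reference implementation: the embeddings are taken as given, the function names are hypothetical, and the similarity scores are scaled by a fixed `temperature` parameter that the text above does not specify (the value 0.07 is an assumption here). Row `i` of each input is assumed to come from the same (image, text) pair, so the correct pairings lie on the diagonal of the $N \times N$ similarity matrix.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: [N, D] arrays; row i of each is a real pair.

    temperature is an assumed fixed scaling factor, not given in the text.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # [N, N] matrix of scaled cosine similarities for all possible pairings.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Cross-entropy over rows: each image should pick its own text.
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Cross-entropy over columns: each text should pick its own image.
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)
```

Minimizing this quantity pushes the $N$ diagonal similarities up relative to the $N^2 - N$ off-diagonal ones in both directions, which is what makes the loss symmetric.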
Image credit: Learning Transferable Visual Models From Natural Language Supervision
Source: Learning Transferable Visual Models From Natural Language Supervision
Task | Papers | Share |
---|---|---|
Language Modelling | 78 | 7.61% |
Zero-Shot Learning | 47 | 4.59% |
Retrieval | 46 | 4.49% |
Semantic Segmentation | 41 | 4.00% |
Image Generation | 40 | 3.90% |
Image Classification | 32 | 3.12% |
Large Language Model | 23 | 2.24% |
Image Captioning | 22 | 2.15% |
Visual Question Answering | 17 | 1.66% |