In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together and pushes those of non-matched pairs apart. The resulting aligned representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries.
Source: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
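The contrastive objective described above can be made concrete with a short sketch. The snippet below is a minimal PyTorch illustration, not the authors' released code: the names `image_emb`, `text_emb`, and `align_contrastive_loss`, and the fixed `temperature` value, are assumptions for illustration (the paper treats the temperature as a learnable parameter).

```python
# Minimal sketch of a symmetric normalized-softmax contrastive loss,
# in the spirit of ALIGN. Names and the fixed temperature are illustrative.
import torch
import torch.nn.functional as F

def align_contrastive_loss(image_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_emb, text_emb: [batch, dim] embeddings from the two encoders,
    where row i of each tensor comes from the same (matched) pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.t() / temperature

    # For row i, the correct "class" is column i (its paired embedding).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Normalized softmax in both directions: pull matched pairs together,
    # push non-matched (in-batch negative) pairs apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Each image uses its paired text as the positive and the other texts in the batch as in-batch negatives, and vice versa, which is what aligns the two embedding spaces.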
| Task | Papers | Share |
|---|---|---|
| Domain Adaptation | 53 | 6.20% |
| Semantic Segmentation | 38 | 4.44% |
| Object Detection | 36 | 4.21% |
| Unsupervised Domain Adaptation | 33 | 3.86% |
| Language Modelling | 27 | 3.16% |
| Question Answering | 21 | 2.46% |
| Image Classification | 18 | 2.11% |
| Image Generation | 13 | 1.52% |
| Visual Question Answering (VQA) | 12 | 1.40% |