ALBEF introduces a contrastive loss to align the image and text representations before fusing them through cross-modal attention, which enables more grounded vision-and-language representation learning without requiring bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. An image-text contrastive loss aligns the unimodal representations of an image-text pair before fusion, while an image-text matching loss and a masked language modeling loss train the multimodal encoder to capture interactions between image and text. In addition, momentum distillation is used to generate pseudo-targets, which improves learning with noisy data.
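To make the alignment step concrete, below is a minimal PyTorch-style sketch of an image-text contrastive (ITC) loss of the kind described above. The projected feature shapes and the temperature value are illustrative assumptions rather than the authors' exact implementation, and the momentum-distillation pseudo-targets are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss that pulls matched image-text pairs
    together and pushes mismatched pairs apart (illustrative sketch;
    the projection dimensions and temperature are assumed values).

    image_feats, text_feats: (batch, dim) projected unimodal embeddings.
    """
    # Normalize so that dot products become cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = image_feats @ text_feats.t() / temperature

    # Matched pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In the full model, this alignment loss is computed on the unimodal encoder outputs before fusion, and is combined with the image-text matching and masked language modeling losses applied on the multimodal encoder.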
Source: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
| Task | Papers | Share |
|---|---|---|
| Retrieval | 7 | 11.29% |
| Text Retrieval | 7 | 11.29% |
| Question Answering | 5 | 8.06% |
| Visual Question Answering | 5 | 8.06% |
| Visual Question Answering (VQA) | 4 | 6.45% |
| Visual Grounding | 3 | 4.84% |
| Image Retrieval | 3 | 4.84% |
| Language Modelling | 3 | 4.84% |
| Cross-Modal Retrieval | 2 | 3.23% |