ALBEF

Introduced by Li et al. in Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

ALBEF applies an image-text contrastive loss to align the unimodal image and text representations of each image-text pair before fusing them through cross-modal attention, which enables more grounded vision and language representation learning. The model consists of an image encoder, a text encoder, and a multimodal encoder. An image-text matching loss and a masked language modeling loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation generates pseudo-targets from a momentum model, which improves learning from noisy data. ALBEF does not require bounding box annotations.
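To make the alignment step concrete, here is a minimal sketch of a symmetric image-text contrastive loss over one batch, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `log_softmax`, `itc_loss`) are hypothetical, the temperature of 0.07 is an assumed default, negatives come only from the current batch, and momentum distillation is left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an assumption for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Cosine similarity of every image with every text, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n

    # Text-to-image: same idea on the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    image_batch = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", itc_loss(image_batch, matched_text))
    print("swapped pairs:", itc_loss(image_batch, swapped_text))
```

The loss is small when each image embedding is closest to its own caption and large when the pairing is swapped, which is the property that lets the unimodal encoders be aligned before the multimodal encoder fuses them.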

Source: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Tasks

Task	Papers	Share
Retrieval	5	12.50%
Visual Question Answering (VQA)	4	10.00%
Language Modelling	3	7.50%
Question Answering	3	7.50%
Visual Question Answering	3	7.50%
Visual Grounding	2	5.00%
Image Retrieval	2	5.00%
Image-text matching	2	5.00%
Visual Reasoning	2	5.00%

Categories

Vision and Language Pre-Trained Models