Vision and Language Pre-Trained Models

Vision-Language Pre-Trained Model

Introduced by Bao et al. in VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

VLMo is a unified vision-language pre-trained model that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. It introduces a Mixture-of-Modality-Experts (MOME) Transformer, in which each block captures modality-specific information through a pool of modality experts (modality-specific feed-forward networks) and aligns content across modalities through a self-attention module shared by all modalities. The model parameters are shared across the image-text contrastive learning, masked language modeling, and image-text matching pre-training tasks. During fine-tuning, this flexible design allows VLMo to be used either as a dual encoder (separately encoding images and text for retrieval tasks) or as a fusion encoder (jointly encoding image-text pairs for richer cross-modal interaction). Stage-wise pre-training on image-only and text-only data further improves the model. VLMo can be fine-tuned for vision-language classification tasks, or used as a dual encoder for retrieval tasks.
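The MOME idea above can be sketched in a few lines: self-attention weights are shared across modalities, while the feed-forward network is swapped depending on the input modality. This is a minimal NumPy illustration, not the official implementation; the expert names ("vision", "language", "vl"), dimensions, and single-head attention are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoMEBlock:
    """Illustrative Mixture-of-Modality-Experts block (simplified sketch).

    The self-attention projections are shared by all modalities; only the
    feed-forward expert differs per modality, as in the VLMo description."""

    def __init__(self, d, rng):
        # Shared single-head self-attention parameters (assumption: one head).
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        # One feed-forward expert per modality (hypothetical names).
        self.experts = {
            m: (rng.standard_normal((d, 4 * d)) / np.sqrt(d),
                rng.standard_normal((4 * d, d)) / np.sqrt(4 * d))
            for m in ("vision", "language", "vl")
        }

    def __call__(self, x, modality):
        # Shared self-attention: aligns content across modalities.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        h = x + attn @ v
        # Modality expert: captures modality-specific information.
        W1, W2 = self.experts[modality]
        return h + np.maximum(h @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
block = MoMEBlock(d=16, rng=rng)
img_tokens = rng.standard_normal((4, 16))   # e.g. image patch embeddings
txt_tokens = rng.standard_normal((5, 16))   # e.g. word embeddings
out_v = block(img_tokens, "vision")
out_l = block(txt_tokens, "language")
```

Routing whole sequences to a single expert by modality (rather than learned per-token routing) mirrors how MOME differs from standard mixture-of-experts layers.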

Source: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
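In the dual-encoder (retrieval) mode mentioned above, images and text are encoded separately and matched by similarity; the image-text contrastive objective can be sketched as a symmetric cross-entropy over cosine similarities. A minimal NumPy sketch, assuming L2-normalized embeddings and an illustrative temperature value:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Image-text contrastive loss sketch: matched pairs sit on the diagonal.

    temperature=0.07 is an illustrative assumption, not VLMo's exact value."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (n, n) similarity matrix
    n = logits.shape[0]
    diag = np.arange(n)
    # Image-to-text and text-to-image cross-entropy, averaged.
    i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (i2t + t2i) / 2

rng = np.random.default_rng(0)
loss = itc_loss(rng.standard_normal((3, 8)), rng.standard_normal((3, 8)))
```

At retrieval time the same similarity matrix is all that is needed: rank candidate texts (or images) by their dot product with the query embedding.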



Task Papers Share
Image Retrieval 1 25.00%
Retrieval 1 25.00%
Visual Question Answering (VQA) 1 25.00%
Visual Reasoning 1 25.00%
