Vision and Language Pre-Trained Models

Vision-Language Pre-Trained Model

Introduced by Bao et al. in VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

VLMo is a unified vision-language pre-trained model that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. It introduces a Mixture-of-Modality-Experts (MOME) Transformer, in which each block captures modality-specific information through a pool of modality experts (modality-specific feed-forward networks) and aligns content across modalities through a self-attention module shared by all modalities. The model parameters are shared across the image-text contrastive learning, masked language modeling, and image-text matching pre-training tasks. During fine-tuning, this flexible design allows VLMo to be used either as a dual encoder (separately encoding images and text for retrieval tasks) or as a fusion encoder (jointly encoding image-text pairs for richer cross-modal interaction). Stage-wise pre-training on image-only and text-only data further improves the model. VLMo can be fine-tuned for vision-language classification tasks, or used as a dual encoder for retrieval tasks.
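The MOME idea above can be sketched in a few lines: self-attention weights are shared across modalities, while the feed-forward network is swapped depending on the input modality. This is a minimal NumPy illustration, not the official implementation; the expert names ("vision", "language", "vl"), dimensions, and single-head attention are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoMEBlock:
    """Illustrative Mixture-of-Modality-Experts block (simplified sketch).

    The self-attention projections are shared by all modalities; only the
    feed-forward expert differs per modality, as in the VLMo description."""

    def __init__(self, d, rng):
        # Shared single-head self-attention parameters (assumption: one head).
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        # One feed-forward expert per modality (hypothetical names).
        self.experts = {
            m: (rng.standard_normal((d, 4 * d)) / np.sqrt(d),
                rng.standard_normal((4 * d, d)) / np.sqrt(4 * d))
            for m in ("vision", "language", "vl")
        }

    def __call__(self, x, modality):
        # Shared self-attention: aligns content across modalities.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        h = x + attn @ v
        # Modality expert: captures modality-specific information.
        W1, W2 = self.experts[modality]
        return h + np.maximum(h @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
block = MoMEBlock(d=16, rng=rng)
img_tokens = rng.standard_normal((4, 16))   # e.g. image patch embeddings
txt_tokens = rng.standard_normal((5, 16))   # e.g. word embeddings
out_v = block(img_tokens, "vision")
out_l = block(txt_tokens, "language")
```

Routing whole sequences to a single expert by modality (rather than learned per-token routing) mirrors how MOME differs from standard mixture-of-experts layers.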

Source: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
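In the dual-encoder (retrieval) mode mentioned above, images and text are encoded separately and matched by similarity; the image-text contrastive objective can be sketched as a symmetric cross-entropy over cosine similarities. A minimal NumPy sketch, assuming L2-normalized embeddings and an illustrative temperature value:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Image-text contrastive loss sketch: matched pairs sit on the diagonal.

    temperature=0.07 is an illustrative assumption, not VLMo's exact value."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (n, n) similarity matrix
    n = logits.shape[0]
    diag = np.arange(n)
    # Image-to-text and text-to-image cross-entropy, averaged.
    i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (i2t + t2i) / 2

rng = np.random.default_rng(0)
loss = itc_loss(rng.standard_normal((3, 8)), rng.standard_normal((3, 8)))
```

At retrieval time the same similarity matrix is all that is needed: rank candidate texts (or images) by their dot product with the query embedding.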



Task Papers Share
Image Retrieval 1 25.00%
Retrieval 1 25.00%
Visual Question Answering (VQA) 1 25.00%
Visual Reasoning 1 25.00%
