Vision and Language Pre-Trained Models

Unified VLP is a unified encoder-decoder model for general vision-language pre-training. It uses a shared multi-layer Transformer network for both encoding and decoding, and is pre-trained on a large amount of image-text pairs with two unsupervised learning objectives: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction.

For pre-training, the input comprises an image, a sentence, and three special tokens ([CLS], [SEP], [STOP]). The image is processed as $N$ Regions of Interest (RoIs), from which region features are extracted. The sentence is tokenized, and some tokens are replaced with [MASK] tokens for the masked language modeling task. The model consists of 12 Transformer blocks, each with a masked self-attention layer and a feed-forward module, where the self-attention mask controls what input context the prediction conditions on. Two self-attention masks are used depending on whether the objective is bidirectional or seq2seq (see the sketch below). The model is fine-tuned for image captioning and visual question answering.

Source: Unified Vision-Language Pre-Training for Image Captioning and VQA
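The self-attention mask is what distinguishes the two pre-training objectives: the bidirectional mask lets every position attend to the full context, while the seq2seq mask keeps the visual context fully visible but restricts caption tokens to earlier caption positions. Below is a minimal PyTorch sketch of how such masks could be constructed; the function name, argument names, and the assumed ordering (visual positions first, then caption tokens) are illustrative assumptions rather than the paper's actual code.

```python
import torch


def build_attention_masks(num_img: int, num_txt: int):
    """Sketch of the two self-attention masks used during pre-training.

    Assumed layout (illustrative): positions 0..num_img-1 hold the visual
    context ([CLS] plus region features and [SEP]); positions num_img..end
    hold the caption tokens (ending with [STOP]).

    Returns two boolean masks of shape (L, L); True means "may attend to".
    """
    L = num_img + num_txt

    # Bidirectional objective: every position attends to every other
    # position, as in BERT-style masked language modeling.
    bidirectional = torch.ones(L, L, dtype=torch.bool)

    # Seq2seq objective: visual positions attend only among themselves;
    # caption positions attend to all visual positions and to caption
    # positions up to and including their own (causal on the text side).
    seq2seq = torch.zeros(L, L, dtype=torch.bool)
    seq2seq[:num_img, :num_img] = True                         # image -> image
    seq2seq[num_img:, :num_img] = True                          # text  -> image
    seq2seq[num_img:, num_img:] = (                              # text  -> earlier text
        torch.ones(num_txt, num_txt).tril().bool()
    )
    return bidirectional, seq2seq


if __name__ == "__main__":
    bi, s2s = build_attention_masks(num_img=4, num_txt=3)
    print(s2s.int())  # lower-triangular block on the text side
```

In practice, the same shared Transformer weights are used with either mask, so switching between the bidirectional and seq2seq objectives only changes which entries of the attention matrix are allowed, not the model parameters.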
