Vision and Language Pre-Trained Models

VisualBERT aims to reuse self-attention to implicitly align elements of the input text and regions in the input image. Visual embeddings are used to model images where the representations are represented by a bounding region in an image obtained from an object detector. These visual embeddings are constructed by summing three embeddings: 1) visual feature representation, 2) a segment embedding indicate whether it is an image embedding, and 3) position embedding. Essentially, image regions and language are combined with a Transformer to allow self-attention to discover implicit alignments between language and vision. VisualBERT is trained using COCO, which consists of images paired with captions. It is pre-trained using two objectives: masked language modeling objective and sentence-image prediction task. It can then be fine-tuned on different downstream tasks.

Source: VisualBERT: A Simple and Performant Baseline for Vision and Language


Paper Code Results Date Stars


Component Type
🤖 No Components Found You can add them if they exist; e.g. Mask R-CNN uses RoIAlign