VisualBERT

Introduced by Li et al. in VisualBERT: A Simple and Performant Baseline for Vision and Language

VisualBERT reuses the self-attention of a Transformer to implicitly align elements of the input text with regions of the input image. Images are modeled with visual embeddings, each of which corresponds to a bounding region in the image obtained from an object detector. Each visual embedding is constructed by summing three embeddings: 1) a visual feature representation of the region, 2) a segment embedding indicating that it is an image embedding rather than a text embedding, and 3) a position embedding. Image regions and language are fed jointly to the Transformer, which allows self-attention to discover implicit alignments between language and vision. VisualBERT is pre-trained on COCO, which consists of images paired with captions, using two objectives: masked language modeling and a sentence-image prediction task. It can then be fine-tuned on different downstream tasks.
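
The three-way sum that forms each visual embedding, and the joint text-plus-image sequence passed to the Transformer, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The sizes, the random stand-ins for learned parameters, the helper names (`project`, `sum_vectors`, `visual_embedding`, `text_embedding`), and the choice of position id 0 for every region are all assumptions made for illustration.

```python
import random

HIDDEN = 4        # hypothetical hidden size, kept tiny for readability
FEATURE_DIM = 6   # hypothetical size of one detector region feature
TEXT_SEGMENT, IMAGE_SEGMENT = 0, 1

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random stand-in for a learned parameter matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def project(vector, weight):
    """Multiply a weight matrix (out_dim x in_dim) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def sum_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

# Illustrative parameters (learned in a real model, random here).
projection = random_matrix(HIDDEN, FEATURE_DIM)  # region feature -> hidden size
segment_table = random_matrix(2, HIDDEN)         # row 0: text, row 1: image
position_table = random_matrix(16, HIDDEN)
token_table = random_matrix(10, HIDDEN)          # toy vocabulary of 10 tokens

def visual_embedding(region_feature, position_id):
    """Sum of 1) visual feature, 2) segment and 3) position embeddings."""
    return sum_vectors(
        project(region_feature, projection),
        segment_table[IMAGE_SEGMENT],
        position_table[position_id],
    )

def text_embedding(token_id, position_id):
    """Token, segment and position embeddings summed for one text token."""
    return sum_vectors(
        token_table[token_id],
        segment_table[TEXT_SEGMENT],
        position_table[position_id],
    )

# Toy input: three caption tokens and two detected regions.
token_ids = [1, 5, 7]
region_features = [[rng.random() for _ in range(FEATURE_DIM)] for _ in range(2)]

text_part = [text_embedding(t, i) for i, t in enumerate(token_ids)]
# Assumption: every region uses position id 0 in this sketch.
image_part = [visual_embedding(f, 0) for f in region_features]

# One joint sequence, so self-attention can relate words to regions.
sequence = text_part + image_part
print(len(sequence), len(sequence[0]))
```

Because text and image embeddings share one hidden size and one sequence, every Transformer layer can attend from any word to any region and back, which is what lets alignments emerge without explicit supervision.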

Source: VisualBERT: A Simple and Performant Baseline for Vision and Language

Tasks

Task	Papers	Share
Visual Question Answering (VQA)	8	14.55%
Question Answering	6	10.91%
Visual Question Answering	6	10.91%
Visual Reasoning	4	7.27%
Language Modelling	3	5.45%
Image Captioning	2	3.64%
Multimodal Deep Learning	2	3.64%
Visual Commonsense Reasoning	2	3.64%
Visual Entailment	2	3.64%

Categories

Vision and Language Pre-Trained Models