VisualBERT aims to reuse self-attention to implicitly align elements of the input text with regions in the input image. Images are modeled with visual embeddings, where each embedding corresponds to a bounding region produced by an object detector. A visual embedding is constructed by summing three embeddings: 1) the visual feature representation of the region, 2) a segment embedding indicating that it is an image (rather than text) embedding, and 3) a position embedding. Image regions and language are then fed jointly into a Transformer so that self-attention can discover implicit alignments between language and vision. VisualBERT is pre-trained on COCO, which consists of images paired with captions, using two objectives: a masked language modeling objective and a sentence-image prediction task. It can then be fine-tuned on different downstream tasks.
Source: VisualBERT: A Simple and Performant Baseline for Vision and Language
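The description above says each region embedding is the sum of a projected visual feature, a segment embedding, and a position embedding. The sketch below illustrates that construction in PyTorch; the class name, dimensions (2048-d detector features, 768-d hidden size, 36 regions), and overall structure are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VisualEmbeddings(nn.Module):
    """Minimal sketch of VisualBERT-style visual embeddings (assumed layout):
    each detected region's feature vector is projected to the Transformer
    hidden size and summed with a segment embedding (marking it as an image
    element rather than a text token) and a position embedding."""

    def __init__(self, visual_feature_dim=2048, hidden_dim=768, max_regions=36):
        super().__init__()
        # Project object-detector region features into the hidden space.
        self.visual_projection = nn.Linear(visual_feature_dim, hidden_dim)
        # Segment embedding: index 0 = text token, index 1 = image region (assumed convention).
        self.segment_embeddings = nn.Embedding(2, hidden_dim)
        # Position embedding over region slots.
        self.position_embeddings = nn.Embedding(max_regions, hidden_dim)

    def forward(self, region_features):
        # region_features: (batch, num_regions, visual_feature_dim)
        batch_size, num_regions, _ = region_features.shape
        positions = torch.arange(num_regions, device=region_features.device)
        segment_ids = torch.ones(num_regions, dtype=torch.long,
                                 device=region_features.device)  # all "image"
        # Sum of the three embeddings, as in the description above.
        return (self.visual_projection(region_features)
                + self.segment_embeddings(segment_ids)
                + self.position_embeddings(positions))


# Usage: embed 36 detector regions for a batch of 2 images.
visual_embeds = VisualEmbeddings()(torch.randn(2, 36, 2048))  # -> (2, 36, 768)
```

These visual embeddings would then be concatenated with the standard BERT text embeddings before being passed through the Transformer layers.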
| Task | Papers | Share |
|---|---|---|
| Visual Question Answering (VQA) | 8 | 12.70% |
| Question Answering | 6 | 9.52% |
| Visual Question Answering | 6 | 9.52% |
| Language Modelling | 4 | 6.35% |
| Visual Reasoning | 4 | 6.35% |
| Image Captioning | 3 | 4.76% |
| Multimodal Deep Learning | 2 | 3.17% |
| Visual Commonsense Reasoning | 2 | 3.17% |
| Visual Entailment | 2 | 3.17% |