Vision and Language Pre-Trained Models

Visual-Linguistic BERT

Introduced by Su et al. in VL-BERT: Pre-training of Generic Visual-Linguistic Representations

VL-BERT is pre-trained on a large-scale image-caption dataset together with a text-only corpus. The inputs to the model are either words from the input sentences or regions-of-interest (RoIs) from the input images. It can be fine-tuned to fit most visual-linguistic downstream tasks. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual content by adding a new type of visual feature embedding to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as RoIs in images and subwords in input sentences. Four different types of embeddings are used to represent each input element: token embedding, visual feature embedding, segment embedding, and sequence position embedding. VL-BERT is pre-trained using Conceptual Captions and text-only datasets. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.
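As a rough illustration of how the four embedding types combine into a single input representation, the sketch below sums token, visual feature, segment, and sequence position embeddings for each input element (subword or RoI) before feeding them to the Transformer encoder. This is a minimal PyTorch sketch under assumed dimensions and layer names; the class name, the linear projection of RoI features, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VLBertEmbeddings(nn.Module):
    """Hypothetical sketch of VL-BERT-style input embeddings.

    Each input element (subword token or RoI) is represented as the sum of
    four embeddings: token, visual feature, segment, and sequence position.
    Dimensions and layer names are assumptions for illustration.
    """

    def __init__(self, vocab_size=30522, hidden_size=768,
                 visual_feat_dim=2048, num_segments=3, max_positions=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)        # subword id (special [IMG] id for RoI slots)
        self.visual_proj = nn.Linear(visual_feat_dim, hidden_size)    # project RoI appearance features
        self.segment_emb = nn.Embedding(num_segments, hidden_size)    # e.g. sentence A / sentence B / image
        self.position_emb = nn.Embedding(max_positions, hidden_size)  # sequence position
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, visual_feats, segment_ids, position_ids):
        # token_ids:    (batch, seq_len)                    subword ids
        # visual_feats: (batch, seq_len, visual_feat_dim)   RoI features
        # segment_ids:  (batch, seq_len)
        # position_ids: (batch, seq_len)
        x = (self.token_emb(token_ids)
             + self.visual_proj(visual_feats)
             + self.segment_emb(segment_ids)
             + self.position_emb(position_ids))
        return self.layer_norm(x)


# Tiny usage example with random inputs (batch of 2, sequence length 8).
emb = VLBertEmbeddings()
tokens = torch.randint(0, 30522, (2, 8))
roi_feats = torch.randn(2, 8, 2048)
segments = torch.zeros(2, 8, dtype=torch.long)
positions = torch.arange(8).unsqueeze(0).expand(2, -1)
out = emb(tokens, roi_feats, segments, positions)
print(out.shape)  # torch.Size([2, 8, 768])
```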

Source: VL-BERT: Pre-training of Generic Visual-Linguistic Representations
