Vision-and-Language BERT

Introduced by Lu et al. in ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model that processes visual and textual inputs in separate streams, which interact through co-attentional transformer layers.

Source: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
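
The two-stream interaction described above can be illustrated with a minimal sketch. It assumes a single attention head and omits the learned query/key/value projections, residual connections, layer normalization, and feed-forward sublayers that a full transformer layer would include, so it is not the authors' implementation. The names `attend` and `co_attention` and the toy feature values are hypothetical, chosen only for illustration. The point it shows is the exchange itself: each stream forms queries from its own features and reads keys and values from the other stream.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention: each query row becomes a
    # weighted average of the value rows.
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def co_attention(visual, textual):
    # Hypothetical simplification: no learned projections.
    # Visual stream queries the textual stream, and vice versa.
    new_visual = attend(visual, textual, textual)
    new_textual = attend(textual, visual, visual)
    return new_visual, new_textual

# Toy inputs (assumed values): 3 image-region features, 2 token features.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
textual = [[0.5, 0.5], [1.0, -1.0]]

new_visual, new_textual = co_attention(visual, textual)
print(len(new_visual), len(new_textual))
```

Each stream keeps its own sequence length (three regions, two tokens), while the content of every output row is drawn from the other modality. That is why the streams can stay separate and still condition on each other.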

Tasks

Task	Papers	Share
Visual Question Answering (VQA)	9	9.47%
Visual Question Answering	8	8.42%
Question Answering	7	7.37%
Retrieval	7	7.37%
Image Captioning	4	4.21%
Visual Commonsense Reasoning	4	4.21%
Referring Expression	3	3.16%
Language Modelling	3	3.16%
Visual Dialog	3	3.16%

Categories

Representation Learning

Transformers

Vision and Language Pre-Trained Models