Vision and Language Pre-Trained Models

Pixel-BERT is a pre-trained model that aligns image pixels with text. Its end-to-end framework combines a CNN-based visual encoder with cross-modal transformers to learn visual and language embeddings jointly. The model has three parts: a fully convolutional neural network that takes the pixels of an image as input, a BERT-based word-level token embedding, and a multi-modal transformer that jointly learns the visual and language embeddings.
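
The following is a minimal PyTorch sketch of this three-part layout. The module names, backbone choice (ResNet-50), dimensions, and layer counts are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the three components: pixel-level CNN encoder, BERT-style word
# embeddings, and a multimodal transformer over the concatenated tokens.
import torch
import torch.nn as nn
import torchvision.models as models


class PixelBERTSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=768, num_layers=12, num_heads=12):
        super().__init__()
        # 1) Fully convolutional visual encoder over raw pixels (ResNet without its head).
        resnet = models.resnet50(weights=None)
        self.visual_backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.visual_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)

        # 2) BERT-style word-level token embedding.
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)

        # 3) Multimodal transformer that jointly attends over visual and text tokens.
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.cross_modal_transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, images, token_ids):
        # Pixels -> spatial feature map -> sequence of "pixel feature" tokens.
        feat = self.visual_proj(self.visual_backbone(images))   # (B, D, H', W')
        visual_tokens = feat.flatten(2).transpose(1, 2)          # (B, H'*W', D)

        text_tokens = self.token_embed(token_ids)                # (B, L, D)

        # Concatenate both modalities and learn joint embeddings.
        joint = torch.cat([text_tokens, visual_tokens], dim=1)
        return self.cross_modal_transformer(joint)
```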

For language, following other pre-training works, it uses Masked Language Modeling (MLM), predicting masked tokens from the surrounding text and the image. For vision, it uses a random pixel sampling mechanism that addresses the difficulty of predicting pixel-level features directly; this mechanism also helps mitigate overfitting and improves the robustness of the visual features. Both pre-training inputs are sketched below.
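
The sketch below shows how the two inputs can be prepared: BERT-style token masking for MLM and random sampling of pixel features for the visual branch. The mask ratio, sample count, and special-token id are assumptions rather than the paper's exact configuration.

```python
# Illustrative pre-training input preparation for MLM and random pixel sampling.
import torch

MASK_ID = 103            # [MASK] id in the standard BERT vocabulary
MASK_PROB = 0.15         # fraction of text tokens to mask (BERT default, assumed here)
NUM_PIXEL_SAMPLES = 100  # number of pixel features kept per image (assumed)


def mask_tokens(token_ids):
    """Randomly replace tokens with [MASK]; the model must recover them
    from the surrounding text *and* the image."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < MASK_PROB
    masked_ids = token_ids.masked_fill(mask, MASK_ID)
    labels[~mask] = -100   # ignore unmasked positions in the MLM loss
    return masked_ids, labels


def sample_pixel_features(visual_tokens):
    """Randomly keep a subset of the flattened pixel features, (B, N, D) -> (B, K, D).
    Dropping most spatial positions regularizes training and reduces cost."""
    b, n, _ = visual_tokens.shape
    idx = torch.stack([torch.randperm(n)[:NUM_PIXEL_SAMPLES] for _ in range(b)])
    return torch.gather(visual_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)))
```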

To model vision-and-language interaction, it applies Image-Text Matching (ITM), a binary classification of whether an image and a sentence form a matching pair.
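
A small sketch of an ITM head follows: pairs are labeled as matched or mismatched (e.g. by pairing an image with a caption from another image), and a binary classifier on the joint representation predicts the label. The pooling choice and head shape are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ITMHead(nn.Module):
    """Binary classifier over the joint image-text representation."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)  # matched vs. mismatched

    def forward(self, joint_embeddings):
        # Use the first joint token (e.g. [CLS]) as the pair-level representation.
        pooled = joint_embeddings[:, 0]
        return self.classifier(pooled)


# Usage with the sketch model above (hypothetical names):
#   logits = ITMHead()(model(images, token_ids))
#   itm_loss = nn.CrossEntropyLoss()(logits, match_labels)  # 1 = matched, 0 = mismatched
```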

Cross-modality tasks such as VQA and image captioning require understanding both language and visual semantics. Region-based visual features extracted from object detection models like Faster R-CNN are widely used in recent methods to boost performance; Pixel-BERT instead learns visual embeddings directly from image pixels.

Source: Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
