Vision and Language Pre-Trained Models

Pixel-BERT is a pre-trained model that aligns image pixels with text. The end-to-end framework learns visual and language embeddings with three parts: a fully convolutional neural network that takes raw image pixels as input, a BERT-based word-level token embedding, and a multi-modal transformer that jointly learns visual and language representations.
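The three parts above can be sketched as a minimal PyTorch module. This is a hypothetical illustration, not the paper's exact architecture: the dimensions, layer counts, and the simple two-layer CNN are placeholder choices.

```python
import torch
import torch.nn as nn

class PixelBERTSketch(nn.Module):
    """Illustrative three-part sketch: CNN visual encoder, BERT-style
    token embedding, and a joint multi-modal transformer."""

    def __init__(self, vocab_size=1000, dim=64, n_heads=4, n_layers=2):
        super().__init__()
        # 1) Fully convolutional visual encoder: raw pixels in,
        #    a grid of visual feature vectors out.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, dim, kernel_size=3, stride=2, padding=1),
        )
        # 2) Word-level token embedding (BERT-style: token + position).
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(512, dim)
        # 3) Multi-modal transformer over the concatenated
        #    visual-token + text-token sequence.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, pixels, token_ids):
        # pixels: (B, 3, H, W); token_ids: (B, T)
        feat = self.cnn(pixels)                # (B, dim, h, w)
        vis = feat.flatten(2).transpose(1, 2)  # (B, h*w, dim) pixel tokens
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        txt = self.tok_emb(token_ids) + self.pos_emb(pos)
        joint = torch.cat([vis, txt], dim=1)   # joint cross-modal sequence
        return self.transformer(joint)         # (B, h*w + T, dim)

model = PixelBERTSketch()
out = model(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 7)))
print(out.shape)  # 32x32 input -> 8x8 = 64 visual tokens, plus 7 text tokens
```

The key design point is that the transformer attends over visual and text tokens in one sequence, so cross-modal alignment is learned directly from pixels rather than from pre-extracted region features.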

For language, following other pre-training work, it uses Masked Language Modeling (MLM) to predict masked tokens from the surrounding text and the image. For vision, it uses a random pixel sampling mechanism that sidesteps the difficulty of predicting pixel-level features; this mechanism also mitigates overfitting and improves the robustness of the visual features.
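The random pixel sampling mechanism can be sketched as follows: during pre-training, only a random subset of the CNN's pixel feature vectors is kept for each image. The function name and the batched-tensor interface here are illustrative assumptions, not the paper's code.

```python
import torch

def random_pixel_sampling(visual_feats, k, training=True):
    """Keep k randomly chosen pixel features per image (sketch).

    visual_feats: (B, N, D) flattened feature-map tokens from the CNN.
    Sampling a subset cuts compute and acts as a regularizer
    against overfitting; at inference time all features are kept.
    """
    B, N, D = visual_feats.shape
    if not training or k >= N:
        return visual_feats
    # Independent random permutation per image, truncated to k indices.
    idx = torch.stack([torch.randperm(N)[:k] for _ in range(B)])  # (B, k)
    return torch.gather(visual_feats, 1, idx.unsqueeze(-1).expand(B, k, D))

feats = torch.randn(2, 64, 32)                 # 64 pixel tokens per image
sampled = random_pixel_sampling(feats, k=10)   # keep 10 of them
print(sampled.shape)
```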

For vision-language interaction, it applies Image-Text Matching (ITM) to classify whether an image and a sentence form a matching pair.
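The ITM objective reduces to binary classification on the joint representation. A minimal sketch, assuming mean pooling over the joint sequence (a simplification; BERT-style models typically classify from a special [CLS] token):

```python
import torch
import torch.nn as nn

dim = 64
itm_head = nn.Linear(dim, 2)             # 2 classes: mismatch / match

# Joint sequence output of the multi-modal transformer (batch of 4).
joint = torch.randn(4, 71, dim)
pooled = joint.mean(dim=1)               # (4, dim) pooled representation
logits = itm_head(pooled)                # (4, 2)

# Label 1 = the image and sentence are a true pair, 0 = a sampled negative.
labels = torch.tensor([1, 0, 1, 0])
loss = nn.CrossEntropyLoss()(logits, labels)
print(logits.shape, loss.item())
```

In pre-training, negatives are typically formed by pairing an image with a sentence from a different example in the batch.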

Cross-modality tasks such as VQA and image captioning require understanding both language and visual semantics. In a newer version of the model, region-based visual features extracted with object detection models such as Faster R-CNN are used for better performance.

Source: Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers



