Vision and Language Pre-Trained Models

InterBERT aims to model the interaction between the information flows of different modalities. The architecture builds multi-modal interaction while preserving the independence of single-modal representations. It consists of an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained with three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching.

Source: InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
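
To make the masked segment modeling task named above more concrete, here is a minimal sketch of how a training example might be prepared. It assumes that a "segment" is a contiguous span of text tokens and that masked positions are replaced by a single placeholder token. The names mask_segment, sample_segment, MASK, and max_len are hypothetical and chosen for illustration; they are not taken from the InterBERT code, and the span-length policy here is an assumption rather than the authors' setting.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not taken from the source


def mask_segment(tokens, start, length, mask_token=MASK):
    """Mask the contiguous span tokens[start:start+length].

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    end = min(start + length, len(tokens))
    masked = list(tokens)  # copy so the input is left untouched
    targets = {}
    for i in range(start, end):
        targets[i] = tokens[i]
        masked[i] = mask_token
    return masked, targets


def sample_segment(num_tokens, max_len, rng):
    """Pick a random (start, length) span that fits inside the sequence."""
    length = rng.randint(1, min(max_len, num_tokens))
    start = rng.randint(0, num_tokens - length)
    return start, length


tokens = "a dog runs across the green field".split()
masked, targets = mask_segment(tokens, start=2, length=3)
print(masked)   # three consecutive tokens replaced by the mask token
print(targets)  # original tokens at the masked positions
```

Masking a whole span rather than isolated tokens removes nearby textual context, so a model trained this way has more reason to rely on the image side. Masked region modeling could be sketched in the same way over a list of image region features instead of text tokens.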




Task                            Papers   Share
Image Retrieval                 1        20.00%
Image-text matching             1        20.00%
Retrieval                       1        20.00%
Text Matching                   1        20.00%
Visual Commonsense Reasoning    1        20.00%

