InterBERT models the interaction between the information flows of different modalities. The architecture builds multi-modal interaction while preserving the independence of each single-modality representation. It consists of an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained on three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching.
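The first pre-training task can be illustrated with a minimal sketch. It assumes that a "segment" is one contiguous run of text tokens and that masked positions are replaced by a `[MASK]` placeholder; the description above does not spell out either detail. The function name `mask_segment` and the `max_len` parameter are hypothetical and not taken from the InterBERT code.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not confirmed by the source


def mask_segment(tokens, max_len=3, rng=None):
    """Mask one contiguous run of tokens (hypothetical helper).

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = rng or random.Random()
    if not tokens:
        return [], {}
    # Pick a segment length, then a start index that keeps it in bounds.
    seg_len = rng.randint(1, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - seg_len)
    masked = list(tokens)
    targets = {}
    for i in range(start, start + seg_len):
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


if __name__ == "__main__":
    caption = ["a", "man", "rides", "a", "horse"]
    masked, targets = mask_segment(caption, rng=random.Random(0))
    print(masked)
    print(targets)
```

Masked region modeling could be sketched the same way over image region features instead of tokens, again as an assumption rather than a description of the released implementation.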
Source: InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Task | Papers | Share |
---|---|---|
Image Retrieval | 1 | 20.00% |
Image-text matching | 1 | 20.00% |
Retrieval | 1 | 20.00% |
Text Matching | 1 | 20.00% |
Visual Commonsense Reasoning | 1 | 20.00% |