InterBERT

Introduced by Lin et al. in InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

InterBERT models the interaction between the information flows of different modalities. The architecture builds multi-modal interaction while preserving the independence of each single-modal representation. It consists of an image embedding layer, a text embedding layer, a single-stream interaction module, and a two-stream extraction module. The model is pre-trained on three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching.
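
One way to picture the three pre-training tasks is through the inputs they might consume. The following is a minimal sketch, not the authors' code. It assumes that masked segment modeling hides a contiguous span of text tokens and that image-text matching draws mismatched captions as negatives. The same span helper is reused on a list of region identifiers purely for illustration, since the rule for choosing masked regions is not described here. The names `mask_span` and `matching_example`, the `[MASK]` placeholder, and the parameters `max_len` and `negative_prob` are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder symbol, not taken from the paper


def mask_span(items, max_len=3, rng=None):
    """Hide one contiguous span of `items` (assumed masking scheme).

    Returns (masked_items, targets), where targets maps each masked
    position to the original item the model would have to recover.
    """
    rng = rng or random.Random()
    if not items:
        return [], {}
    span_len = rng.randint(1, min(max_len, len(items)))
    start = rng.randint(0, len(items) - span_len)
    masked = list(items)
    targets = {}
    for i in range(start, start + span_len):
        targets[i] = items[i]
        masked[i] = MASK
    return masked, targets


def matching_example(pairs, index, negative_prob=0.5, rng=None):
    """Build one image-text matching example as (image, text, label).

    With probability `negative_prob` the text is swapped for the caption
    of a different image and the label is 0; otherwise the label is 1.
    """
    rng = rng or random.Random()
    image, text = pairs[index]
    if len(pairs) > 1 and rng.random() < negative_prob:
        other = rng.choice([i for i in range(len(pairs)) if i != index])
        return image, pairs[other][1], 0
    return image, text, 1


# Usage: one masked text segment, one masked run of regions, one matching pair.
rng = random.Random(0)
tokens = "a dog chases a red ball".split()
masked_tokens, token_targets = mask_span(tokens, rng=rng)
regions = ["region_0", "region_1", "region_2", "region_3"]
masked_regions, region_targets = mask_span(regions, max_len=2, rng=rng)
pairs = [("image_0", "a dog chases a red ball"), ("image_1", "two cats sleep")]
image, text, label = matching_example(pairs, 0, rng=rng)
```

In this sketch the text and region streams are masked independently, so each masked position has a target from its own modality only, while the matching label is defined over the image-text pair as a whole.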

Source: InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Tasks

Task	Papers	Share
Image Retrieval	1	20.00%
Image-text matching	1	20.00%
Retrieval	1	20.00%
Text Matching	1	20.00%
Visual Commonsense Reasoning	1	20.00%

Categories

Vision and Language Pre-Trained Models