One Representation

Introduced by Jang et al. in Unifying Vision-Language Representation Space with Single-tower Transformer

In the OneR (One Representation) method, the model input can be an image, text, or an image-text pair, and the CMC objective is combined with the traditional image-text contrastive (ITC) loss. Masked modeling is also carried out for all three input types (i.e., image, text, and multi-modal). The framework employs no modality-specific architectural component except for the initial token embedding layer, making the model generic and modality-agnostic with minimal inductive bias.

Source: Unifying Vision-Language Representation Space with Single-tower Transformer
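
To make the ITC term mentioned above concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It assumes the common formulation (cosine similarities scaled by a temperature, with cross-entropy applied in both the image-to-text and text-to-image directions); the paper's exact formulation may differ, and the CMC objective and masked modeling are not covered. The function names and the default temperature of 0.07 are illustrative assumptions, not taken from the source.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same
    image-text pair; every other combination in the batch is a negative.
    The temperature default is an assumed, commonly used value.
    """
    if len(image_embs) != len(text_embs) or not image_embs:
        raise ValueError("expected a non-empty batch of paired embeddings")
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Transpose to get the text-to-image direction
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

For example, `itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` is close to zero because each image matches its own text, while swapping the two text embeddings produces a much larger loss. In a single-tower setup such as the one described above, both sets of embeddings would come from the same shared encoder.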

Tasks

Task	Papers	Share
Object Localization	1	33.33%
Retrieval	1	33.33%
Visual Reasoning	1	33.33%

Categories

Vision and Language Pre-Trained Models