In the OneR method, the model input can be an image, text, or an image+text pair, and the CMC objective is combined with the traditional image-text contrastive (ITC) loss. Masked modeling is also carried out for all three input types (i.e., image, text, and multi-modal). The framework employs no modality-specific architectural components except for the initial token embedding layers, making the model generic and modality-agnostic with minimal inductive bias; a sketch of this single-tower setup follows below.
Source: Unifying Vision-Language Representation Space with Single-tower Transformer
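To make the single-tower idea concrete, the sketch below shows one shared Transformer encoder consuming image patches, text tokens, or both, with modality-specific parameters limited to the initial embedding layers, trained with a standard in-batch ITC loss. This is a minimal illustration under stated assumptions: the class and function names (`OneTowerSketch`, `itc_loss`), dimensions, and pooling choice are hypothetical and not taken from the paper, and the CMC objective and masked-modeling heads are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneTowerSketch(nn.Module):
    """One shared encoder; only the input embeddings are modality-specific."""
    def __init__(self, dim=256, vocab_size=30522, patch_dim=768, num_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)   # text-specific
        self.patch_embed = nn.Linear(patch_dim, dim)      # image-specific
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # shared tower
        self.proj = nn.Linear(dim, dim)

    def forward(self, image=None, text=None):
        parts = []
        if image is not None:                 # (B, num_patches, patch_dim)
            parts.append(self.patch_embed(image))
        if text is not None:                  # (B, seq_len) token ids
            parts.append(self.text_embed(text))
        x = torch.cat(parts, dim=1)           # image, text, or image+text
        pooled = self.encoder(x).mean(dim=1)  # simple mean pooling
        return F.normalize(self.proj(pooled), dim=-1)

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Standard bidirectional image-text contrastive loss over in-batch negatives."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage: 4 image-caption pairs
model = OneTowerSketch()
images = torch.randn(4, 196, 768)            # flattened 16x16x3 patches
captions = torch.randint(0, 30522, (4, 32))  # tokenized text
loss = itc_loss(model(image=images), model(text=captions))
```

In the full framework described above, the same shared tower would additionally receive masked image, masked text, and masked image+text inputs for the masked-modeling objectives, alongside the CMC and ITC losses.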
| Task | Papers | Share |
|---|---|---|
| Anomaly Detection | 1 | 25.00% |
| Object Localization | 1 | 25.00% |
| Retrieval | 1 | 25.00% |
| Visual Reasoning | 1 | 25.00% |