OSCAR (Object-Semantics Aligned Pre-training) is a vision-language pre-training method that uses object tags detected in images as anchor points to ease the learning of image-text alignment. The model takes a (word, tag, region) triple as input and is pre-trained with two losses: a masked token loss over the words and tags, and a contrastive loss that distinguishes the original tag sequence from a "polluted" one. Because object tags are plain words, image regions and text are represented in the same semantic space as the word embeddings of a pre-trained language model, with the tags serving as alignment anchors. The pre-trained model is then fine-tuned for both understanding and generation tasks.
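A minimal sketch of the two pre-training losses described above, in pure Python. The dimensions, masking pattern, and scores here are toy values for illustration, not the paper's actual configuration (OSCAR uses BERT-scale transformers); the loss shapes, however, match the description: cross-entropy over masked word/tag positions, plus a binary classification of whether the tag sequence is the original or a polluted replacement.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary size; the real model uses a BERT wordpiece vocab.
VOCAB = 50

def masked_token_loss(logits_per_pos, targets, mask):
    """Cross-entropy averaged over the masked word/tag positions."""
    total, n = 0.0, 0
    for logits, tgt, m in zip(logits_per_pos, targets, mask):
        if not m:
            continue  # only masked positions contribute
        z = max(logits)
        log_sum_exp = z + math.log(sum(math.exp(x - z) for x in logits))
        total += log_sum_exp - logits[tgt]  # -log softmax(logits)[tgt]
        n += 1
    return total / n

def contrastive_loss(score, label):
    """Binary cross-entropy: is the tag sequence original (1) or polluted (0)?"""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Toy forward pass: 8 token positions, 2 of them masked for prediction.
logits = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(8)]
targets = [random.randrange(VOCAB) for _ in range(8)]
mask = [1, 0, 0, 1, 0, 0, 0, 0]

mtl = masked_token_loss(logits, targets, mask)
cl = contrastive_loss(random.gauss(0, 1), label=1)  # positive (unpolluted) pair
total = mtl + cl  # pre-training objective = sum of the two losses
```

In the real model the logits come from a transformer over the concatenated word/tag/region sequence, and the contrastive score from a classification head on the `[CLS]` output.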
Source: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 4 | 12.50% |
| Image Captioning | 3 | 9.38% |
| Visual Question Answering (VQA) | 3 | 9.38% |
| Question Answering | 2 | 6.25% |
| NER | 2 | 6.25% |
| Image Generation | 1 | 3.13% |
| Text-to-Image Generation | 1 | 3.13% |
| Robot Manipulation | 1 | 3.13% |
| Multiple Choice Question Answering (MCQA) | 1 | 3.13% |