MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. It uses a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets that have explicit alignment between phrases in the text and objects in the image. The network is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation.
Source: MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

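The early fusion described above can be illustrated with a toy example. This is a minimal sketch, not MDETR's implementation: it assumes that early fusion means concatenating image tokens and text tokens into one sequence before attention, and it uses a single-head self-attention with identity projections in pure Python. The names `fuse_early` and `self_attention`, the token counts, and the feature size `D` are hypothetical choices made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), which keeps the sketch dependency-free.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def fuse_early(image_tokens, text_tokens):
    """Concatenate both modalities into one sequence, then attend jointly.

    Every image token can attend to every text token and vice versa,
    so the image representation is conditioned on the text query.
    """
    joint = image_tokens + text_tokens
    return self_attention(joint)


rng = random.Random(0)
D = 4  # hypothetical shared feature size
# Stand-ins for a flattened image feature map and embedded text tokens.
image_tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]
text_tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(3)]

fused = fuse_early(image_tokens, text_tokens)
print(len(fused), len(fused[0]))  # 9 4
```

Because the two modalities share one attention pass, replacing the text tokens changes the output for the image tokens as well, which is the sense in which detection is conditioned on the query. A real model would add learned projections, positional encodings, multiple layers and a decoder on top.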
Task | Papers | Share |
---|---|---|
Visual Question Answering (VQA) | 4 | 12.12% |
Question Answering | 3 | 9.09% |
Visual Question Answering | 3 | 9.09% |
Referring Expression | 3 | 9.09% |
Visual Grounding | 3 | 9.09% |
Phrase Grounding | 2 | 6.06% |
Referring Expression Segmentation | 2 | 6.06% |
Language Modelling | 1 | 3.03% |
Decoder | 1 | 3.03% |