MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. It uses a transformer-based architecture to reason jointly over text and image, fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M image-text pairs mined from pre-existing multi-modal datasets that have explicit alignment between phrases in the text and objects in the image. It is then fine-tuned on several downstream tasks, such as phrase grounding, referring expression comprehension, and referring expression segmentation.
Source: MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
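The early-fusion design is easiest to see in code. Below is a minimal, illustrative PyTorch sketch, not the authors' implementation: image patch features and text token embeddings are concatenated into a single sequence, a transformer encoder reasons jointly over both modalities, and a DETR-style decoder with learned object queries predicts boxes along with box-to-token alignment scores. All module names, dimensions, and the stand-in convolutional backbone are assumptions chosen for brevity.

```python
# Minimal sketch of MDETR-style early fusion (illustrative, not the official code).
# Positional encodings and prediction losses are omitted for brevity.
import torch
import torch.nn as nn

class EarlyFusionDetector(nn.Module):
    def __init__(self, d_model=256, vocab_size=30522, num_queries=100):
        super().__init__()
        # Stand-in visual backbone: a single strided conv that turns the image
        # into a grid of patch tokens (MDETR uses a CNN backbone here).
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Stand-in text encoder: a plain embedding table over token ids.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Joint encoder over the concatenated image + text sequence (early fusion).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        # DETR-style decoder driven by learned object queries.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.queries = nn.Embedding(num_queries, d_model)
        self.box_head = nn.Linear(d_model, 4)          # (cx, cy, w, h) per query
        self.align_head = nn.Linear(d_model, d_model)  # box-to-token alignment

    def forward(self, images, token_ids):
        # images: (B, 3, H, W); token_ids: (B, L)
        vis = self.backbone(images).flatten(2).transpose(1, 2)  # (B, N, d)
        txt = self.text_embed(token_ids)                        # (B, L, d)
        fused = self.encoder(torch.cat([vis, txt], dim=1))      # early fusion
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, fused)                             # (B, Q, d)
        boxes = self.box_head(hs).sigmoid()                     # normalized boxes
        # Score each predicted box against the text tokens of the fused memory,
        # grounding every box in a span of the query.
        txt_mem = fused[:, vis.size(1):]
        align = self.align_head(hs) @ txt_mem.transpose(1, 2)   # (B, Q, L)
        return boxes, align

# Usage with dummy inputs:
model = EarlyFusionDetector()
boxes, align = model(torch.randn(2, 3, 224, 224),
                     torch.randint(0, 30522, (2, 12)))
```

In MDETR itself, the visual tokens come from a CNN backbone and the text embeddings from a pre-trained RoBERTa encoder, and the alignment scores correspond to its soft token prediction objective, which ties each detected box to the phrase in the query that describes it.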
| Task | Papers | Share |
|---|---|---|
| Referring Expression | 4 | 9.76% |
| Visual Question Answering (VQA) | 4 | 9.76% |
| Phrase Grounding | 3 | 7.32% |
| Question Answering | 3 | 7.32% |
| Visual Question Answering | 3 | 7.32% |
| Visual Grounding | 3 | 7.32% |
| Object Detection | 2 | 4.88% |
| Referring Expression Segmentation | 2 | 4.88% |
| Object | 2 | 4.88% |