MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes...

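TASK                                DATASET                  MODEL       METRIC NAME  METRIC VALUE  GLOBAL RANK
Visual Question Answering           CLEVR                    MDETR       Accuracy     99.7          # 2
Visual Question Answering           CLEVR-Humans             MDETR       Accuracy     81.7          # 1
Referring Expression Comprehension  CLEVR-Ref+               MDETR       Accuracy     100           # 1
Phrase Grounding                    Flickr30k Entities Test  MDETR-ENB5  R@1          84.3          # 1
Phrase Grounding                    Flickr30k Entities Test  MDETR-ENB5  R@10         95.8          # 1
Phrase Grounding                    Flickr30k Entities Test  MDETR-ENB5  R@5          93.9          # 1
Visual Question Answering           GQA test-std             MDETR-ENB5  Accuracy     62.45         # 2
Referring Expression Segmentation   PhraseCut                MDETR-ENB3  Mean IoU     53.7          # 1
Referring Expression Segmentation   PhraseCut                MDETR-ENB3  Pr@0.5       57.5          # 1
Referring Expression Segmentation   PhraseCut                MDETR-ENB3  Pr@0.7       39.9          # 1
Referring Expression Segmentation   PhraseCut                MDETR-ENB3  Pr@0.9       11.9          # 1
Referring Expression Comprehension  RefCoco                  MDETR-ENB3  Val          87.51         # 1
Referring Expression Comprehension  RefCoco                  MDETR-ENB3  Test A       90.4          # 1
Referring Expression Comprehension  RefCoco                  MDETR-ENB3  Test B       82.67         # 1
Referring Expression Comprehension  RefCoco+                 MDETR-ENB3  Val          81.13         # 1
Referring Expression Comprehension  RefCoco+                 MDETR-ENB3  Test A       85.52         # 1
Referring Expression Comprehension  RefCoco+                 MDETR-ENB3  Test B       72.96         # 1
Referring Expression Comprehension  RefCOCOg-test            MDETR-ENB3  Accuracy     83.31         # 1
Referring Expression Comprehension  RefCOCOg-val             MDETR-ENB3  Accuracy     83.35         # 1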
