Class-agnostic Object Detection with Multi-modal Transformer

What constitutes an object? This has been a long-standing question in computer vision. To answer it, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in the literature, we demonstrate that Multi-modal Vision Transformers (MViTs) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show that MViTs achieve state-of-the-art performance in localizing generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications, including open-world object detection, salient and camouflaged object detection, and supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY

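| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Notes |
|---|---|---|---|---|---|---|
| Class-agnostic Object Detection | COCO | MDef-DETR | AP50 | 43.64 | # 1 | COCO dataset is not included in training |
| Object Proposal Generation | COCO | MDef-DETR | Average Recall | 0.6503 | # 1 | Off-the-shelf evaluation |
| Open World Object Detection | COCO 2017 (Electronic, Indoor, Kitchen, Furniture) | ORE (MDef-DETR) | MAP | 31.66 | # 1 | |
| Open World Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | ORE (MDef-DETR) | A-OSE | 5212 | # 1 | |
| Open World Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | ORE (MDef-DETR) | WI | 0.0251 | # 2 | |
| Open World Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | ORE (MDef-DETR) | MAP | 46.19 | # 1 | |
| Open World Object Detection | COCO 2017 (Outdoor, Accessories, Appliance, Truck) | ORE (MDef-DETR) | Unknown Recall | 49.54 | # 1 | |
| Open World Object Detection | COCO 2017 (Sports, Food) | ORE (MDef-DETR) | WI | 0.0179 | # 2 | |
| Open World Object Detection | COCO 2017 (Sports, Food) | ORE (MDef-DETR) | A-OSE | 4117 | # 1 | |
| Open World Object Detection | COCO 2017 (Sports, Food) | ORE (MDef-DETR) | MAP | 36.75 | # 1 | |
| Open World Object Detection | COCO 2017 (Sports, Food) | ORE (MDef-DETR) | Unknown Recall | 50.89 | # 1 | |
| Class-agnostic Object Detection | Comic2k | MDef-DETR | AP50 | 57.72 | # 1 | Comic Dataset is not included in training |
| Object Proposal Generation | Comic2k | MDef-DETR | Average Recall | 0.8982 | # 1 | Off-the-shelf evaluation |
| Class-agnostic Object Detection | Kitchen Scenes | MDef-DETR | AP50 | 45.43 | # 1 | Kitchen Dataset is not included in training |
| Class-agnostic Object Detection | KITTI | MDef-DETR | AP50 | 48.22 | # 1 | KITTI Dataset is not included in training |
| Object Proposal Generation | KITTI | MDef-DETR | Average Recall | 0.6353 | # 1 | Off-the-shelf evaluation |
| Class-agnostic Object Detection | PASCAL VOC | MDef-DETR | AP50 | 68.59 | # 1 | VOC Dataset is not included in training |
| Object Detection | PASCAL VOC 10% | DETReg (MDef-DETR) | AP | 58.78 | # 1 | |
| Object Detection | PASCAL VOC 10% | DETReg (MDef-DETR) | AP50 | 80.46 | # 1 | |
| Object Detection | PASCAL VOC 10% | DETReg (MDef-DETR) | AP75 | 65.65 | # 1 | |
| Open World Object Detection | PASCAL VOC 2007 | ORE (MDef-DETR) | WI | 0.0474 | # 2 | |
| Open World Object Detection | PASCAL VOC 2007 | ORE (MDef-DETR) | A-OSE | 7322 | # 1 | |
| Open World Object Detection | PASCAL VOC 2007 | ORE (MDef-DETR) | MAP | 64.03 | # 1 | |
| Open World Object Detection | PASCAL VOC 2007 | ORE (MDef-DETR) | Unknown Recall | 50.13 | # 1 | |
| Object Detection | PASCAL VOC 2007 | DETReg (MDef-DETR) | MAP | 84.16% | # 2 | |
| Object Detection | PASCAL VOC 2007 | DETReg (MDef-DETR) | AP50 | 84.16 | # 1 | |
| Object Proposal Generation | PASCAL VOC 2012, 60 proposals per image | MDef-DETR | Average Recall | 0.9126 | # 1 | |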
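
The recall figures above measure how many ground-truth objects are covered by class-agnostic proposals, regardless of category label. The exact evaluation protocol is not given on this page, so the following is only a minimal sketch of one common way to compute such a recall: a ground-truth box counts as found if some proposal overlaps it with IoU at or above a threshold. The function names, the `(x1, y1, x2, y2)` box format, the 0.5 threshold, and the `top_k` cutoff are all assumptions made for illustration; this is not the authors' evaluation code.

```python
from typing import Optional, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_recall(
    proposals: Sequence[Box],
    gt_boxes: Sequence[Box],
    iou_thresh: float = 0.5,       # hypothetical default
    top_k: Optional[int] = None,   # e.g. keep only the first k proposals
) -> float:
    """Fraction of ground-truth boxes matched by at least one proposal.

    Category labels are ignored entirely: only box overlap matters.
    Proposals are assumed to be sorted by descending score already.
    """
    if top_k is not None:
        proposals = proposals[:top_k]
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for g in gt_boxes
        if any(iou(p, g) >= iou_thresh for p in proposals)
    )
    return hits / len(gt_boxes)


# Toy example with made-up boxes: one of two objects is covered.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(1, 1, 10, 10), (50, 50, 60, 60)]
recall = class_agnostic_recall(props, gt)
```

Limiting `top_k` mirrors settings such as "60 proposals per image" in the table, where only a fixed budget of the highest-scoring proposals is evaluated.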

Methods