Multiscale Attention ViT with Late fusion

Introduced by Maaz et al. in Class-agnostic Object Detection with Multi-modal Transformer

Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained on aligned image-text pairs, that performs targeted detection from human-understandable natural-language text queries. It uses multi-scale image features and deformable convolutions, and fuses the image and text modalities late in the network. The authors report that MAVL performs well as a class-agnostic object detector when queried with generic natural-language commands such as "all objects" or "all entities".
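To make the idea of late fusion over multi-scale features more concrete, here is a minimal toy sketch in plain Python. It is not the MAVL implementation: the function names, the two-dimensional toy vectors, and the use of a dot product to combine image tokens with the text embedding are all assumptions made for illustration. The only point it shows is the ordering, in which image features from several scales are gathered without looking at the text, and the text query is brought in only at the end.

```python
from typing import List, Sequence

Vector = List[float]


def flatten_scales(feature_maps: Sequence[Sequence[Vector]]) -> List[Vector]:
    """Concatenate the token lists of all scales into one token sequence."""
    return [list(token) for scale in feature_maps for token in scale]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_fusion_scores(image_tokens: Sequence[Vector],
                       text_embedding: Vector) -> List[float]:
    """Score every image token against the text embedding.

    Assumption: a dot product stands in for the fusion step. The two
    modalities meet only here, after the image tokens are already built.
    """
    return [dot(token, text_embedding) for token in image_tokens]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical toy inputs: two scales of image tokens and one text embedding.
coarse = [[1.0, 0.0], [0.0, 1.0]]
fine = [[0.5, 0.5], [2.0, 0.0], [0.0, 0.1]]
text_embedding = [1.0, 0.0]  # stands in for an encoded query such as "all objects"

tokens = flatten_scales([coarse, fine])
scores = late_fusion_scores(tokens, text_embedding)
best = top_k(scores, 2)
print(best)  # indices of the two tokens that best match the query
```

In a real detector the tokens would come from a vision backbone, the text embedding from a language encoder, and the selected tokens would be decoded into bounding boxes. All of that is left out of this sketch.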

Source: Class-agnostic Object Detection with Multi-modal Transformer

Tasks

Task	Papers	Share
Open Vocabulary Attribute Detection	1	14.29%
Open Vocabulary Object Detection	1	14.29%
Zero-Shot Object Detection	1	14.29%
Class-agnostic Object Detection	1	14.29%
Object Detection	1	14.29%
Object Proposal Generation	1	14.29%
Open World Object Detection	1	14.29%

Categories

Multi-Modal Methods