Multiscale Vision Transformer, or MViT, is a transformer architecture for modeling visual data such as images and videos. Unlike conventional vision transformers, which maintain a constant channel capacity and spatial resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features: early layers operate at high spatial resolution to model simple low-level visual information, while deeper layers operate at spatially coarse resolution to model complex, high-dimensional features.
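The stage schedule described above can be sketched as a simple computation. The concrete numbers below (a 56×56 token grid, 96 base channels, four stages) are illustrative defaults, not a definitive specification of any particular MViT variant:

```python
def mvit_stage_shapes(input_res=56, base_channels=96, num_stages=4):
    """Sketch of a multiscale stage schedule: each stage doubles the
    channel capacity while halving the spatial resolution, producing
    a feature pyramid from fine/low-dimensional to coarse/high-dimensional.
    The default values are illustrative assumptions."""
    shapes = []
    res, ch = input_res, base_channels
    for _ in range(num_stages):
        shapes.append((res, res, ch))  # (height, width, channels) at this stage
        res //= 2   # spatial resolution shrinks
        ch *= 2     # channel capacity grows
    return shapes

# Walk the pyramid: fine-grained, few channels -> coarse, many channels.
print(mvit_stage_shapes())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

In the real architecture the resolution reduction is performed inside attention (by pooling the query/key/value tensors), but the resulting per-stage feature shapes follow this kind of geometric schedule.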
Source: Multiscale Vision Transformers
| Task | Papers | Share |
|---|---|---|
| Action Recognition | 3 | 13.04% |
| Video Recognition | 3 | 13.04% |
| Benchmarking | 2 | 8.70% |
| Action Classification | 2 | 8.70% |
| Image Classification | 2 | 8.70% |
| Object Detection | 2 | 8.70% |
| Video Understanding | 1 | 4.35% |
| EEG | 1 | 4.35% |
| Seizure prediction | 1 | 4.35% |