Multiscale Vision Transformer

Introduced by Fan et al. in Multiscale Vision Transformers

Multiscale Vision Transformer, or MViT, is a transformer architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and spatial resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features in which early layers operate at high spatial resolution to model simple low-level visual information, while deeper layers operate at spatially coarse resolution on complex, high-dimensional features.

Source: Multiscale Vision Transformers
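
The stage hierarchy is realized by attention blocks that spatially pool the query (and key/value) token sequences, so resolution shrinks while channel width grows from stage to stage. Below is a minimal sketch of this pooling-attention idea, assuming PyTorch; the single-head attention, the PoolingAttention name, the average-pooling operators, and the toy stage schedule (96 -> 192 -> 384 channels as the token grid shrinks from 32x32 to 8x8) are illustrative simplifications, not the paper's exact Multi Head Pooling Attention.

```python
# Minimal sketch of MViT-style pooling attention (assumed PyTorch; simplified
# single-head version, not the paper's exact Multi Head Pooling Attention).
import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    """Self-attention whose query/key/value tokens are spatially pooled,
    so the output sequence is shorter (lower resolution) than the input."""

    def __init__(self, dim_in, dim_out, q_stride=2, kv_stride=2):
        super().__init__()
        self.q = nn.Linear(dim_in, dim_out)
        self.k = nn.Linear(dim_in, dim_out)
        self.v = nn.Linear(dim_in, dim_out)
        self.proj = nn.Linear(dim_out, dim_out)
        # Spatial pooling via strided average pooling on the 2D token grid.
        self.pool_q = nn.AvgPool2d(q_stride, q_stride)
        self.pool_kv = nn.AvgPool2d(kv_stride, kv_stride)
        self.scale = dim_out ** -0.5

    def _pool(self, x, hw, pool):
        # x: (B, H*W, C) -> pool over the (H, W) grid -> (B, H'*W', C)
        B, N, C = x.shape
        H, W = hw
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = pool(x)
        H2, W2 = x.shape[-2:]
        return x.reshape(B, C, H2 * W2).transpose(1, 2), (H2, W2)

    def forward(self, x, hw):
        q, hw_q = self._pool(self.q(x), hw, self.pool_q)
        k, _ = self._pool(self.k(x), hw, self.pool_kv)
        v, _ = self._pool(self.v(x), hw, self.pool_kv)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        return self.proj(out), hw_q


if __name__ == "__main__":
    # Toy two-stage pyramid: 32x32 tokens / 96 channels -> 16x16 / 192 -> 8x8 / 384.
    x, hw = torch.randn(2, 32 * 32, 96), (32, 32)
    for dim_in, dim_out in [(96, 192), (192, 384)]:
        x, hw = PoolingAttention(dim_in, dim_out)(x, hw)
        print(hw, x.shape)  # resolution halves while channel width doubles
```

Each stage halves the token grid while doubling the channel dimension, which is the channel-resolution trade-off the description above refers to; the real model stacks many such blocks per stage and pools queries only at stage transitions.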

Tasks

Task                    Papers   Share
Action Recognition      3        13.04%
Video Recognition       3        13.04%
Benchmarking            2         8.70%
Action Classification   2         8.70%
Image Classification    2         8.70%
Object Detection        2         8.70%
Video Understanding     1         4.35%
EEG                     1         4.35%
Seizure prediction      1         4.35%

Categories

Vision Transformers