Multiscale Vision Transformer, or MViT, is a transformer architecture for modeling visual data such as images and videos. Unlike conventional vision transformers, which maintain a constant channel capacity and spatial resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features: early layers operate at high spatial resolution to model simple low-level visual information, while deeper layers operate at spatially coarse resolution to model complex, high-dimensional features.
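The stage schedule described above can be sketched as a simple computation. The concrete numbers below (a 56×56 token grid, 96 base channels, four stages) are illustrative defaults, not a definitive specification of any particular MViT variant:

```python
def mvit_stage_shapes(input_res=56, base_channels=96, num_stages=4):
    """Sketch of a multiscale stage schedule: each stage doubles the
    channel capacity while halving the spatial resolution, producing
    a feature pyramid from fine/low-dimensional to coarse/high-dimensional.
    The default values are illustrative assumptions."""
    shapes = []
    res, ch = input_res, base_channels
    for _ in range(num_stages):
        shapes.append((res, res, ch))  # (height, width, channels) at this stage
        res //= 2   # spatial resolution shrinks
        ch *= 2     # channel capacity grows
    return shapes

# Walk the pyramid: fine-grained, few channels -> coarse, many channels.
print(mvit_stage_shapes())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

In the real architecture the resolution reduction is performed inside attention (by pooling the query/key/value tensors), but the resulting per-stage feature shapes follow this kind of geometric schedule.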
Source: Multiscale Vision Transformers
| Task | Papers | Share |
|---|---|---|
| Action Recognition | 3 | 13.04% |
| Video Recognition | 3 | 13.04% |
| Benchmarking | 2 | 8.70% |
| Action Classification | 2 | 8.70% |
| Image Classification | 2 | 8.70% |
| Object Detection | 2 | 8.70% |
| Video Understanding | 1 | 4.35% |
| EEG | 1 | 4.35% |
| Seizure prediction | 1 | 4.35% |