Efficient ViTs
27 papers with code • 3 benchmarks • 0 datasets
Increasing the efficiency of ViTs without modifying the architecture (e.g., key & query sparsification, token pruning & merging).
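As a rough illustration of the token-pruning idea named above, the hedged sketch below keeps only the patch tokens that receive the highest attention from the class token. The `keep_ratio` parameter and the use of CLS attention as the importance score are illustrative assumptions, not any specific paper's recipe.

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Illustrative token pruning: keep the patch tokens with the highest
    class-token attention and drop the rest.

    tokens:   (B, 1 + N, C) token sequence with the class token at index 0
    cls_attn: (B, N) attention weights from the class token to each patch
    """
    B, _, C = tokens.shape
    n_keep = max(1, int(cls_attn.shape[1] * keep_ratio))
    # indices of the most attended patch tokens (offset by 1 for the CLS token)
    idx = cls_attn.topk(n_keep, dim=1).indices + 1                 # (B, n_keep)
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([tokens[:, :1], kept], dim=1)                 # (B, 1 + n_keep, C)
```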
Most implemented papers
Training data-efficient image transformers & distillation through attention
In this work, we produce a competitive convolution-free transformer by training on ImageNet only.
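The title refers to distillation through attention via a dedicated distillation token. Below is a minimal, hedged sketch of one hard-label distillation loss in that spirit; the names `cls_logits` and `dist_logits` and the equal 0.5 weighting are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    # Sketch: the class token is supervised by the ground-truth labels,
    # while a separate distillation token is supervised by the teacher's
    # hard (argmax) predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    return 0.5 * F.cross_entropy(cls_logits, targets) \
         + 0.5 * F.cross_entropy(dist_logits, teacher_labels)
```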
All Tokens Matter: Token Labeling for Training Better Vision Transformers
In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
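A hedged sketch of the token-labeling idea: besides the usual classification loss on the class token, every patch token gets its own (soft) target and contributes an auxiliary per-token loss. The source of the per-patch targets and the 0.5 auxiliary weight below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, token_logits, target, token_targets, aux_weight=0.5):
    """cls_logits:    (B, num_classes) prediction from the class token
    token_logits:  (B, N, num_classes) per-patch predictions
    target:        (B,) image-level labels
    token_targets: (B, N, num_classes) soft per-patch targets (assumed to be
                   provided, e.g. by a pretrained annotator)
    """
    cls_loss = F.cross_entropy(cls_logits, target)
    # soft cross-entropy averaged over all patch tokens
    token_loss = -(token_targets * F.log_softmax(token_logits, dim=-1)).sum(-1).mean()
    return cls_loss + aux_weight * token_loss
```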
Fast Vision Transformers with HiLo Attention
Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups: one group encodes high frequencies via self-attention within each local window, while the other encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map.
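A simplified, hedged sketch of the head-splitting idea described above; the window size, head split ratio, and the absence of a class token are simplifying assumptions, and this is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiLoSketch(nn.Module):
    """Split heads into two groups: 'hi' heads run self-attention inside
    non-overlapping windows, 'lo' heads attend globally to window-averaged
    keys and values."""

    def __init__(self, dim, num_heads=8, window=2, alpha=0.5):
        super().__init__()
        self.lo_heads = int(num_heads * alpha)           # low-frequency heads
        self.hi_heads = num_heads - self.lo_heads        # high-frequency heads
        self.head_dim = dim // num_heads
        self.window = window
        hi_dim, lo_dim = self.hi_heads * self.head_dim, self.lo_heads * self.head_dim
        self.hi_qkv = nn.Linear(dim, 3 * hi_dim)
        self.lo_q = nn.Linear(dim, lo_dim)
        self.lo_kv = nn.Linear(dim, 2 * lo_dim)
        self.proj = nn.Linear(hi_dim + lo_dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                                # N == H * W (no class token here)
        w, hd = self.window, self.head_dim
        grid = x.view(B, H, W, C)                        # assumes H and W divisible by w

        # --- high-frequency path: self-attention within each w x w window ---
        win = grid.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, w * w, C)                  # (B * num_windows, w*w, C)
        qkv = self.hi_qkv(win).reshape(win.shape[0], w * w, 3, self.hi_heads, hd)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (B*nw, heads, w*w, hd)
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5
        hi = (attn.softmax(-1) @ v).transpose(1, 2).reshape(win.shape[0], w * w, -1)
        hi = hi.view(B, H // w, W // w, w, w, -1).permute(0, 1, 3, 2, 4, 5).reshape(B, N, -1)

        # --- low-frequency path: global attention to average-pooled keys/values ---
        pooled = F.avg_pool2d(grid.permute(0, 3, 1, 2), w).flatten(2).transpose(1, 2)
        q = self.lo_q(x).reshape(B, N, self.lo_heads, hd).transpose(1, 2)
        kv = self.lo_kv(pooled).reshape(B, -1, 2, self.lo_heads, hd).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5
        lo = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, -1)

        return self.proj(torch.cat([hi, lo], dim=-1))
```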
Token Merging: Your ViT But Faster
Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case.
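A hedged sketch of the general token-merging idea: split the tokens into two alternating sets, pair each token in one set with its most similar partner in the other, and average away the most redundant pairs. This is a generic illustration, not the paper's bipartite soft matching implementation; the alternating split, the plain pairwise mean, and the handling of duplicate partners are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def merge_r_tokens(x, r):
    """Illustrative token merging on a (B, N, C) sequence: remove r tokens
    per call by averaging them into their most similar partners."""
    a, b = x[:, 0::2], x[:, 1::2]                        # alternating split into sets A and B
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(1, 2)
    best_sim, best_idx = sim.max(dim=-1)                 # (B, Na): best partner in B
    merge_rank = best_sim.argsort(dim=-1, descending=True)
    src, keep = merge_rank[:, :r], merge_rank[:, r:]     # A-tokens to merge / to keep

    C = x.shape[-1]
    dst = torch.gather(best_idx, 1, src)                 # partners in B for merged A-tokens
    dst_idx = dst.unsqueeze(-1).expand(-1, -1, C)
    merged_a = torch.gather(a, 1, src.unsqueeze(-1).expand(-1, -1, C))
    # average each merged A-token into its B partner (if several A-tokens pick
    # the same partner, this simple sketch keeps only one of the writes)
    b = b.clone()
    b.scatter_(1, dst_idx, 0.5 * (torch.gather(b, 1, dst_idx) + merged_a))
    kept_a = torch.gather(a, 1, keep.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, b], dim=1)                 # (B, N - r, C)
```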
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost compared to convolutional neural networks (CNNs); one reason is that ViTs' attention measures global similarities and thus has quadratic complexity with respect to the number of input tokens.
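To make the complexity argument concrete, the hedged sketch below contrasts standard softmax attention, whose N x N score matrix makes the cost quadratic in the number of tokens, with a generic kernel-based linear attention that reassociates the matrix products to avoid forming that matrix. The elu+1 feature map is an illustrative choice and not the linear-angular attention proposed in the paper.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention on (B, H, N, d) tensors: the (N x N) score matrix
    # makes the cost O(N^2 * d).
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5        # (B, H, N, N)
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Generic kernel-based linear attention (illustrative feature map: elu + 1).
    # Reassociating (phi(q) phi(k)^T) v as phi(q) (phi(k)^T v) keeps the cost
    # at O(N * d^2), i.e. linear in the number of tokens N.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                   # (B, H, d, d)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)          # (B, H, N, 1)
    return (q @ kv) / (z + eps)
```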
Pruning Self-attentions into Convolutional Layers in Single Path
Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers.
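A hedged sketch of a learnable binary gate with a straight-through estimator, selecting between two candidate operations (e.g., an attention branch and a convolutional branch); the sigmoid parameterization and 0.5 threshold are illustrative choices, not the paper's exact gating scheme.

```python
import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    """Illustrative learnable 0/1 gate choosing between two candidate operations."""

    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))    # one gate per layer

    def forward(self, out_a, out_b):
        p = torch.sigmoid(self.logit)
        hard = (p > 0.5).float()
        # Straight-through estimator: the forward pass uses the hard decision,
        # while gradients flow through the soft probability p.
        g = hard + p - p.detach()
        return g * out_a + (1 - g) * out_b
```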
Scalable Vision Transformers with Hierarchical Pooling
However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation.
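A hedged sketch of the general idea of shortening the token sequence between transformer blocks with 1D pooling; the pooling type, stride, and placement between stages are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pool_patch_tokens(x, stride=2):
    """Shorten a (B, N, C) patch-token sequence by average pooling along N."""
    return F.avg_pool1d(x.transpose(1, 2), kernel_size=stride, stride=stride).transpose(1, 2)

# Illustrative usage between stages of a ViT encoder (the stage blocks are hypothetical):
# x = stage1_blocks(x)          # (B, 196, C)
# x = pool_patch_tokens(x)      # (B,  98, C) -- shorter, hierarchical sequence
# x = stage2_blocks(x)
```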
Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
Second, while maintaining the same computational cost, our method allows ViTs to take more image tokens as input, drawn from higher-resolution images, to improve recognition accuracy.
MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets
Naively combining datasets from different domains can result in negative knowledge transfer (NKT), i.e., a decrease in model performance on some domains with non-negligible inter-domain heterogeneity.
PPT: Token Pruning and Pooling for Efficient Vision Transformers
Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks.