Efficient ViTs

27 papers with code • 3 benchmarks • 0 datasets

Increasing the efficiency of ViTs without modifying the architecture (e.g., key & query sparsification, token pruning & merging).

Most implemented papers

Training data-efficient image transformers & distillation through attention

facebookresearch/deit 23 Dec 2020

In this work, we produce a competitive convolution-free transformer by training on ImageNet only.

All Tokens Matter: Token Labeling for Training Better Vision Transformers

zihangJiang/TokenLabeling NeurIPS 2021

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
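
The sketch below shows one minimal form such an objective can take: the usual classification loss on the class token plus an auxiliary per-patch cross-entropy against dense soft labels. It assumes the per-token soft labels are already available; the zihangJiang/TokenLabeling pipeline generates them with a pretrained annotator and combines this with further training tricks.

```python
# Minimal sketch of a token-labeling style objective (not the official code).
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, token_logits, target, token_targets, beta=0.5):
    """cls_logits: (B, K); token_logits: (B, N, K); target: (B,) class indices;
    token_targets: (B, N, K) soft label distributions for each patch token."""
    cls_loss = F.cross_entropy(cls_logits, target)
    token_log_probs = F.log_softmax(token_logits, dim=-1)
    token_loss = -(token_targets * token_log_probs).sum(-1).mean()  # average over B and N
    return cls_loss + beta * token_loss
```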

Fast Vision Transformers with HiLo Attention

ziplab/litv2 26 May 2022

Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map.
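
As a rough illustration of that head split, here is a minimal PyTorch sketch (not the official ziplab/litv2 implementation), assuming feature maps whose height and width are divisible by the window size ws and a fixed fraction alpha of heads assigned to the low-frequency branch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiLoSketch(nn.Module):
    def __init__(self, dim, num_heads=8, ws=2, alpha=0.5):
        super().__init__()
        self.ws = ws
        self.l_heads = int(num_heads * alpha)        # low-frequency heads
        self.h_heads = num_heads - self.l_heads      # high-frequency heads
        head_dim = dim // num_heads
        self.h_dim, self.l_dim = self.h_heads * head_dim, self.l_heads * head_dim
        self.h_qkv = nn.Linear(dim, 3 * self.h_dim)
        self.l_q = nn.Linear(dim, self.l_dim)
        self.l_kv = nn.Linear(dim, 2 * self.l_dim)
        self.proj = nn.Linear(self.h_dim + self.l_dim, dim)

    def forward(self, x):                            # x: (B, H, W, C)
        B, H, W, C = x.shape
        # Hi-Fi branch: self-attention inside each ws x ws window (local detail).
        w = x.reshape(B, H // self.ws, self.ws, W // self.ws, self.ws, C)
        w = w.permute(0, 1, 3, 2, 4, 5).reshape(-1, self.ws * self.ws, C)
        q, k, v = self.h_qkv(w).chunk(3, dim=-1)
        hi = F.scaled_dot_product_attention(
            q.unflatten(-1, (self.h_heads, -1)).transpose(1, 2),
            k.unflatten(-1, (self.h_heads, -1)).transpose(1, 2),
            v.unflatten(-1, (self.h_heads, -1)).transpose(1, 2),
        ).transpose(1, 2).reshape(B, H // self.ws, W // self.ws, self.ws, self.ws, self.h_dim)
        hi = hi.permute(0, 1, 3, 2, 4, 5).reshape(B, H * W, self.h_dim)
        # Lo-Fi branch: every query attends to average-pooled (per-window) keys/values,
        # so global attention runs over H*W / ws^2 tokens instead of H*W.
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.ws).flatten(2).transpose(1, 2)
        q = self.l_q(x.reshape(B, H * W, C))
        k, v = self.l_kv(pooled).chunk(2, dim=-1)
        lo = F.scaled_dot_product_attention(
            q.unflatten(-1, (self.l_heads, -1)).transpose(1, 2),
            k.unflatten(-1, (self.l_heads, -1)).transpose(1, 2),
            v.unflatten(-1, (self.l_heads, -1)).transpose(1, 2),
        ).transpose(1, 2).reshape(B, H * W, self.l_dim)
        return self.proj(torch.cat([hi, lo], dim=-1)).reshape(B, H, W, C)

out = HiLoSketch(dim=64)(torch.randn(1, 8, 8, 64))   # -> (1, 8, 8, 64)
```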

Token Merging: Your ViT But Faster

facebookresearch/tome 17 Oct 2022

Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case.
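
The merging step behind these numbers is bipartite soft matching; below is a simplified sketch of that idea (the official facebookresearch/tome code additionally uses attention keys for similarity, tracks merged-token sizes, and protects the class token):

```python
import torch

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (B, N, C) token features; returns (B, N - r, C) after merging r tokens."""
    a, b = x[:, ::2], x[:, 1::2]                        # alternate tokens into two sets
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.transpose(-1, -2)                # cosine similarity, (B, Na, Nb)
    best_val, best_dst = scores.max(dim=-1)             # best match in B for each A token
    order = best_val.argsort(dim=-1, descending=True)   # most similar A tokens merge first
    src_idx, keep_idx = order[:, :r], order[:, r:]      # merge r tokens, keep the rest
    B, _, C = x.shape
    merged_b = b.clone()
    # Average each merged A token into its matched B token.
    dst = best_dst.gather(1, src_idx)                   # (B, r) destination indices in B
    merged_b.scatter_reduce_(1, dst.unsqueeze(-1).expand(-1, -1, C),
                             a.gather(1, src_idx.unsqueeze(-1).expand(-1, -1, C)),
                             reduce="mean", include_self=True)
    kept_a = a.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, merged_b], dim=1)
```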

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

gatech-eic/castling-vit CVPR 2023

Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost compared to convolutional neural networks (CNNs); one reason is that ViTs' attention measures global similarities and thus has quadratic complexity in the number of input tokens.
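
To see why the quadratic term can be avoided, here is a generic kernelized linear-attention sketch (not Castling-ViT's specific linear-angular kernel): reordering the matrix products makes the cost linear in the token count.

```python
# With a non-negative feature map phi, attention can be computed as
# phi(Q) @ (phi(K)^T V), which is O(N * d^2) instead of O(N^2 * d).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, heads, N, d). Cost is linear in the token count N."""
    q, k = F.elu(q) + 1, F.elu(k) + 1           # phi(.) = elu(.) + 1 keeps values positive
    kv = torch.einsum("bhnd,bhne->bhde", k, v)  # (B, heads, d, d): summed over all N tokens
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # per-query normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```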

Pruning Self-attentions into Convolutional Layers in Single Path

zhuang-group/spvit 23 Nov 2021

Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers.
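
A generic way to make such a discrete operation choice differentiable is a straight-through binary gate; the sketch below illustrates that idea and is not the exact zhuang-group/spvit formulation.

```python
import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))  # learnable gate parameter

    def forward(self):
        prob = torch.sigmoid(self.logit)
        hard = (prob > 0.5).float()
        # Straight-through estimator: the forward pass uses the hard 0/1 decision,
        # the backward pass flows gradients through the soft probability.
        return hard + prob - prob.detach()

# Hypothetical usage during search: blend two candidate operations with the gate, e.g.
# out = gate() * msa_output + (1 - gate()) * conv_output
```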

Scalable Vision Transformers with Hierarchical Pooling

MonashAI/HVT ICCV 2021

However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation.
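
The sketch below illustrates the basic idea of shortening the token sequence between stages with 1D pooling; the actual MonashAI/HVT blocks, pooling positions, and class-token handling differ.

```python
import torch
import torch.nn as nn

def pool_tokens(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """x: (B, N, C) patch tokens -> (B, ceil(N/stride), C), cutting later-stage cost."""
    return nn.functional.max_pool1d(x.transpose(1, 2), kernel_size=stride,
                                    stride=stride, ceil_mode=True).transpose(1, 2)

tokens = torch.randn(2, 196, 384)   # e.g. 14x14 patches from a 224x224 image
print(pool_tokens(tokens).shape)    # torch.Size([2, 98, 384])
```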

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

youweiliang/evit 16 Feb 2022

Second, while maintaining the same computational cost, our method enables ViTs to take more image tokens as input, drawn from higher-resolution images, to improve recognition accuracy.
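
The underlying token reorganization can be sketched as keeping the image tokens the class token attends to most and fusing the rest into a single token; the exact weighting and placement in youweiliang/evit differ in detail.

```python
import torch

def reorganize_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int):
    """tokens: (B, N, C) image tokens (no CLS); cls_attn: (B, N) CLS-to-token attention."""
    B, N, C = tokens.shape
    idx = cls_attn.argsort(dim=-1, descending=True)
    keep_idx, drop_idx = idx[:, :keep], idx[:, keep:]
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    # Fuse the inattentive tokens into one token, weighted by their CLS attention.
    w = cls_attn.gather(1, drop_idx).softmax(dim=-1).unsqueeze(-1)            # (B, N-keep, 1)
    fused = (tokens.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, C)) * w).sum(1, keepdim=True)
    return torch.cat([kept, fused], dim=1)                                    # (B, keep + 1, C)
```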

MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets

siyi-wind/mdvit 5 Jul 2023

Naively combining datasets from different domains can result in negative knowledge transfer (NKT), i.e., a decrease in model performance on some domains with non-negligible inter-domain heterogeneity.

PPT: Token Pruning and Pooling for Efficient Vision Transformers

mindspore-lab/models 3 Oct 2023

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks.