Efficient ViTs
27 papers with code • 3 benchmarks • 0 datasets
Increasing the efficiency of ViTs without modifying the architecture (e.g., key & query sparsification, token pruning & merging).
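As a rough illustration of the token-pruning idea named above, the hedged sketch below keeps only the patch tokens that receive the highest attention from the class token. The `keep_ratio` parameter and the use of CLS attention as the importance score are illustrative assumptions, not any specific paper's recipe.

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Illustrative token pruning: keep the patch tokens with the highest
    class-token attention and drop the rest.

    tokens:   (B, 1 + N, C) token sequence with the class token at index 0
    cls_attn: (B, N) attention weights from the class token to each patch
    """
    B, _, C = tokens.shape
    n_keep = max(1, int(cls_attn.shape[1] * keep_ratio))
    # indices of the most attended patch tokens (offset by 1 for the CLS token)
    idx = cls_attn.topk(n_keep, dim=1).indices + 1                 # (B, n_keep)
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([tokens[:, :1], kept], dim=1)                 # (B, 1 + n_keep, C)
```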
Most implemented papers
Training data-efficient image transformers & distillation through attention
In this work, we produce a competitive convolution-free transformer by training on ImageNet only.
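The title refers to distillation through attention via a dedicated distillation token. Below is a minimal, hedged sketch of one hard-label distillation loss in that spirit; the names `cls_logits` and `dist_logits` and the equal 0.5 weighting are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    # Sketch: the class token is supervised by the ground-truth labels,
    # while a separate distillation token is supervised by the teacher's
    # hard (argmax) predictions.
    teacher_labels = teacher_logits.argmax(dim=1)
    return 0.5 * F.cross_entropy(cls_logits, targets) \
         + 0.5 * F.cross_entropy(dist_logits, teacher_labels)
```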
All Tokens Matter: Token Labeling for Training Better Vision Transformers
In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
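A hedged sketch of the token-labeling idea: besides the usual classification loss on the class token, every patch token gets its own (soft) target and contributes an auxiliary per-token loss. The source of the per-patch targets and the 0.5 auxiliary weight below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, token_logits, target, token_targets, aux_weight=0.5):
    """cls_logits:    (B, num_classes) prediction from the class token
    token_logits:  (B, N, num_classes) per-patch predictions
    target:        (B,) image-level labels
    token_targets: (B, N, num_classes) soft per-patch targets (assumed to be
                   provided, e.g. by a pretrained annotator)
    """
    cls_loss = F.cross_entropy(cls_logits, target)
    # soft cross-entropy averaged over all patch tokens
    token_loss = -(token_targets * F.log_softmax(token_logits, dim=-1)).sum(-1).mean()
    return cls_loss + aux_weight * token_loss
```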
Fast Vision Transformers with HiLo Attention
Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups: one group encodes high frequencies via self-attention within each local window, while the other encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map.
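A simplified, hedged sketch of the head-splitting idea described above; the window size, head split ratio, and the absence of a class token are simplifying assumptions, and this is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiLoSketch(nn.Module):
    """Split heads into two groups: 'hi' heads run self-attention inside
    non-overlapping windows, 'lo' heads attend globally to window-averaged
    keys and values."""

    def __init__(self, dim, num_heads=8, window=2, alpha=0.5):
        super().__init__()
        self.lo_heads = int(num_heads * alpha)           # low-frequency heads
        self.hi_heads = num_heads - self.lo_heads        # high-frequency heads
        self.head_dim = dim // num_heads
        self.window = window
        hi_dim, lo_dim = self.hi_heads * self.head_dim, self.lo_heads * self.head_dim
        self.hi_qkv = nn.Linear(dim, 3 * hi_dim)
        self.lo_q = nn.Linear(dim, lo_dim)
        self.lo_kv = nn.Linear(dim, 2 * lo_dim)
        self.proj = nn.Linear(hi_dim + lo_dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                                # N == H * W (no class token here)
        w, hd = self.window, self.head_dim
        grid = x.view(B, H, W, C)                        # assumes H and W divisible by w

        # --- high-frequency path: self-attention within each w x w window ---
        win = grid.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, w * w, C)                  # (B * num_windows, w*w, C)
        qkv = self.hi_qkv(win).reshape(win.shape[0], w * w, 3, self.hi_heads, hd)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (B*nw, heads, w*w, hd)
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5
        hi = (attn.softmax(-1) @ v).transpose(1, 2).reshape(win.shape[0], w * w, -1)
        hi = hi.view(B, H // w, W // w, w, w, -1).permute(0, 1, 3, 2, 4, 5).reshape(B, N, -1)

        # --- low-frequency path: global attention to average-pooled keys/values ---
        pooled = F.avg_pool2d(grid.permute(0, 3, 1, 2), w).flatten(2).transpose(1, 2)
        q = self.lo_q(x).reshape(B, N, self.lo_heads, hd).transpose(1, 2)
        kv = self.lo_kv(pooled).reshape(B, -1, 2, self.lo_heads, hd).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5
        lo = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, -1)

        return self.proj(torch.cat([hi, lo], dim=-1))
```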
Token Merging: Your ViT But Faster
Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case.
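A hedged sketch of the general token-merging idea: split the tokens into two alternating sets, pair each token in one set with its most similar partner in the other, and average away the most redundant pairs. This is a generic illustration, not the paper's bipartite soft matching implementation; the alternating split, the plain pairwise mean, and the handling of duplicate partners are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def merge_r_tokens(x, r):
    """Illustrative token merging on a (B, N, C) sequence: remove r tokens
    per call by averaging them into their most similar partners."""
    a, b = x[:, 0::2], x[:, 1::2]                        # alternating split into sets A and B
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(1, 2)
    best_sim, best_idx = sim.max(dim=-1)                 # (B, Na): best partner in B
    merge_rank = best_sim.argsort(dim=-1, descending=True)
    src, keep = merge_rank[:, :r], merge_rank[:, r:]     # A-tokens to merge / to keep

    C = x.shape[-1]
    dst = torch.gather(best_idx, 1, src)                 # partners in B for merged A-tokens
    dst_idx = dst.unsqueeze(-1).expand(-1, -1, C)
    merged_a = torch.gather(a, 1, src.unsqueeze(-1).expand(-1, -1, C))
    # average each merged A-token into its B partner (if several A-tokens pick
    # the same partner, this simple sketch keeps only one of the writes)
    b = b.clone()
    b.scatter_(1, dst_idx, 0.5 * (torch.gather(b, 1, dst_idx) + merged_a))
    kept_a = torch.gather(a, 1, keep.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, b], dim=1)                 # (B, N - r, C)
```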
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost compared to convolutional neural networks (CNNs); one reason is that ViTs' attention measures global similarities and thus has quadratic complexity with respect to the number of input tokens.
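To make the complexity argument concrete, the hedged sketch below contrasts standard softmax attention, whose N x N score matrix makes the cost quadratic in the number of tokens, with a generic kernel-based linear attention that reassociates the matrix products to avoid forming that matrix. The elu+1 feature map is an illustrative choice and not the linear-angular attention proposed in the paper.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention on (B, H, N, d) tensors: the (N x N) score matrix
    # makes the cost O(N^2 * d).
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5        # (B, H, N, N)
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Generic kernel-based linear attention (illustrative feature map: elu + 1).
    # Reassociating (phi(q) phi(k)^T) v as phi(q) (phi(k)^T v) keeps the cost
    # at O(N * d^2), i.e. linear in the number of tokens N.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                   # (B, H, d, d)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)          # (B, H, N, 1)
    return (q @ kv) / (z + eps)
```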
Pruning Self-attentions into Convolutional Layers in Single Path
Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers.
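A hedged sketch of a learnable binary gate with a straight-through estimator, selecting between two candidate operations (e.g., an attention branch and a convolutional branch); the sigmoid parameterization and 0.5 threshold are illustrative choices, not the paper's exact gating scheme.

```python
import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    """Illustrative learnable 0/1 gate choosing between two candidate operations."""

    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))    # one gate per layer

    def forward(self, out_a, out_b):
        p = torch.sigmoid(self.logit)
        hard = (p > 0.5).float()
        # Straight-through estimator: the forward pass uses the hard decision,
        # while gradients flow through the soft probability p.
        g = hard + p - p.detach()
        return g * out_a + (1 - g) * out_b
```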
Scalable Vision Transformers with Hierarchical Pooling
However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation.
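A hedged sketch of the general idea of shortening the token sequence between transformer blocks with 1D pooling; the pooling type, stride, and placement between stages are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pool_patch_tokens(x, stride=2):
    """Shorten a (B, N, C) patch-token sequence by average pooling along N."""
    return F.avg_pool1d(x.transpose(1, 2), kernel_size=stride, stride=stride).transpose(1, 2)

# Illustrative usage between stages of a ViT encoder (the stage blocks are hypothetical):
# x = stage1_blocks(x)          # (B, 196, C)
# x = pool_patch_tokens(x)      # (B,  98, C) -- shorter, hierarchical sequence
# x = stage2_blocks(x)
```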
Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
Second, while maintaining the same computational cost, our method allows ViTs to take more image tokens as input, drawn from higher-resolution images, to improve recognition accuracy.
MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets
Naively combining datasets from different domains can result in negative knowledge transfer (NKT), i.e., a decrease in model performance on some domains with non-negligible inter-domain heterogeneity.
PPT: Token Pruning and Pooling for Efficient Vision Transformers
Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks.