Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of Vision Longformer, a variant of Longformer \cite{beltagy2020longformer} originally developed for natural language processing, which achieves linear complexity with respect to the number of input tokens.
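To make the second technique concrete, here is a minimal sketch of a 2-D sliding-window-plus-global attention pattern of the kind the abstract describes: a few global tokens attend everywhere, while each image token attends only to a fixed 2-D neighbourhood, so the total cost grows linearly with the number of image tokens. The function name, window size, and grid size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vil_attention_mask(h, w, window, num_global):
    """Boolean attention mask for a 2-D sliding-window + global pattern (sketch).

    Tokens are `num_global` global tokens followed by an h*w grid of
    image tokens. Each image token attends to grid neighbours inside a
    `window` x `window` neighbourhood; global tokens attend to, and are
    attended by, every token. Each image token touches at most
    window**2 + num_global keys, so total cost is linear in h*w.
    """
    n = num_global + h * w
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_global, :] = True   # global tokens attend everywhere
    mask[:, :num_global] = True   # every token attends to the global tokens
    half = window // 2
    for i in range(h):
        for j in range(w):
            q = num_global + i * w + j
            for ii in range(max(0, i - half), min(h, i + half + 1)):
                for jj in range(max(0, j - half), min(w, j + half + 1)):
                    mask[q, num_global + ii * w + jj] = True
    return mask

# e.g. a 14x14 token grid, a 3x3 local window, and one global (CLS-like) token
m = vil_attention_mask(14, 14, window=3, num_global=1)
print(m.shape, int(m.sum()), "of", m.size, "query-key pairs attended")
```

In practice such a mask would gate the scaled dot-product attention scores; the point of the sketch is only the sparsity pattern, which is why the attended-pair count printed above grows linearly rather than quadratically as the grid gets larger.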


Datasets

COCO (minival), ImageNet

Results from the Paper
Object Detection on COCO minival, RetinaNet (ViL-Base):
  box AP 44.3 (global rank #43), AP50 65.5 (#18), AP75 47.1 (#27), APS 28.9 (#10), APM 47.9 (#20), APL 58.3 (#26)

Object Detection on COCO minival, RetinaNet (ViL-Base, multi-scale, 3x):
  box AP 44.7 (#36), AP75 47.6 (#24), APS 29.9 (#8), APM 48.0 (#19), APL 58.1 (#27)

Instance Segmentation on COCO minival, Mask R-CNN (ViL-Base, 1x lr):
  mask AP 45.1 (#7), AP50 67.2 (#2), AP75 49.3 (#2), APL 44.2 (#9), APM 64.3 (#2), APS 41.0 (#2)

Instance Segmentation on COCO minival, Mask R-CNN (ViL-Base, multi-scale, 3x lr):
  mask AP 45.7 (#6), AP75 49.9 (#1), APL 44.5 (#8), APM 64.4 (#1), APS 41.3 (#1)

Image Classification on ImageNet:
  ViL-Medium-D: Top-1 accuracy 83.3% (#102), 39.7M parameters (#108)
  ViL-Medium-W: Top-1 accuracy 82.9% (#117), 39.8M parameters (#107)
  ViL-Small:    Top-1 accuracy 82.0% (#151), 24.6M parameters (#134)
  ViL-Base-W:   Top-1 accuracy 81.9% (#155), 79M parameters (#58)
  ViL-Base-D:   Top-1 accuracy 83.2% (#107), 55.7M parameters (#84)
  ViL-Tiny:     Top-1 accuracy 76.3% (#293), 6.7M parameters (#177)

Methods used in the Paper


Pooling Operations: Average Pooling, Global Average Pooling, Max Pooling
Convolutions: 1x1 Convolution, Convolution
Normalization: Batch Normalization, Layer Normalization
Skip Connection Blocks: Bottleneck Residual Block, Residual Block
Skip Connections: Residual Connection
Stochastic Optimization: Adam, AdamW
Learning Rate Schedules: Linear Warmup With Linear Decay
Activation Functions: ReLU, GELU
Regularization: Dropout, Label Smoothing, Attention Dropout, Weight Decay
Attention Patterns: Sliding Window Attention, Global and Sliding Window Attention, Dilated Sliding Window Attention (see the cost sketch after this list)
Attention Modules: Multi-Head Attention
Attention Mechanisms: Scaled Dot-Product Attention
Subword Segmentation: BPE, WordPiece
Image Models: Vision Transformer
Transformers: Transformer, Longformer
Initialization: Kaiming Initialization
Convolutional Neural Networks: ResNet
Output Functions: Softmax
Feedforward Networks: Dense Connections
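The attention-pattern entries above are what keep the multi-scale structure's cost manageable. As a rough, hedged illustration (the four stage resolutions and the 15x15 window below are assumptions for the sake of arithmetic, not the paper's exact configuration), the snippet contrasts the quadratic cost of full attention with the linear cost of a fixed local window at each pyramid stage:

```python
# Back-of-envelope comparison of full vs. windowed attention cost across
# a hypothetical 4-stage pyramid. Stage token grids and the 15x15 window
# are illustrative, not taken from the paper.
window = 15
for stage, (h, w) in enumerate([(56, 56), (28, 28), (14, 14), (7, 7)], 1):
    n = h * w
    full = n * n                  # quadratic: every token attends to all tokens
    local = n * window * window   # linear in n: fixed-size neighbourhood per token
    print(f"stage {stage}: {n:5d} tokens  full={full:>10,}  windowed={local:>9,}")
```

At the highest-resolution stage the windowed pattern touches far fewer query-key pairs than full attention, and the gap widens as the input image, and hence the token grid, grows.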