Recent studies on deep convolutional neural networks, such as EfficientNet and RegNet, present a simple paradigm of architecture design: models with more MACs typically achieve better accuracy.
In this paper, we point out that the attention inside local patches is also essential for building high-performance visual transformers, and we explore a new architecture, namely, Transformer iN Transformer (TNT).
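A minimal sketch of this two-level design, assuming PyTorch; the layer sizes and the fusing projection below are illustrative, not the paper's exact configuration. An inner transformer attends among the pixel embeddings inside each patch, and its flattened output is projected and added into the outer, patch-level transformer:

```python
import torch
import torch.nn as nn

class TNTBlock(nn.Module):
    def __init__(self, inner_dim=24, outer_dim=384, n_pixels=16, heads=4):
        super().__init__()
        # Inner transformer: attention among pixel embeddings within a patch.
        self.inner = nn.TransformerEncoderLayer(inner_dim, heads, batch_first=True)
        # Outer transformer: attention among patch embeddings.
        self.outer = nn.TransformerEncoderLayer(outer_dim, heads, batch_first=True)
        # Projects the flattened inner stream into the outer embedding space.
        self.proj = nn.Linear(n_pixels * inner_dim, outer_dim)

    def forward(self, pixel_emb, patch_emb):
        # pixel_emb: (batch * n_patches, n_pixels, inner_dim)
        # patch_emb: (batch, n_patches, outer_dim)
        pixel_emb = self.inner(pixel_emb)
        b, n, _ = patch_emb.shape
        patch_emb = self.outer(patch_emb + self.proj(pixel_emb.reshape(b, n, -1)))
        return pixel_emb, patch_emb
```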
Based on the observation that many features in SISR models are similar to each other, we propose to use the shift operation to generate redundant features (i.e., Ghost features).
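A hedged sketch of the idea, assuming PyTorch; the 50/50 channel split and the (1, 1) shift offset are illustrative choices, not the paper's tuned values. Half of the output channels come from a regular convolution, and the other half are shifted copies that cost no multiply-accumulate operations:

```python
import torch
import torch.nn as nn

class ShiftGhostConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        assert out_ch % 2 == 0
        # Only the intrinsic half of the features uses multiplications.
        self.conv = nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, padding=1)

    def forward(self, x):
        intrinsic = self.conv(x)
        # Ghost features: spatially shifted copies of the intrinsic ones.
        ghost = torch.roll(intrinsic, shifts=(1, 1), dims=(2, 3))
        return torch.cat([intrinsic, ghost], dim=1)
```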
The Transformer, first applied in the field of natural language processing, is a type of deep neural network based mainly on the self-attention mechanism.
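For reference, a minimal sketch of the scaled dot-product self-attention this refers to, written in plain PyTorch with single-head projections for illustration:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, dim); w_q/w_k/w_v: (dim, dim) projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each position attends to every position, weighted by similarity.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v
```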
Neural architecture search (NAS) has attracted increasing attention in both academia and industry.
AutoAugment has been a powerful algorithm that improves the accuracy of many vision tasks, yet it is sensitive to the operator space as well as hyper-parameters, and an improper setting may degrade network optimization.
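To make that sensitivity concrete, here is a hedged sketch of how an AutoAugment-style policy is applied, assuming torchvision; the operator space (`OPS`) and the probability/magnitude settings are illustrative, and they are exactly the knobs an improper setting can get wrong:

```python
import random
import torchvision.transforms.functional as TF

# Each sub-policy is a sequence of (operation, probability, magnitude) triples.
POLICY = [
    [("rotate", 0.7, 15.0), ("contrast", 0.3, 1.5)],
    [("rotate", 0.4, 30.0), ("brightness", 0.6, 1.2)],
]

OPS = {
    "rotate": lambda img, m: TF.rotate(img, m),
    "contrast": lambda img, m: TF.adjust_contrast(img, m),
    "brightness": lambda img, m: TF.adjust_brightness(img, m),
}

def apply_policy(img):
    # Sample one sub-policy and apply its operations stochastically.
    for name, prob, magnitude in random.choice(POLICY):
        if random.random() < prob:
            img = OPS[name](img, magnitude)
    return img
```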