Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

8 Aug 2022  ·  Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang

Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring.
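
The abstract's claim that window attention cuts cost relative to full attention can be illustrated with a rough count of query-key pairs, and the "rotated varied-size" idea with a small geometric helper. The sketch below is illustrative only and is not the authors' implementation: the function names, the parameters (`scale_x`, `scale_y`, `dx`, `dy`, `angle`), and the 64 x 64 token grid are all assumptions made for this example. It also assumes that keys and values are sampled from a transformed window while each query still attends to a fixed number of tokens, so the pair count matches ordinary window attention.

```python
import math


def full_attention_pairs(h, w):
    """Query-key pairs when every token attends to every token."""
    n = h * w
    return n * n


def window_attention_pairs(h, w, win):
    """Query-key pairs when each token attends only within a win x win window.

    Assumes h and w are divisible by win (hypothetical simplification).
    """
    n = h * w
    return n * win * win


def rotated_window_corners(cx, cy, win, scale_x, scale_y, dx, dy, angle):
    """Corners of a window after scaling, shifting, and rotating it.

    (cx, cy) is the centre of the default window, win its side length in
    tokens, and angle is in radians. All parameter names are hypothetical.
    """
    half_w = 0.5 * win * scale_x
    half_h = 0.5 * win * scale_y
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    corners = []
    for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        x, y = sx * half_w, sy * half_h
        # Rotate the scaled corner about the window centre, then shift it.
        corners.append((cx + dx + x * cos_a - y * sin_a,
                        cy + dy + x * sin_a + y * cos_a))
    return corners


if __name__ == "__main__":
    h = w = 64   # assumed token grid, e.g. a 1024 x 1024 image with 16 x 16 patches
    win = 8      # assumed window size in tokens
    print(full_attention_pairs(h, w))         # 16777216
    print(window_attention_pairs(h, w, win))  # 262144
    # A window twice as wide as the default, rotated by 30 degrees.
    print(rotated_window_corners(4.0, 4.0, win, 2.0, 1.0, 0.0, 0.0, math.radians(30)))
```

Under these assumptions the windowed variant needs 64 times fewer query-key pairs than full attention on the same grid, and the saving grows with image size because the window size stays fixed.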

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Aerial Scene Classification | AID (20% as trainset) | ViTAE-B + RVSA | Accuracy | 97.03 | #1
Aerial Scene Classification | AID (20% as trainset) | ViT-B + RVSA | Accuracy | 96.92 | #2
Aerial Scene Classification | AID (50% as trainset) | ViT-B + RVSA | Accuracy | 98.44 | #2
Aerial Scene Classification | AID (50% as trainset) | ViTAE-B + RVSA | Accuracy | 98.50 | #1
Object Detection In Aerial Images | DIOR-R | ViT-B + RVSA-ORCN | mAP | 70.85 | #3
Object Detection In Aerial Images | DIOR-R | ViTAE-B + RVSA-ORCN | mAP | 71.05 | #2
Object Detection In Aerial Images | DOTA | ViT-B + RVSA-ORCN | mAP | 81.01% | #7
Object Detection In Aerial Images | DOTA | ViTAE-B + RVSA-ORCN | mAP | 81.24% | #6
Semantic Segmentation | iSAID | ViTAE-B + RVSA-UperNet | mIoU | 64.49 | #14
Semantic Segmentation | iSAID | ViT-B + RVSA-UperNet | mIoU | 63.85 | #17
Semantic Segmentation | ISPRS Potsdam | ViTAE-B + RVSA-UperNet | Overall Accuracy | 91.22 | #11
Semantic Segmentation | ISPRS Potsdam | ViT-B + RVSA-UperNet | Overall Accuracy | 90.77 | #15
Semantic Segmentation | LoveDA | ViT-B + RVSA-UperNet | Category mIoU | 51.95 | #8
Semantic Segmentation | LoveDA | ViTAE-B + RVSA-UperNet | Category mIoU | 52.44 | #6
Aerial Scene Classification | NWPU (10% as trainset) | ViT-B + RVSA | Accuracy | 93.79 | #5
Aerial Scene Classification | NWPU (10% as trainset) | ViTAE-B + RVSA | Accuracy | 93.93 | #2
Aerial Scene Classification | NWPU (20% as trainset) | ViTAE-B + RVSA | Accuracy | 95.69 | #1
Aerial Scene Classification | NWPU (20% as trainset) | ViT-B + RVSA | Accuracy | 95.49 | #3
Aerial Scene Classification | UCM (50% as trainset) | ViT-B + RVSA | Accuracy | 99.70 | #1
Aerial Scene Classification | UCM (50% as trainset) | ViTAE-B + RVSA | Accuracy | 99.56 | #2
