Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

14 Apr 2023 · Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo

The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrain a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach. The code and models are available at https://github.com/microsoft/Swin3D.
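To make the "self-attention on sparse voxels with linear memory complexity" idea concrete, here is a minimal NumPy sketch of attention restricted to non-overlapping cubic windows. It is not the authors' implementation: the function name `window_attention`, the `window` parameter, and the use of the raw features as queries, keys, and values are all assumptions made for illustration, and shifted windows and the contextual relative positional embedding are omitted. Because each window holds at most `window**3` occupied voxels, the attention matrices in this sketch total at most `N * window**3` entries for `N` voxels, which is linear in `N`.

```python
import numpy as np

def window_attention(coords, feats, window=4):
    """Self-attention restricted to non-overlapping cubic windows (illustrative only).

    coords: (N, 3) integer coordinates of the occupied voxels.
    feats:  (N, C) features, used directly as queries, keys and values (an assumption).
    """
    coords = np.asarray(coords)
    feats = np.asarray(feats, dtype=np.float64)

    # Assign every occupied voxel to the window that contains it.
    win = np.floor_divide(coords, window)
    _, inverse = np.unique(win, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against NumPy versions that keep an extra axis

    # Collect the voxel indices of each window without scanning all voxels per window.
    order = np.argsort(inverse, kind="stable")
    counts = np.bincount(inverse)
    groups = np.split(order, np.cumsum(counts)[:-1])

    out = np.empty_like(feats)
    scale = 1.0 / np.sqrt(feats.shape[1])
    for idx in groups:
        x = feats[idx]
        logits = (x @ x.T) * scale
        logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ x
    return out
```

In this sketch a voxel never attends outside its own window, so a voxel that is alone in its window is returned unchanged.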


Results from the Paper


Ranked #2 on 3D Object Detection on S3DIS (using extra training data)

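Task                   Dataset      Model               Metric Name       Metric Value  Global Rank
3D Object Detection    S3DIS        Swin3D-L+FCAF3D     mAP@0.5           54.0          #2
3D Object Detection    S3DIS        Swin3D-L+FCAF3D     mAP@0.25          72.1          #3
Semantic Segmentation  S3DIS        Swin3D-L            Mean IoU          79.8          #3
Semantic Segmentation  S3DIS        Swin3D-L            mAcc              88.0          #1
Semantic Segmentation  S3DIS        Swin3D-L            oAcc              92.4          #3
Semantic Segmentation  S3DIS        Swin3D-L            Number of params  N/A           #1
Semantic Segmentation  S3DIS Area5  Swin3D-L            mIoU              74.5          #3
Semantic Segmentation  S3DIS Area5  Swin3D-L            oAcc              92.7          #1
Semantic Segmentation  S3DIS Area5  Swin3D-L            mAcc              80.5          #1
Semantic Segmentation  S3DIS Area5  Swin3D-L            Number of params  N/A           #2
Semantic Segmentation  ScanNet      Swin3D-L            test mIoU         77.9          #4
Semantic Segmentation  ScanNet      Swin3D-L            val mIoU          77.5          #2
3D Object Detection    ScanNetV2    Swin3D-L+CAGroup3D  mAP@0.25          76.4          #4
3D Object Detection    ScanNetV2    Swin3D-L+CAGroup3D  mAP@0.5           63.2          #4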
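
For readers unfamiliar with the segmentation metrics reported above, the following sketch shows how mIoU, mAcc, and oAcc are commonly computed from a confusion matrix. It uses the standard definitions and is not taken from the Swin3D evaluation code; the function name `segmentation_metrics` and the choice to skip classes with no ground-truth points are assumptions, and benchmarks may handle absent classes differently. The function returns fractions in [0, 1], whereas the results above are given as percentages.

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute mIoU, mAcc and oAcc from a confusion matrix.

    conf[i, j] counts points whose ground-truth class is i and predicted class is j.
    Classes with no ground-truth points are skipped (an assumption of this sketch).
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)                 # correctly labelled points per class
    gt = conf.sum(axis=1)              # ground-truth points per class
    pred = conf.sum(axis=0)            # predicted points per class
    union = gt + pred - tp
    present = gt > 0
    iou = tp[present] / union[present]
    acc = tp[present] / gt[present]
    return {
        "mIoU": float(iou.mean()),             # mean intersection-over-union over classes
        "mAcc": float(acc.mean()),             # mean per-class accuracy
        "oAcc": float(tp.sum() / conf.sum()),  # overall point accuracy
    }
```

For example, `segmentation_metrics([[3, 1], [1, 5]])` gives an overall accuracy of 0.8, since 8 of the 10 points lie on the diagonal.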

Methods