Point Transformer V3: Simpler, Faster, Stronger

This paper does not seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency in point cloud processing by leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is influenced more by scale than by intricate design. We therefore present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of mechanisms that have only a minor effect on overall performance after scaling; for example, it replaces the precise KNN neighbor search with an efficient serialized neighbor mapping of point clouds organized in specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
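
To make the idea of a serialized neighbor mapping concrete, here is a minimal sketch in plain Python. It assumes a Z-order (Morton) curve as the ordering pattern, which the abstract does not specify, and it is not the authors' implementation. The function names (`morton_code`, `serialize`, `patch_neighbors`) and the parameters (`grid_size`, `bits`, `patch_size`) are hypothetical and chosen only for illustration. Points are quantized to a grid, sorted by their curve code, and cut into fixed-size patches, so each point's "neighbors" are the points sharing its patch rather than the result of an exact KNN query.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three grid indices (Z-order code)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def serialize(points, grid_size=0.05, bits=10):
    """Return point indices sorted along the (assumed) Z-order curve."""
    mins = [min(p[d] for p in points) for d in range(3)]
    max_cell = (1 << bits) - 1
    codes = []
    for p in points:
        # Quantize each coordinate to a grid cell, clamped to the code range.
        cell = [min(int((p[d] - mins[d]) / grid_size), max_cell) for d in range(3)]
        codes.append(morton_code(cell[0], cell[1], cell[2], bits))
    # sorted() is stable, so points in the same cell keep their input order.
    return sorted(range(len(points)), key=lambda i: codes[i])


def patch_neighbors(order, patch_size):
    """Cut the serialized order into fixed-size patches (approximate neighborhoods)."""
    return [order[i:i + patch_size] for i in range(0, len(order), patch_size)]


# Toy example: three pairs of nearby points, given in shuffled order.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0.1, 0, 0), (1, 0.1, 0)]
order = serialize(points, grid_size=0.5)
patches = patch_neighbors(order, patch_size=2)
print(order)
print(patches)
```

The trade-off illustrated here is the one the abstract names: sorting plus slicing is cheap and scales to large patches, but points that are close in space can occasionally land in different patches, so the neighborhoods are approximate rather than exact.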


Results from the Paper


 Ranked #1 on Semantic Segmentation on S3DIS (using extra training data)

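Task                         Dataset        Model       Metric Name       Metric Value  Global Rank
LIDAR Semantic Segmentation  nuScenes       PTv3 + PPT  test mIoU         0.830         #1
LIDAR Semantic Segmentation  nuScenes       PTv3 + PPT  val mIoU          0.812         #1
Semantic Segmentation        S3DIS          PTv3 + PPT  Mean IoU          80.8          #1
Semantic Segmentation        S3DIS          PTv3 + PPT  mAcc              87.7          #2
Semantic Segmentation        S3DIS          PTv3 + PPT  oAcc              92.6          #1
Semantic Segmentation        S3DIS          PTv3 + PPT  Number of params  24.1M         #49
Semantic Segmentation        S3DIS Area5    PTv3 + PPT  mIoU              74.7          #2
Semantic Segmentation        S3DIS Area5    PTv3 + PPT  oAcc              92.0          #5
Semantic Segmentation        S3DIS Area5    PTv3 + PPT  mAcc              80.1          #2
Semantic Segmentation        ScanNet        PTv3 + PPT  test mIoU         79.4          #1
Semantic Segmentation        ScanNet        PTv3 + PPT  val mIoU          78.6          #1
3D Semantic Segmentation     ScanNet200     PTv3 + PPT  val mIoU          36.0          #1
3D Semantic Segmentation     ScanNet200     PTv3 + PPT  test mIoU         39.3          #1
3D Semantic Segmentation     SemanticKITTI  PTv3 + PPT  test mIoU         75.5%         #1
3D Semantic Segmentation     SemanticKITTI  PTv3 + PPT  val mIoU          72.3%         #2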

Methods