PointCLIP: Point Cloud Understanding by CLIP

Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) have shown inspiring performance on 2D visual recognition, which learns to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained on large-scale 2D image-text pairs, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot predictions to transfer knowledge from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned in 3D into CLIP pre-trained in 2D. By fine-tuning only this lightweight adapter in few-shot settings, the performance of PointCLIP improves considerably. In addition, we observe a complementary property between PointCLIP and classical 3D-supervised networks: with simple ensembling, PointCLIP boosts the baseline's performance and even surpasses state-of-the-art models. PointCLIP is therefore a promising alternative for effective 3D point cloud understanding via CLIP under a low-resource, low-data regime. We conduct thorough experiments on the widely adopted ModelNet10 and ModelNet40 and on the challenging ScanObjectNN to demonstrate the effectiveness of PointCLIP. The code is released at https://github.com/ZrrSkywalker/PointCLIP.
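The zero-shot pipeline described above (projection to multi-view depth maps, per-view CLIP encoding, and aggregation of the view-wise predictions) can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the view directions, grid resolution, prompt template ("point cloud depth map of a {class}"), equal view weights, and the helper function names are assumptions, and CLIP's image normalization is omitted for brevity.

```python
# Minimal sketch (not the official PointCLIP code) of the zero-shot pipeline:
# rasterize a point cloud into multi-view depth maps, encode each view with
# CLIP, and aggregate the view-wise logits against 3D category prompts.
import math

import torch
import clip  # https://github.com/openai/CLIP


def depth_maps_from_points(points, num_views=6, resolution=224):
    """Project a point cloud (N, 3), assumed normalized to [-1, 1], into
    `num_views` depth maps by rotating around the up axis and scattering
    per-point depth onto a pixel grid -- no rendering involved."""
    views = []
    for v in range(num_views):
        angle = 2.0 * math.pi * v / num_views                 # illustrative view directions
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        rot = torch.tensor([[cos_a, 0.0, sin_a],
                            [0.0,   1.0, 0.0],
                            [-sin_a, 0.0, cos_a]])
        p = points @ rot.T                                     # rotate into the view frame
        xy = ((p[:, :2] + 1.0) / 2.0 * (resolution - 1)).long().clamp(0, resolution - 1)
        depth = (p[:, 2] + 1.0) / 2.0                          # per-point depth in [0, 1]
        canvas = torch.zeros(resolution, resolution)
        canvas[xy[:, 1], xy[:, 0]] = depth                     # last write wins per pixel
        views.append(canvas.expand(3, -1, -1))                 # repeat channel for CLIP's RGB input
    return torch.stack(views)                                  # (V, 3, H, W)


@torch.no_grad()
def zero_shot_classify(points, class_names, device="cpu"):
    """Return class probabilities by averaging CLIP logits over the views."""
    model, _ = clip.load("ViT-B/32", device=device)
    images = depth_maps_from_points(points).to(device)         # CLIP normalization omitted for brevity
    prompts = clip.tokenize(
        [f"point cloud depth map of a {c}." for c in class_names]  # assumed prompt template
    ).to(device)
    image_feat = model.encode_image(images).float()             # (V, D) view-wise features
    text_feat = model.encode_text(prompts).float()              # (C, D) category text features
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    view_logits = 100.0 * image_feat @ text_feat.T              # (V, C) view-wise zero-shot logits
    return view_logits.mean(dim=0).softmax(dim=-1)              # equal-weight view aggregation
```

For the few-shot setting, the inter-view adapter mentioned in the abstract would sit between `encode_image` and the text-matching step, learning to fuse the V view features into a global representation while the CLIP encoders stay frozen.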

CVPR 2022
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Transfer 3D Point Cloud Classification | ModelNet10 | PointCLIP | Accuracy (%) | 30.23 | #4 |
| Zero-Shot Transfer 3D Point Cloud Classification | ModelNet40 | PointCLIP | Accuracy (%) | 20.18 | #16 |
| Training-free 3D Point Cloud Classification | ModelNet40 | PointCLIP | Accuracy (%) | 20.2 | #6 |
| Training-free 3D Point Cloud Classification | ModelNet40 | PointCLIP | Need 3D Data? | No | #1 |
| Zero-shot 3D Point Cloud Classification | ModelNet40 (Pretrained on ShapeNet) | PointCLIP | Accuracy (%) | 19.3 | #11 |
| Zero-shot 3D Point Cloud Classification | OmniObject3D (Pretrained on ShapeNet) | PointCLIP | Accuracy (%) | 0.3 | #11 |
| Zero-shot 3D Point Cloud Classification | ScanNetV2 | PointCLIP | Top-1 Accuracy (%) | 6.3 | #8 |
| Zero-shot 3D Point Cloud Classification | ScanNetV2 | PointCLIP w/ TP. | Top-1 Accuracy (%) | 26.1 | #5 |
| Training-free 3D Point Cloud Classification | ScanObjectNN | PointCLIP | Accuracy (%) | 15.4 | #5 |
| Training-free 3D Point Cloud Classification | ScanObjectNN | PointCLIP | Need 3D Data? | No | #1 |
| Zero-Shot Transfer 3D Point Cloud Classification | ScanObjectNN | PointCLIP | PB_T50_RS Accuracy (%) | 15.38 | #4 |
| Zero-Shot Transfer 3D Point Cloud Classification | ScanObjectNN | PointCLIP | OBJ_BG Accuracy (%) | 21.34 | #4 |
| Zero-Shot Transfer 3D Point Cloud Classification | ScanObjectNN | PointCLIP | OBJ_ONLY Accuracy (%) | 19.28 | #10 |
| Zero-shot 3D Point Cloud Classification | ScanObjectNN (Pretrained on ShapeNet) | PointCLIP | Accuracy (%) | 10.5 | #11 |
| Training-free 3D Part Segmentation | ShapeNet-Part | PointCLIP | mIoU | 31.0 | #3 |
| Training-free 3D Part Segmentation | ShapeNet-Part | PointCLIP | Need 3D Data? | No | #1 |
| 3D Open-Vocabulary Instance Segmentation | STPLS3D | PointCLIP | AP50 | 2.6 | #3 |