ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

27 Feb 2024 · Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built on an improved 3D encoder, ReCon++, which extends ReCon and benefits from multi-view image distillation for enhanced geometry understanding. With ReCon++ as the 3D point cloud input encoder for the LLM, ShapeLLM is trained on constructed instruction-following data and tested on a new human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/

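| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| 3D Question Answering (3D-QA) | 3D MM-Vet | ShapeLLM-13B | Overall Accuracy | 53.1 | #1 |
| 3D Question Answering (3D-QA) | 3D MM-Vet | ShapeLLM-7B | Overall Accuracy | 47.4 | #2 |
| Zero-Shot Transfer 3D Point Cloud Classification | ModelNet40 | ReCon++ | Accuracy (%) | 87.3 | #3 |
| 3D Point Cloud Linear Classification | ModelNet40 | ReCon++ | Overall Accuracy | 93.6 | #2 |
| Generative 3D Object Classification | ModelNet40 | ShapeLLM-7B | ModelNet40 (Average) | 53.08 | #2 |
| Generative 3D Object Classification | ModelNet40 | ShapeLLM-13B | ModelNet40 (Average) | 52.96 | #3 |
| 3D Point Cloud Classification | ModelNet40 | ReCon++ | Overall Accuracy | 95.0 | #3 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 10-way (10-shot) | ReCon++ | Overall Accuracy | 94.5 | #2 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 10-way (10-shot) | ReCon++ | Standard Deviation | 4.1 | #14 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 10-way (20-shot) | ReCon++ | Overall Accuracy | 96.5 | #1 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 10-way (20-shot) | ReCon++ | Standard Deviation | 3.0 | #11 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 5-way (10-shot) | ReCon++ | Overall Accuracy | 98.0 | #1 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 5-way (10-shot) | ReCon++ | Standard Deviation | 2.3 | #12 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 5-way (20-shot) | ReCon++ | Overall Accuracy | 99.5 | #1 |
| Few-Shot 3D Point Cloud Classification | ModelNet40 5-way (20-shot) | ReCon++ | Standard Deviation | 0.8 | #1 |
| Generative 3D Object Classification | Objaverse | ShapeLLM-7B | Objaverse (Average) | 54.50 | #2 |
| 3D Object Captioning | Objaverse | ShapeLLM-7B | GPT-4 | 46.92 | #4 |
| 3D Object Captioning | Objaverse | ShapeLLM-7B | Sentence-BERT | 48.20 | #3 |
| 3D Object Captioning | Objaverse | ShapeLLM-7B | SimCSE | 49.23 | #3 |
| Generative 3D Object Classification | Objaverse | ShapeLLM-13B | Objaverse (Average) | 54.00 | #3 |
| 3D Object Captioning | Objaverse | ShapeLLM-13B | GPT-4 | 48.94 | #2 |
| 3D Object Captioning | Objaverse | ShapeLLM-13B | Sentence-BERT | 48.52 | #2 |
| 3D Object Captioning | Objaverse | ShapeLLM-13B | SimCSE | 49.98 | #2 |
| Zero-shot 3D classification | Objaverse LVIS | ReCon++ | Top 1 Accuracy | 53.7 | #3 |
| Zero-Shot Transfer 3D Point Cloud Classification | ScanObjectNN | ReCon++ | OBJ_ONLY Accuracy (%) | 65.4 | #1 |
| 3D Point Cloud Classification | ScanObjectNN | ReCon++ | Overall Accuracy | 95.25 | #3 |
| 3D Point Cloud Classification | ScanObjectNN | ReCon++ | OBJ-BG (OA) | 97.59 | #1 |
| 3D Point Cloud Classification | ScanObjectNN | ReCon++ | OBJ-ONLY (OA) | 98.80 | #1 |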
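
The few-shot rows above pair an Overall Accuracy with a Standard Deviation. A minimal sketch of how such a pair could be computed is shown below, assuming accuracy is measured over several independent N-way K-shot trials and summarized by its mean and sample standard deviation. The page does not state the number of trials or which standard-deviation estimator was used, so both are assumptions here. The function name `summarize_trials` and the trial values are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev


def summarize_trials(accuracies):
    """Return (mean, sample standard deviation) of per-trial accuracies in percent.

    Assumption: the sample (n - 1) estimator is used; the source page does not
    say whether the reported deviation is the sample or population form.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two trials to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Hypothetical per-trial accuracies (%), made up for illustration only.
trials = [96.0, 98.0, 100.0, 98.0, 98.0]
avg, sd = summarize_trials(trials)
print(f"{avg:.1f} +/- {sd:.1f}")
```

Reporting the spread alongside the mean matters for few-shot settings because each trial samples a different small support set, so single-trial accuracy can vary noticeably.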
