SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
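To make the idea of a reference-frame-free semantic orientation concrete, the sketch below shows one minimal way such an annotation could be consumed downstream. It is an illustration, not the paper's implementation: the `SemanticOrientations` class, its phrase-to-vector mapping, and the Rodrigues-based alignment are all assumptions. It maps a natural-language direction (e.g., "plug-in") to a unit vector in the object's local frame and computes a rotation that turns that semantic direction toward a desired world direction, which is the kind of orientational constraint the abstract describes.

```python
import math
from dataclasses import dataclass, field

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rotation_aligning(src, dst):
    """Rodrigues-formula rotation matrix sending unit vector src onto unit vector dst."""
    v = cross(src, dst)
    c = dot(src, dst)
    if c < -0.9999:
        # Nearly opposite vectors: rotate 180 degrees about any axis orthogonal to src.
        helper = (1.0, 0.0, 0.0) if abs(src[0]) < 0.9 else (0.0, 1.0, 0.0)
        a = normalize(cross(src, helper))
        return [[2 * a[i] * a[j] - (1.0 if i == j else 0.0) for j in range(3)]
                for i in range(3)]
    k = 1.0 / (1.0 + c)
    # Skew-symmetric cross-product matrix [v]x.
    vx = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    # R = I + [v]x + k * [v]x^2
    return [[(1.0 if i == j else 0.0) + vx[i][j]
             + k * sum(vx[i][m] * vx[m][j] for m in range(3))
             for j in range(3)] for i in range(3)]

@dataclass
class SemanticOrientations:
    """Hypothetical container: natural-language phrases -> unit vectors in the object frame."""
    directions: dict = field(default_factory=dict)

    def constraint(self, phrase, world_dir):
        """Rotation that points the named semantic direction along world_dir."""
        return rotation_aligning(normalize(self.directions[phrase]),
                                 normalize(world_dir))

# Example: a USB stick whose "plug-in" direction is local +z; point it along world +x.
usb = SemanticOrientations({"plug-in": (0.0, 0.0, 1.0)})
R = usb.constraint("plug-in", (1.0, 0.0, 0.0))
```

Because the semantic direction is stored per object and named in language, no shared canonical frame is needed; the same query ("handle", "plug-in") works across object categories.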


Results from the Paper


Task                  Dataset                   Model   Metric                                  Value   Global Rank
Spatial Reasoning     6-DoF SpatialBench        SoFar   Total                                   43.9    # 1
                                                        Position-rel                            59.6    # 1
                                                        Position-abs                            33.8    # 1
                                                        Orientation-rel                         54.6    # 1
                                                        Orientation-abs                         31.3    # 1
Spatial Reasoning     EmbSpatial-Bench          SoFar   Generation                              70.88   # 1
Object Rearrangement  Open6DOR V2               SoFar   pos-level0                              96.0    # 1
                                                        6-DoF                                   48.7    # 1
                                                        pos-level1                              81.5    # 1
                                                        rot-level0                              68.6    # 1
                                                        rot-level1                              42.2    # 1
                                                        rot-level2                              70.1    # 1
Robot Manipulation    SimplerEnv-Google Robot   SoFar   Visual Matching-Pick Coke Can           0.923   # 1
                                                        Visual Matching-Move Near               0.917   # 1
                                                        Visual Matching                         0.749   # 1
                                                        Visual Matching-Open/Close Drawer       0.403   # 4
                                                        Variant Aggregation                     0.676   # 2
                                                        Variant Aggregation-Pick Coke Can       0.907   # 1
                                                        Variant Aggregation-Move Near           0.740   # 2
                                                        Variant Aggregation-Open/Close Drawer   0.297   # 5
Robot Manipulation    SimplerEnv-WidowX         SoFar   Average                                 0.583   # 1
                                                        Put Spoon on Towel                      0.583   # 1
                                                        Put Carrot on Plate                     0.667   # 1
                                                        Stack Green Block on Yellow Block       0.708   # 1
                                                        Put Eggplant in Yellow Basket           0.375   # 1

Methods


No methods listed for this paper.