3D visual grounding
39 papers with code • 0 benchmarks • 2 datasets
Most implemented papers
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance
In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to capture view knowledge from both the text and 3D modalities.
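As a rough illustration of the multi-view idea (not the ViewRefer architecture), the sketch below fuses per-view object features with sentence tokens via cross-attention before scoring candidates; all class, parameter, and tensor names are hypothetical assumptions.

```python
# Hypothetical sketch of multi-view feature fusion for 3D grounding.
# Not the ViewRefer implementation; names and shapes are assumptions.
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    def __init__(self, dim=256, num_views=4):
        super().__init__()
        # learnable embedding that tags each rotated view
        self.view_embed = nn.Parameter(torch.zeros(num_views, 1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, obj_feats, text_feats):
        # obj_feats: (V, B, N, D) object features from V rotated views
        # text_feats: (B, T, D) token features of the referring sentence
        obj_feats = obj_feats + self.view_embed        # inject view identity
        fused = obj_feats.mean(dim=0)                  # (B, N, D) average over views
        # let candidate objects attend to the language tokens
        fused, _ = self.cross_attn(fused, text_feats, text_feats)
        return self.score_head(fused).squeeze(-1)      # (B, N) grounding logits
```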
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
3D visual grounding aims to find the object in a point cloud that is mentioned by a free-form natural language description with rich semantic cues.
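To make the text-decoupling and dense-alignment idea concrete, here is a minimal, assumed sketch: the description is split into semantic components via token masks, and each component is densely matched against candidate objects. This is a generic illustration, not the EDA objective; the component masks, temperature, and names are assumptions.

```python
# Hypothetical sketch of decoupling a description into semantic components and
# densely aligning each component with object features. Not the EDA code.
import torch
import torch.nn.functional as F

def dense_alignment_loss(token_feats, component_masks, obj_feats, target_idx):
    # token_feats: (B, T, D) language token features
    # component_masks: (B, C, T) float 0/1 masks selecting the tokens of each
    #                  component (e.g. main object, attributes, spatial relations)
    # obj_feats: (B, N, D) candidate object features
    # target_idx: (B,) index of the ground-truth object
    comp = component_masks @ token_feats                # (B, C, D) pooled components
    comp = comp / component_masks.sum(-1, keepdim=True).clamp(min=1)
    comp = F.normalize(comp, dim=-1)
    obj = F.normalize(obj_feats, dim=-1)
    sim = torch.einsum('bcd,bnd->bcn', comp, obj)       # (B, C, N) dense similarities
    logits = sim.mean(dim=1)                            # aggregate components -> (B, N)
    return F.cross_entropy(logits / 0.07, target_idx)   # contrastive-style objective
```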
Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
Grounding referring expressions in RGBD images is an emerging field.
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
Compared with visual grounding on 2D images, natural-language-guided 3D object localization on point clouds is more challenging.
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
3D visual grounding aims to ground a natural language description of a 3D scene, usually represented as a point cloud, to the targeted object region.
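A minimal sketch of what "2D-semantics-assisted training" can look like in general, assuming paired 2D and 3D object features are available at training time only; the loss form, weighting, and names are assumptions rather than the SAT recipe.

```python
# Hypothetical sketch of 2D-assisted training: an auxiliary loss pulls 3D object
# features toward 2D image features during training, and the 2D branch is dropped
# at inference. Names and the loss form are assumptions.
import torch
import torch.nn.functional as F

def train_step(obj3d_feats, obj2d_feats, grounding_logits, target_idx, alpha=0.5):
    # obj3d_feats, obj2d_feats: (B, N, D) matched 3D / 2D features per candidate object
    # grounding_logits: (B, N) scores from the 3D-only grounding head
    # target_idx: (B,) ground-truth object index
    ground_loss = F.cross_entropy(grounding_logits, target_idx)
    assist_loss = 1 - F.cosine_similarity(obj3d_feats, obj2d_feats, dim=-1).mean()
    return ground_loss + alpha * assist_loss  # the 2D branch is only needed here, not at test time
```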
Multi-View Transformer for 3D Visual Grounding
The multi-view space enables the network to learn a more robust multi-modal representation for 3D visual grounding and eliminates the dependence on specific views.
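The sketch below illustrates one generic way to build such a multi-view representation: rotate object positions into several views, encode each view, and average the outputs so the result does not depend on a single view. The encoder, shapes, and names are assumptions, not the released model.

```python
# Hypothetical sketch of the multi-view idea: rotate object positions into
# several views, encode each, and average so the result is view-agnostic.
import math
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    def __init__(self, dim=256, num_views=4):
        super().__init__()
        self.num_views = num_views
        self.pos_proj = nn.Linear(3, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obj_feats, obj_centers):
        # obj_feats: (B, N, D) appearance features, obj_centers: (B, N, 3)
        outs = []
        for v in range(self.num_views):
            theta = 2 * math.pi * v / self.num_views
            rot = obj_centers.new_tensor([[math.cos(theta), -math.sin(theta), 0.0],
                                          [math.sin(theta),  math.cos(theta), 0.0],
                                          [0.0, 0.0, 1.0]])
            centers_v = obj_centers @ rot.T            # rotate the scene around z
            x = obj_feats + self.pos_proj(centers_v)   # view-specific positional cue
            outs.append(self.encoder(x))
        return torch.stack(outs).mean(dim=0)           # (B, N, D) view-aggregated features
```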
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description.
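The following sketch shows the general flavor of language-guided progressive selection: points are repeatedly scored against the sentence embedding and only the top-k are kept. It is an assumed illustration, not the 3D-SPS implementation; the scoring head, keep ratios, and names are hypothetical.

```python
# Hypothetical sketch of language-guided progressive point selection.
import torch
import torch.nn as nn

class ProgressiveSelector(nn.Module):
    def __init__(self, dim=256, keep_ratios=(0.5, 0.25)):
        super().__init__()
        self.keep_ratios = keep_ratios
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, point_feats, sent_feat):
        # point_feats: (B, P, D) per-point features, sent_feat: (B, D) sentence embedding
        for ratio in self.keep_ratios:
            B, P, D = point_feats.shape
            cond = sent_feat.unsqueeze(1).expand(-1, P, -1)
            s = self.score(torch.cat([point_feats, cond], dim=-1)).squeeze(-1)  # (B, P)
            k = max(1, int(P * ratio))
            idx = s.topk(k, dim=1).indices                                      # keep top-k points
            point_feats = point_feats.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
        return point_feats   # progressively narrowed, language-relevant points
```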
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner.
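A generic sketch of multi-granularity alignment, assuming word-, phrase-, and sentence-level language features are each scored against object features and combined; the pooling choice and all names are assumptions, not the HAM model.

```python
# Hypothetical sketch of multi-granularity (word / phrase / sentence) alignment
# with candidate object features.
import torch
import torch.nn.functional as F

def hierarchical_alignment_scores(word_feats, phrase_feats, sent_feat, obj_feats):
    # word_feats: (B, T, D), phrase_feats: (B, P, D), sent_feat: (B, D), obj_feats: (B, N, D)
    obj = F.normalize(obj_feats, dim=-1)

    def align(lang):                                   # lang: (B, L, D)
        lang = F.normalize(lang, dim=-1)
        sim = torch.einsum('bld,bnd->bln', lang, obj)  # (B, L, N)
        return sim.max(dim=1).values                   # best-matching granule per object

    scores = align(word_feats) + align(phrase_feats) + align(sent_feat.unsqueeze(1))
    return scores                                      # (B, N) combined grounding scores
```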
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding
The main question we address in this paper is: "Can we consolidate the 3D visual stream with 2D clues synthesized from point clouds and efficiently utilize them in training and testing?"
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
Current approaches to 3D visual reasoning are task-specific and lack pre-training methods for learning generic representations that transfer across tasks.
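As a generic illustration of masked 3D-language pre-training of the "mutual masking" flavor, the sketch below masks a fraction of object and word tokens and reconstructs each from a joint encoding of both modalities; the masking ratio, heads, vocabulary size, and names are assumptions, not the paper's code.

```python
# Hypothetical sketch of mutual masked modeling for 3D-language pre-training.
import torch
import torch.nn as nn

class MutualMasking(nn.Module):
    def __init__(self, dim=256, vocab=30522, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.obj_head = nn.Linear(dim, dim)     # regress masked object features
        self.word_head = nn.Linear(dim, vocab)  # predict masked word ids

    def forward(self, obj_feats, word_feats):
        # obj_feats: (B, N, D) object tokens, word_feats: (B, T, D) word tokens
        obj_mask = torch.rand(obj_feats.shape[:2], device=obj_feats.device) < self.mask_ratio
        word_mask = torch.rand(word_feats.shape[:2], device=word_feats.device) < self.mask_ratio
        obj_in = torch.where(obj_mask.unsqueeze(-1), self.mask_token, obj_feats)
        word_in = torch.where(word_mask.unsqueeze(-1), self.mask_token, word_feats)
        joint = self.joint_encoder(torch.cat([obj_in, word_in], dim=1))
        obj_out, word_out = joint.split([obj_feats.size(1), word_feats.size(1)], dim=1)
        # each modality is reconstructed with help from the unmasked parts of the other
        return self.obj_head(obj_out), self.word_head(word_out), obj_mask, word_mask
```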