Natural Language Visual Grounding
16 papers with code • 0 benchmarks • 6 datasets
Latest papers
Localizing Moments in Long Video Via Multimodal Guidance
In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows.
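The core idea can be sketched as follows: score each candidate window with a guidance model, prune windows unlikely to contain describable content, and run the expensive grounding model only on the survivors. This is a minimal sketch; `describability_score` and `ground_in_window` are hypothetical stand-ins for learned models, not the paper's components.

```python
# Minimal sketch of guidance-based pruning for long-video moment
# localization. `describability_score` and `ground_in_window` are
# hypothetical stand-ins for learned models (assumptions, not the paper).

def localize(query, windows, describability_score, ground_in_window,
             threshold=0.5):
    """Prune windows whose describability score falls below the threshold,
    then ground the query only in the surviving windows."""
    kept = [w for w in windows if describability_score(w) >= threshold]
    if not kept:
        return None
    # ground_in_window is assumed to return (confidence, (start_s, end_s)).
    return max((ground_in_window(query, w) for w in kept),
               key=lambda result: result[0])

# Toy usage with constant stand-in models:
if __name__ == "__main__":
    windows = [(0, 30), (30, 60), (60, 90)]
    print(localize("person opens the door", windows,
                   describability_score=lambda w: 0.9 if w[0] >= 30 else 0.1,
                   ground_in_window=lambda q, w: (0.8, w)))
```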
Belief Revision based Caption Re-ranker with Visual Semantic Information
In this work, we focus on improving the captions generated by image-caption generation systems.
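Re-ranking of this kind can be illustrated by combining the caption generator's own score with a visual-semantic similarity score from an image-text model. The linear mixture below is an illustrative assumption, not the paper's belief-revision formula, and `visual_similarity` is a hypothetical scoring function.

```python
# Illustrative caption re-ranking: mix the generator's score with a
# visual-semantic similarity term. Both scores are assumed to be
# normalized to [0, 1]. This mixture is an assumption for illustration,
# not the paper's belief-revision rule.

def rerank(candidates, visual_similarity, alpha=0.5):
    """candidates: list of (caption, generator_score) pairs.
    visual_similarity(caption) -> score in [0, 1] from an image-text model."""
    rescored = [
        (caption, alpha * gen_score + (1 - alpha) * visual_similarity(caption))
        for caption, gen_score in candidates
    ]
    # Best caption first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```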
TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
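A spatio-temporal tube is a temporal span plus one bounding box per frame inside that span; temporal IoU over the spans is a standard primitive when evaluating such predictions. The sketch below is a simplified data structure and metric, not TubeDETR's code.

```python
# A "tube" = a temporal span plus per-frame boxes. Temporal IoU over the
# spans is a standard evaluation primitive (simplified sketch, not
# TubeDETR's implementation).

from dataclasses import dataclass
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Tube:
    start: int             # first frame index of the span
    end: int               # last frame index (inclusive)
    boxes: Dict[int, Box]  # frame index -> bounding box

def temporal_iou(a: Tube, b: Tube) -> float:
    """Intersection-over-union of the two temporal spans, in frames."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union if union else 0.0
```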
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that the benchmark leaves significant room for innovative agents that learn to relate human language to their world models.
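In its simplest form, a language-conditioned policy maps the current observation together with an instruction embedding to an action. The schematic interface below is an illustrative assumption throughout, not the CALVIN baseline.

```python
# Schematic language-conditioned policy: the action depends on both the
# observation and an embedding of the instruction. Everything here is an
# illustrative assumption, not the CALVIN baseline model.

import numpy as np

class LanguageConditionedPolicy:
    def __init__(self, obs_dim, text_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A single linear layer over [observation; instruction] for brevity.
        self.w = rng.normal(size=(obs_dim + text_dim, act_dim)) * 0.01

    def act(self, obs, instruction_embedding):
        features = np.concatenate([obs, instruction_embedding])
        return np.tanh(features @ self.w)  # bounded continuous action
```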
Panoptic Narrative Grounding
This paper proposes Panoptic Narrative Grounding, a spatially fine and general formulation of the natural language visual grounding problem.
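Since the formulation grounds noun phrases of a narrative to panoptic segments, a natural toy baseline is nearest-neighbor matching between phrase and segment embeddings. The cosine-similarity scheme below is a hypothetical illustration, not the paper's method.

```python
# Hypothetical baseline for panoptic narrative grounding: assign each
# noun phrase to the panoptic segment with the most similar embedding.
# Illustrative matching scheme only, not the paper's model.

import numpy as np

def match_phrases_to_segments(phrase_embs, segment_embs):
    """phrase_embs: (P, D), segment_embs: (S, D), all rows nonzero.
    Returns, for each phrase, the index of the best-matching segment."""
    p = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    return (p @ s.T).argmax(axis=1)  # cosine similarity, argmax per phrase
```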
Composing Pick-and-Place Tasks By Grounding Language
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions.
A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions
Recent models achieve promising results in visually grounded dialogues.
Learning Cross-modal Context Graph for Visual Grounding
To address the limitations of prior methods, this paper proposes a language-guided graph representation that captures the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
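The matching step can be illustrated with a toy formulation: score each phrase-to-region assignment by node similarity plus a bonus when related phrases map to related regions. The exhaustive search below is purely illustrative and only feasible for tiny inputs; it is not the paper's model.

```python
# Toy cross-modal graph matching for multi-phrase grounding: node
# similarity plus a pairwise relation-consistency bonus, solved by brute
# force. Purely illustrative; not the paper's matching strategy.

import itertools
import numpy as np

def match_graph(node_sim, phrase_edges, region_edges, lam=0.5):
    """node_sim: (P, R) phrase-region similarity matrix.
    phrase_edges: set of (i, j) related phrase pairs.
    region_edges: set of (a, b) related region pairs.
    Exhaustive search over assignments (only viable for tiny P and R)."""
    P, R = node_sim.shape
    best, best_score = None, -np.inf
    for assign in itertools.product(range(R), repeat=P):
        score = sum(node_sim[i, assign[i]] for i in range(P))
        # Reward assignments that preserve phrase relations between regions.
        score += lam * sum((assign[i], assign[j]) in region_edges
                           for i, j in phrase_edges)
        if score > best_score:
            best, best_score = assign, score
    return best, best_score
```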