Natural Language Visual Grounding
16 papers with code • 0 benchmarks • 6 datasets
This page tracks progress in Natural Language Visual Grounding; with no benchmarks yet, it collects papers and datasets rather than leaderboards.
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.
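A minimal sketch of the instruction-to-action mapping ALFRED targets, assuming PyTorch; the layer sizes, the pre-extracted frame feature, and the action vocabulary are illustrative assumptions, not ALFRED's actual baseline model:

```python
import torch
import torch.nn as nn

class InstructionFollower(nn.Module):
    """Encode a directive once, then fuse it with the current egocentric
    frame feature at each timestep to predict the next household action."""
    def __init__(self, vocab_size=1000, n_actions=13, d=256, d_img=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.instr_enc = nn.LSTM(d, d, batch_first=True)
        self.fuse = nn.Linear(d + d_img, d)
        self.action_head = nn.Linear(d, n_actions)

    def forward(self, instr_ids, frame_feat):
        # instr_ids: (B, T) token ids; frame_feat: (B, d_img) from a frozen CNN
        _, (h, _) = self.instr_enc(self.embed(instr_ids))
        fused = torch.relu(self.fuse(torch.cat([h[-1], frame_feat], dim=-1)))
        return self.action_head(fused)  # (B, n_actions) next-action logits

# Toy usage: 2 directives of 8 tokens, 512-d frame features.
policy = InstructionFollower()
logits = policy(torch.randint(0, 1000, (2, 8)), torch.randn(2, 512))
```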
We propose a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, where the attention can be either latent or optimized directly.
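A minimal sketch of grounding-by-reconstruction in PyTorch, assuming pre-extracted region-proposal features; the reconstruction decoder is collapsed to a bag-of-words classifier for brevity, and all dimensions are invented:

```python
import torch
import torch.nn as nn

class PhraseReconstructionGrounder(nn.Module):
    """Attend over image regions conditioned on the phrase, then reconstruct
    the phrase from the attended feature; the attention weights act as the
    (latent) grounding and need no box-level supervision."""
    def __init__(self, vocab_size=1000, d_txt=256, d_img=512, d_joint=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_txt)
        self.phrase_enc = nn.LSTM(d_txt, d_txt, batch_first=True)
        self.img_proj = nn.Linear(d_img, d_joint)
        self.txt_proj = nn.Linear(d_txt, d_joint)
        self.att_score = nn.Linear(d_joint, 1)
        self.decoder = nn.Linear(d_img, vocab_size)  # bag-of-words reconstruction

    def forward(self, phrase_ids, region_feats):
        # phrase_ids: (B, T) word ids; region_feats: (B, R, d_img) proposals
        _, (h, _) = self.phrase_enc(self.embed(phrase_ids))
        q = self.txt_proj(h[-1]).unsqueeze(1)                    # (B, 1, d_joint)
        k = self.img_proj(region_feats)                          # (B, R, d_joint)
        alpha = self.att_score(torch.tanh(q + k)).squeeze(-1).softmax(-1)
        attended = (alpha.unsqueeze(-1) * region_feats).sum(1)   # (B, d_img)
        return alpha, self.decoder(attended)  # grounding weights, recon logits
```

Training on the reconstruction loss alone keeps the attention latent; adding a supervised loss on the attention weights corresponds to the directly optimized variant.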
The Vision-and-Language Navigation (VLN) task requires an agent to follow navigation instructions through photo-realistic, previously unseen environments.
Computer vision applications often require a textual grounding module that is precise, interpretable, and resilient to counterfactual inputs and queries.
Especially in ambiguous settings, humans prefer expressions (called relational referring expressions) that describe an object with respect to a distinguishing, unique object.
To address the limitations of prior methods, this paper proposes a language-guided graph representation that captures the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
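A toy sketch of cross-modal graph matching, assuming node features for phrases and regions plus binary adjacency matrices for their relations; the single round of message passing and the cosine-similarity matcher are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphMatcher(nn.Module):
    """Propagate context along phrase-phrase and region-region edges, then
    softly assign each contextualized phrase node to a region node."""
    def __init__(self, d=256):
        super().__init__()
        self.txt_msg = nn.Linear(d, d)
        self.img_msg = nn.Linear(d, d)

    def forward(self, phrase_nodes, phrase_adj, region_nodes, region_adj):
        # phrase_nodes: (P, d), phrase_adj: (P, P); region_*: (R, d), (R, R)
        p = phrase_nodes + torch.relu(phrase_adj @ self.txt_msg(phrase_nodes))
        r = region_nodes + torch.relu(region_adj @ self.img_msg(region_nodes))
        sim = F.normalize(p, dim=-1) @ F.normalize(r, dim=-1).T
        return sim.softmax(dim=-1)  # (P, R) soft phrase-to-region assignment
```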
ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions.
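A toy illustration of that abstract-to-grounded correspondence, with an invented lookup table and action names (ALFWorld's actual action space differs):

```python
from typing import List

ABSTRACT_TO_GROUNDED = {
    # Abstract TextWorld-style command -> concrete, visually grounded steps.
    "take mug from desk": ["Navigate(desk)", "Pickup(mug)"],
    "heat mug with microwave": ["Navigate(microwave)", "Open(microwave)",
                                "Put(mug)", "ToggleOn(microwave)"],
}

def ground_abstract_action(command: str) -> List[str]:
    """Expand one abstract text command into low-level grounded actions."""
    return ABSTRACT_TO_GROUNDED.get(command, [])
```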