Natural Language Visual Grounding
16 papers with code • 0 benchmarks • 6 datasets
Most implemented papers
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.
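The task can be pictured as a sequence-to-sequence policy that maps an instruction and per-step egocentric observations to discrete actions. Below is a minimal PyTorch sketch of such an interface; it is not the ALFRED baseline model, and the module names, action space, and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Seq2SeqAgent(nn.Module):
    """Toy instruction-following policy: encode the instruction once, then
    fuse it with each egocentric frame feature to predict the next action."""
    def __init__(self, vocab_size, n_actions, emb=256, hid=512, vis=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lang_enc = nn.LSTM(emb, hid, batch_first=True)
        self.policy = nn.LSTMCell(hid + vis, hid)
        self.act = nn.Linear(hid, n_actions)

    def forward(self, instruction, frames):
        # instruction: (B, T) token ids; frames: (B, S, vis) per-step features
        _, (h_lang, _) = self.lang_enc(self.embed(instruction))
        h, c = h_lang[-1], torch.zeros_like(h_lang[-1])
        logits = []
        for t in range(frames.size(1)):
            # condition each step on the instruction summary and current view
            h, c = self.policy(torch.cat([h_lang[-1], frames[:, t]], -1), (h, c))
            logits.append(self.act(h))
        return torch.stack(logits, dim=1)  # (B, S, n_actions)
```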
Grounding of Textual Phrases in Images by Reconstruction
We propose a novel approach that learns grounding by reconstructing a given phrase with an attention mechanism, which can be either latent or optimized directly.
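A minimal PyTorch sketch of the reconstruction idea, assuming precomputed region-proposal features (the class name, layer sizes, and decoder design are assumptions, not the paper's exact architecture): the attention weights over regions act as the latent grounding, and supervision comes only from reconstructing the phrase from the attended visual feature.

```python
import torch
import torch.nn as nn

class GroundingByReconstruction(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, vis_dim=512, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.phrase_enc = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.att_score = nn.Linear(hid_dim + vis_dim, 1)
        self.decoder = nn.LSTM(emb_dim + vis_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, phrase, regions):
        # phrase: (B, T) token ids; regions: (B, R, vis_dim) proposal features
        emb = self.embed(phrase)
        _, (h, _) = self.phrase_enc(emb)
        q = h[-1].unsqueeze(1).expand(-1, regions.size(1), -1)
        # attention over regions = the (latent) grounding of the phrase
        alpha = torch.softmax(
            self.att_score(torch.cat([q, regions], dim=-1)).squeeze(-1), dim=-1
        )
        ctx = (alpha.unsqueeze(-1) * regions).sum(1)  # attended visual feature
        # reconstruct the phrase conditioned on the attended feature
        dec_in = torch.cat([emb, ctx.unsqueeze(1).expand(-1, emb.size(1), -1)], -1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), alpha  # reconstruction logits, grounding weights
```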
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments.
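The self-monitoring idea can be sketched as an auxiliary regression head trained jointly with the action predictor. The version below is a hedged illustration, not the paper's architecture, and the training target (e.g., normalized progress toward the goal) is an assumption:

```python
import torch
import torch.nn as nn

class ProgressMonitor(nn.Module):
    """Auxiliary head that regresses how far the agent is through the
    instruction (0 = start, 1 = goal), trained alongside action prediction."""
    def __init__(self, hid_dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hid_dim, hid_dim // 2),
                                  nn.ReLU(),
                                  nn.Linear(hid_dim // 2, 1))

    def forward(self, agent_state):
        # agent_state: (B, hid_dim) -> (B,) estimated progress in [0, 1]
        return torch.sigmoid(self.head(agent_state)).squeeze(-1)

def total_loss(action_logits, action_gt, progress_pred, progress_gt, lam=0.5):
    """Combine the usual action loss with the auxiliary progress loss."""
    ce = nn.functional.cross_entropy(action_logits, action_gt)
    mse = nn.functional.mse_loss(progress_pred, progress_gt)
    return ce + lam * mse
```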
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions.
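As a purely hypothetical illustration of the alignment (the command strings and action names below are invented), each abstract text action learned in TextWorld resolves to a concrete sequence of visually grounded actions at execution time:

```python
# Hypothetical mapping from abstract TextWorld commands to embodied actions.
ABSTRACT_TO_GROUNDED = {
    "go to the fridge": ["MoveAhead", "RotateLeft", "MoveAhead"],
    "open the fridge":  ["OpenObject"],
    "take the apple":   ["PickupObject"],
}

def execute(text_command: str) -> list[str]:
    """Resolve an abstract text action into low-level embodied actions."""
    return ABSTRACT_TO_GROUNDED.get(text_command, ["Stop"])
```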
Composing Pick-and-Place Tasks By Grounding Language
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.
Robust Change Captioning
We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning.
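A dual attention mechanism of this kind can be sketched as two attention passes, one over the "before" image and one over the "after" image, whose attended features are differenced to represent the change. The code below is an illustrative simplification, not the DUDA implementation:

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Attends separately over 'before' and 'after' image features; the
    difference of the attended features summarizes the change for captioning."""
    def __init__(self, vis_dim=512, hid_dim=512):
        super().__init__()
        self.score = nn.Linear(vis_dim + hid_dim, 1)

    def attend(self, feats, h):
        # feats: (B, N, vis_dim) spatial features; h: (B, hid_dim) decoder state
        q = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        alpha = torch.softmax(self.score(torch.cat([feats, q], -1)).squeeze(-1), -1)
        return (alpha.unsqueeze(-1) * feats).sum(1)

    def forward(self, before, after, h):
        l_before = self.attend(before, h)
        l_after = self.attend(after, h)
        return l_before, l_after, l_after - l_before  # change representation
```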
Modularized Textual Grounding for Counterfactual Resilience
Computer vision applications often require a textual grounding module that offers precision, interpretability, and resilience to counterfactual inputs/queries.
Searching for Ambiguous Objects in Videos using Relational Referring Expressions
Especially in ambiguous settings, humans prefer expressions (called relational referring expressions) that describe an object with respect to a distinguishing, unique object.
Learning Cross-modal Context Graph for Visual Grounding
To address the limitations of prior approaches, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
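A minimal sketch of the matching step, with all names and dimensions assumed rather than taken from the paper: phrase nodes exchange messages along language-derived relation edges, and each contextualized phrase is then scored against candidate regions.

```python
import torch
import torch.nn as nn

class CrossModalGraphMatching(nn.Module):
    """Phrase nodes exchange messages along relation edges from the language
    graph; contextualized phrases are matched against region features."""
    def __init__(self, dim=512):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, phrases, adj, regions):
        # phrases: (P, dim) phrase nodes; adj: (P, P) relation adjacency
        # regions: (R, dim) candidate region features
        messages = adj @ self.msg(phrases)     # aggregate neighbor context
        context = self.upd(messages, phrases)  # update each phrase node
        scores = context @ regions.t()         # (P, R) matching scores
        return scores.softmax(dim=-1)          # per-phrase grounding distribution
```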
A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions
Recent models achieve promising results in visually grounded dialogues.