Natural Language Visual Grounding

16 papers with code • 0 benchmarks • 6 datasets

Natural language visual grounding is the task of localizing the visual content (image regions, video objects, or actions and entities in embodied environments) that a natural language phrase or instruction refers to.

Most implemented papers

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

askforalfred/alfred CVPR 2020

We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.
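
A minimal sketch of the mapping the benchmark targets, assuming precomputed 512-d egocentric frame features and a fixed action vocabulary; the class name, dimensions, and action count are illustrative, not the official ALFRED baseline.

    import torch
    import torch.nn as nn

    class InstructionToActionPolicy(nn.Module):  # hypothetical name, not the ALFRED model
        def __init__(self, vocab_size=1000, n_actions=12, dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.lang_enc = nn.LSTM(dim, dim, batch_first=True)
            self.frame_proj = nn.Linear(512, dim)     # assumes 512-d CNN frame features
            self.policy = nn.LSTMCell(dim, dim)
            self.action_head = nn.Linear(dim, n_actions)

        def forward(self, instr_tokens, frame_feats):
            # instr_tokens: (B, T_lang) token ids; frame_feats: (B, T_act, 512)
            _, (h_lang, _) = self.lang_enc(self.embed(instr_tokens))
            h, c = h_lang[-1], torch.zeros_like(h_lang[-1])   # init policy with language
            logits = []
            for t in range(frame_feats.size(1)):              # one action per time step
                h, c = self.policy(self.frame_proj(frame_feats[:, t]), (h, c))
                logits.append(self.action_head(h))
            return torch.stack(logits, dim=1)                 # (B, T_act, n_actions)

    policy = InstructionToActionPolicy()
    action_logits = policy(torch.randint(0, 1000, (2, 8)), torch.randn(2, 5, 512))
    print(action_logits.shape)  # torch.Size([2, 5, 12])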

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
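
The reconstruction idea lends itself to a short sketch: attend over candidate region features given the phrase encoding, reconstruct the phrase from the attended feature, and read the grounding off the attention weights. This is a minimal illustration assuming precomputed region-proposal features and a bag-of-words reconstruction target; names and dimensions are not the paper's code.

    import torch
    import torch.nn as nn

    class ReconstructionGrounder(nn.Module):  # illustrative name
        def __init__(self, vocab_size=1000, region_dim=512, dim=128):
            super().__init__()
            self.phrase_enc = nn.EmbeddingBag(vocab_size, dim)   # mean-pooled phrase encoding
            self.region_proj = nn.Linear(region_dim, dim)
            self.attn = nn.Linear(2 * dim, 1)                    # score each region for the phrase
            self.decoder = nn.Linear(dim, vocab_size)            # reconstruct the phrase words

        def forward(self, phrase_tokens, region_feats):
            # phrase_tokens: (B, T) token ids; region_feats: (B, R, region_dim)
            p = self.phrase_enc(phrase_tokens)                   # (B, dim)
            r = self.region_proj(region_feats)                   # (B, R, dim)
            scores = self.attn(torch.cat([p.unsqueeze(1).expand_as(r), r], dim=-1)).squeeze(-1)
            alpha = scores.softmax(dim=-1)                       # (B, R) grounding weights
            attended = (alpha.unsqueeze(-1) * r).sum(dim=1)      # (B, dim)
            return self.decoder(attended), alpha                 # word logits + attention

    model = ReconstructionGrounder()
    word_logits, alpha = model(torch.randint(0, 1000, (2, 4)), torch.randn(2, 10, 512))
    # Training minimizes a reconstruction loss on the phrase words; at test time,
    # argmax over alpha selects the grounded region without box supervision.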

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

chihyaoma/selfmonitoring-agent ICLR 2019

The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic, unknown environments.
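
The auxiliary progress-estimation idea in the title can be summarized in a few lines: alongside the next-action prediction, the agent regresses how much of the instruction it has completed, and the two losses are combined. The function name and loss weight below are illustrative, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def self_monitoring_loss(action_logits, action_targets,
                             progress_pred, progress_target, aux_weight=0.5):
        # action_logits: (B, n_actions); progress_pred/target: (B,) in [0, 1]
        action_loss = F.cross_entropy(action_logits, action_targets)
        progress_loss = F.mse_loss(progress_pred, progress_target)  # auxiliary progress signal
        return action_loss + aux_weight * progress_loss

    loss = self_monitoring_loss(torch.randn(4, 6), torch.randint(0, 6, (4,)),
                                torch.rand(4), torch.rand(4))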

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

alfworld/alfworld 8 Oct 2020

ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions.
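
One way to picture the alignment is a translation layer from abstract TextWorld-style commands to concrete embodied actions; the sketch below is hypothetical and does not use the ALFWorld API.

    from typing import List

    def expand_text_command(command: str) -> List[str]:
        """Map one abstract text action to a concrete low-level action sequence."""
        verb, _, rest = command.partition(" ")
        if verb == "goto":
            return [f"NavigateTo({rest})"]
        if verb == "take":
            obj, _, receptacle = rest.partition(" from ")
            return [f"NavigateTo({receptacle})", f"PickupObject({obj})"]
        if verb == "put":
            obj, _, receptacle = rest.partition(" in ")
            return [f"NavigateTo({receptacle})", f"PutObject({obj}, {receptacle})"]
        raise ValueError(f"unhandled abstract action: {command}")

    print(expand_text_command("take apple from countertop"))
    # ['NavigateTo(countertop)', 'PickupObject(apple)']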

Composing Pick-and-Place Tasks By Grounding Language

mees/AIS-Alexa-Robot 16 Feb 2021

Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.

Robust Change Captioning

Seth-Park/RobustChangeCaptioning ICCV 2019

We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning.
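
A minimal sketch of the dual-attention idea: spatial attention over the "before" and "after" feature maps, conditioned on their difference, yields a pair of localized features for a caption decoder. Module names and dimensions are illustrative, not the DUDA implementation.

    import torch
    import torch.nn as nn

    class DualAttention(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.score_before = nn.Linear(2 * dim, 1)
            self.score_after = nn.Linear(2 * dim, 1)

        def attend(self, feats, diff, scorer):
            # feats, diff: (B, N, dim) spatial features and per-location difference
            alpha = scorer(torch.cat([feats, diff], dim=-1)).softmax(dim=1)  # (B, N, 1)
            return (alpha * feats).sum(dim=1)                                # (B, dim)

        def forward(self, before, after):
            diff = after - before
            return (self.attend(before, diff, self.score_before),
                    self.attend(after, diff, self.score_after))  # fed to a caption decoder

    l_before, l_after = DualAttention()(torch.randn(2, 49, 512), torch.randn(2, 49, 512))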

Modularized Textual Grounding for Counterfactual Resilience

jacobswan1/MTG-pytorch CVPR 2019

Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries.

Searching for Ambiguous Objects in Videos using Relational Referring Expressions

hazananayurt/viref 3 Aug 2019

Especially in ambiguous settings, humans prefer expressions (called relational referring expressions) that describe an object with respect to a distinguishing, unique object.
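
As a toy illustration (hypothetical, not the paper's pipeline), a relational referring expression can be represented as a target described relative to a distinguishing anchor object:

    from dataclasses import dataclass

    @dataclass
    class RelationalReferringExpression:
        target: str    # the ambiguous object being referred to
        relation: str  # the relation to the anchor, e.g. spatial or temporal
        anchor: str    # the unique, distinguishing object

    expr = RelationalReferringExpression(target="cup", relation="next to", anchor="laptop")
    print(f"the {expr.target} {expr.relation} the {expr.anchor}")  # the cup next to the laptop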

Learning Cross-modal Context Graph for Visual Grounding

youngfly11/LCMCG-PyTorch AAAI 2020

To address the limitations of prior approaches, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
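
A minimal sketch of cross-modal graph matching, assuming precomputed phrase and region embeddings plus relation edges on each side; the scoring below is a simple illustration, not the paper's matching strategy.

    import torch
    import torch.nn.functional as F

    def graph_matching_score(phrase_emb, region_emb, phrase_edges, region_adj, edge_weight=0.5):
        # phrase_emb: (P, d), region_emb: (R, d); phrase_edges: list of (i, j) phrase pairs
        # region_adj: (R, R) affinities between regions (e.g. spatial relation scores)
        node_sim = F.normalize(phrase_emb, dim=-1) @ F.normalize(region_emb, dim=-1).T  # (P, R)
        assign = node_sim.softmax(dim=-1)              # soft phrase -> region assignment
        node_score = (assign * node_sim).sum()         # agreement of the assigned nodes
        edge_score = sum(assign[i] @ region_adj @ assign[j]   # assignments whose regions
                         for i, j in phrase_edges)            # are also related score higher
        return assign, node_score + edge_weight * edge_score  # grounding + global match score

    assign, score = graph_matching_score(torch.randn(3, 64), torch.randn(10, 64),
                                         [(0, 1), (1, 2)], torch.rand(10, 10))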