Visual Grounding

71 papers with code • 3 benchmarks • 1 dataset


Most implemented papers

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

akirafukui/vqa-mcb EMNLP 2016

Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
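
The three baseline pooling operations named above can be illustrated with a minimal sketch. It assumes the visual and textual representations are already plain feature vectors; the function names and toy values are hypothetical and are not taken from the akirafukui/vqa-mcb code. It shows only these baselines, not compact bilinear pooling itself.

```python
from typing import List

Vector = List[float]


def _check_same_length(visual: Vector, textual: Vector) -> None:
    # Element-wise operations require both vectors to have the same dimension.
    if len(visual) != len(textual):
        raise ValueError("visual and textual vectors must have the same length")


def elementwise_product(visual: Vector, textual: Vector) -> Vector:
    """Fuse two vectors by multiplying matching components."""
    _check_same_length(visual, textual)
    return [v * t for v, t in zip(visual, textual)]


def elementwise_sum(visual: Vector, textual: Vector) -> Vector:
    """Fuse two vectors by adding matching components."""
    _check_same_length(visual, textual)
    return [v + t for v, t in zip(visual, textual)]


def concatenate(visual: Vector, textual: Vector) -> Vector:
    """Fuse two vectors by stacking them; the lengths may differ."""
    return list(visual) + list(textual)


# Toy feature vectors (hypothetical values, for illustration only).
visual = [1.0, 2.0, 3.0]
textual = [0.5, -1.0, 2.0]

product = elementwise_product(visual, textual)
summed = elementwise_sum(visual, textual)
joined = concatenate(visual, textual)
```

Element-wise product and sum keep the output at the input dimension, while concatenation doubles it and leaves any interaction between the two modalities to later layers.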

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Revisiting Visual Question Answering Baselines

Cold-Winter/vqs 27 Jun 2016

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding.

Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat

shekharRavi/Beyond-Task-Success-NAACL2019 NAACL 2019

We compare our approach to an alternative system which extends the baseline with reinforcement learning.

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

gicheonkang/DAN-VisDial IJCNLP 2019

Specifically, the REFER module learns latent relationships between a given question and a dialog history by employing a self-attention mechanism.

A Fast and Accurate One-Stage Approach to Visual Grounding

zyang-ur/onestage_grounding ICCV 2019

We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight.

Composing Pick-and-Place Tasks By Grounding Language

mees/AIS-Alexa-Robot 16 Feb 2021

Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.

TransVG: End-to-End Visual Grounding with Transformers

djiajunustc/TransVG ICCV 2021

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.

Word Discovery in Visually Grounded, Self-Supervised Speech Models

kamperh/vqwordseg 28 Mar 2022

We present a method for visually-grounded spoken term discovery.