Referring Expression Comprehension

66 papers with code • 7 benchmarks • 7 datasets

Referring Expression Comprehension (REC) is the task of localizing the image region, typically a bounding box, that a natural-language expression refers to (e.g., "the man in the red jacket on the left").
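Across these benchmarks, REC models are conventionally evaluated with Acc@0.5: a prediction counts as correct when its box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. A minimal sketch of that metric (the box format and threshold follow the common convention; the function names are illustrative):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(preds, gts):
    """Fraction of predicted boxes whose IoU with the ground truth is >= 0.5."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(preds, gts))
    return hits / len(gts)
```

Because REC predicts exactly one box per expression, this single threshold is usually the whole evaluation; there is no per-class mAP as in generic detection.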

Most implemented papers

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

lichengunc/speaker_listener_reinforcer CVPR 2017

The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.

A Fast and Accurate One-Stage Approach to Visual Grounding

zyang-ur/onestage_grounding ICCV 2019

We propose a simple, fast, and accurate one-stage approach to visual grounding.

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

zhegan27/VILLA NeurIPS 2020

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.

Unifying Vision-and-Language Tasks via Text Generation

j-min/VL-T5 4 Feb 2021

On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning (most of which were previously modeled as discriminative tasks), our generative approach with a single unified architecture reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.

TransVG: End-to-End Visual Grounding with Transformers

djiajunustc/TransVG ICCV 2021

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.
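The core design can be caricatured as fusing visual and text tokens together with a learnable [REG] token and regressing box coordinates directly from that token's output state. The sketch below stubs the transformer out with mean pooling and uses random weights; every name in it is illustrative, not the paper's actual code:

```python
import numpy as np

def transvg_style_predict(visual_tokens, text_tokens, rng=None):
    """Toy sketch of TransVG-style grounding: prepend a learnable [REG]
    token to the concatenated visual + text token sequence, fuse (here a
    mean-pool stand-in for the real transformer), and regress a
    normalized (cx, cy, w, h) box with a sigmoid head."""
    rng = rng or np.random.default_rng(0)
    d = visual_tokens.shape[1]
    reg_token = rng.normal(size=d)                  # learnable [REG] embedding
    tokens = np.vstack([reg_token[None], visual_tokens, text_tokens])
    fused = tokens.mean(axis=0)                     # transformer stand-in
    W, b = rng.normal(size=(4, d)), np.zeros(4)     # regression head weights
    return 1.0 / (1.0 + np.exp(-(W @ fused + b)))   # sigmoid -> box in [0,1]^4
```

The point of the design is that the box is predicted directly, with no region proposals or anchor matching, which is what distinguishes this line of work from two-stage grounding.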

ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

allenai/reclip ACL 2022

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain.
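ReCLIP's underlying zero-shot recipe, score each candidate region against the expression with a pretrained image-text model and keep the best match, can be sketched with precomputed embeddings standing in for CLIP's encoders (the inputs and function name here are illustrative; the real system applies CLIP's text and image encoders and adds relation-handling heuristics):

```python
import numpy as np

def pick_region(expr_emb, region_embs, boxes):
    """Zero-shot grounding sketch: cosine similarity between the
    expression embedding and each candidate region's embedding, then
    return the best-matching box. No REC-specific training involved."""
    e = expr_emb / np.linalg.norm(expr_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    return boxes[int(np.argmax(r @ e))]
```

Because the scorer is a generic pretrained model, this kind of baseline transfers to a new visual domain without collecting any referring expressions for it, which is the motivation stated in the abstract above.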

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

OFA-Sys/ONE-PEACE 18 May 2023

In this work, we explore a scalable way for building a general representation model toward unlimited modalities.

Kosmos-2: Grounding Multimodal Large Language Models to the World

microsoft/unilm 26 Jun 2023

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.

Described Object Detection: Liberating Object Detection with Flexible Expressions

charles-xie/awesome-described-object-detection NeurIPS 2023

In this paper, we advance them to a more practical setting called Described Object Detection (DOD), expanding category names to flexible language expressions for OVD and overcoming REC's limitation of grounding only pre-existing objects.

Natural Language Object Retrieval

ronghanghu/natural-language-object-retrieval CVPR 2016

In this paper, we address the task of natural language object retrieval: localizing a target object within a given image based on a natural language query describing the object.