Referring Expression
117 papers with code • 1 benchmark • 3 datasets
Referring expression comprehension localizes the object instance described by a natural-language expression, typically by placing a bounding box around it in the image.
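Comprehension models are commonly scored by intersection-over-union (IoU) between the predicted and ground-truth boxes, with a prediction counted as correct when IoU exceeds 0.5. A minimal sketch (the box coordinates are illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Predicted vs. ground-truth box for a hypothetical expression
pred, gold = (10, 10, 50, 50), (20, 20, 60, 60)
correct = iou(pred, gold) >= 0.5  # standard acceptance threshold
```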
Libraries
Use these libraries to find Referring Expression models and implementations.
Most implemented papers
Kosmos-2: Grounding Multimodal Large Language Models to the World
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Described Object Detection: Liberating Object Detection with Flexible Expressions
In this paper, we advance them to a more practical setting called Described Object Detection (DOD), expanding category names to flexible language expressions for OVD and overcoming the limitation of REC, which can only ground pre-existing objects.
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Specifically, we present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.
Generation and Comprehension of Unambiguous Object Descriptions
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described.
Reasoning About Pragmatics with Neural Listeners and Speakers
We present a model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics.
Modeling Context Between Objects for Referring Expression Understanding
Our approach uses an LSTM to learn the probability of a referring expression, with input features from a region and a context region.
Colors in Context: A Pragmatic Neural Model for Grounded Language Understanding
We present a model of pragmatic referring expression interpretation in a grounded communication task (identifying colors from descriptions) that draws upon predictions from two recurrent neural network classifiers, a speaker and a listener, unified by a recursive pragmatic reasoning framework.
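The speaker–listener recursion in the two pragmatics papers above can be sketched with a rational-speech-acts-style calculation: a literal listener resolves an utterance by its semantics alone, a pragmatic speaker prefers utterances under which that listener picks the target, and a pragmatic listener inverts the speaker. The lexicon and objects below are illustrative, not taken from either paper:

```python
# Illustrative context: three objects (blue square, blue circle, green square)
# and three utterances with their literal truth values per object.
utterances = ["blue", "circle", "square"]
lexicon = {
    "blue":   [1, 1, 0],
    "circle": [0, 1, 0],
    "square": [1, 0, 1],
}

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def literal_listener(utt):
    # L0: uniform over objects consistent with the utterance
    return normalize(lexicon[utt])

def pragmatic_speaker(obj_idx):
    # S1: prefers utterances under which L0 picks the target object
    return normalize([literal_listener(u)[obj_idx] for u in utterances])

def pragmatic_listener(utt):
    # L1: inverts S1 under a uniform prior over objects
    u_idx = utterances.index(utt)
    return normalize([pragmatic_speaker(o)[u_idx] for o in range(3)])

# "blue" is literally true of both blue objects, but the pragmatic
# listener shifts mass toward the blue square: to pick out the blue
# circle, the speaker would more likely have said "circle".
posterior = pragmatic_listener("blue")
```

This contrastive shift, with no object uniquely named, is the behavior the neural speaker/listener models above learn from data rather than compute from a hand-written lexicon.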
Grounding Referring Expressions in Images by Variational Context
This is a general yet challenging vision-language task, since it requires not only the localization of objects but also the multimodal comprehension of context --- visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category.
MAttNet: Modular Attention Network for Referring Expression Comprehension
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.