Referring Expression Comprehension
68 papers with code • 8 benchmarks • 8 datasets
Libraries
Use these libraries to find Referring Expression Comprehension models and implementations
Most implemented papers
Natural Language Object Retrieval
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.
MAttNet: Modular Attention Network for Referring Expression Comprehension
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Explainable Neural Computation via Stack Neural Module Networks
In complex inferential tasks like question answering, machine learning models must confront two challenges: the need to implement a compositional reasoning process, and, in many applications, the need for this reasoning process to be interpretable to assist users in both development and prediction.
Language-Conditioned Graph Networks for Relational Reasoning
For example, conditioning on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which can be easily consumed by a simple classifier for answer prediction.
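The message-passing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings, the sigmoid relation gate, and the transform matrix are all hypothetical stand-ins (in the actual model these would be produced by learned encoders over the image and the expression).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding size (illustrative)

# hypothetical object embeddings for "mug" and "plate"
mug, plate = rng.normal(size=d), rng.normal(size=d)

# language-conditioned gate for the relation "on" (assumed to be
# derived from an encoding of the expression; random here)
gate_on = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))  # sigmoid gate in (0, 1)

W = rng.normal(size=(d, d)) * 0.1  # message transform (illustrative)

# "mug" gathers a gated message from "plate" and updates its
# representation, yielding a contextualized embedding akin to
# "mug on the plate"
message = gate_on * (W @ plate)
mug_updated = mug + message
```

A downstream classifier would then read `mug_updated` off the graph to predict the answer or the referred region.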
Talk2Car: Taking Control of Your Self-Driving Car
More specifically, we consider the problem in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene.
A Real-time Global Inference Network for One-stage Referring Expression Comprehension
Referring Expression Comprehension (REC) is an emerging research area in computer vision, which refers to detecting the target region in an image given a text description.
Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge
In this case, we need to use commonsense knowledge to identify the objects in the image.
AttnGrounder: Talking to Cars with Attention
Visual grounding aims to localize a specific object in an image based on a given natural language text query.
Cosine meets Softmax: A tough-to-beat baseline for visual grounding
In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms the state of the art methods, while retaining minimal design choices.
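The title suggests the shape of such a baseline: score candidate regions against the text query with cosine similarity and normalize with a softmax. The sketch below is an assumption-laden illustration of that idea, not the paper's code; the text and region features are random stand-ins for what would come from learned encoders.

```python
import numpy as np

def cosine_softmax_grounding(text_emb, region_feats):
    """Score image regions against a text query: cosine similarity
    per region, then a softmax over regions (a sketch; feature
    extraction is assumed to happen upstream)."""
    t = text_emb / np.linalg.norm(text_emb)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sims = r @ t                     # cosine similarity per region
    exp = np.exp(sims - sims.max())  # numerically stable softmax
    probs = exp / exp.sum()
    return probs.argmax(), probs     # index of the best-matching region

rng = np.random.default_rng(0)
best, probs = cosine_softmax_grounding(
    rng.normal(size=16),          # hypothetical text embedding
    rng.normal(size=(5, 16)),     # hypothetical features for 5 regions
)
```

At inference, `best` would select the grounded region among the candidates.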
Language-Conditioned Feature Pyramids for Visual Selection Tasks
However, few models consider the fusion of linguistic features with multiple visual features with different sizes of receptive fields, though the proper size of the receptive field of visual features intuitively varies depending on expressions.