Referring Expression Comprehension
66 papers with code • 7 benchmarks • 7 datasets
Latest papers
Elysium: Exploring Object-level Perception in Videos via MLLM
To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG).
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).
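As a rough illustration of running REC with such an open-set detector, here is a minimal sketch assuming the Hugging Face transformers port of Grounding DINO and the IDEA-Research/grounding-dino-tiny checkpoint; neither is specified by this paper, whose own pipeline (MM-Grounding-DINO) is built on MMDetection.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Assumed checkpoint, used here only for illustration.
model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("example.jpg")
# Grounding DINO expects lowercase phrases terminated by periods.
expression = "the person in the red jacket."

inputs = processor(images=image, text=expression, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into scored detections at the image's size;
# for REC, the top-scoring box is taken as the referred object.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]
print(results["boxes"], results["scores"])
```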
General Object Foundation Model for Images and Videos at Scale
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos.
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
After that, grounding is accomplished by computing a structural similarity matrix between visual and textual triplets with a VLA model and propagating it to an instance-level similarity matrix.
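The propagation step can be pictured with a small sketch. This is an illustrative simplification, not the authors' code: triplet embeddings are assumed to come from some VLA encoder, and each visual (subject, predicate, object) triad is assumed to be tagged with the instance id of its subject.

```python
import numpy as np

def ground_expression(visual_triplets, visual_subjects, textual_triplets, n_instances):
    """Score each detected instance as the referent of the expression.

    visual_triplets:  (M, d) array, one embedding per visual triad
    visual_subjects:  length-M list, instance id of each triad's subject
    textual_triplets: (K, d) array, one embedding per triad parsed from the expression
    """
    v = visual_triplets / np.linalg.norm(visual_triplets, axis=1, keepdims=True)
    t = textual_triplets / np.linalg.norm(textual_triplets, axis=1, keepdims=True)
    S = v @ t.T                    # (M, K) structural similarity matrix
    triad_scores = S.max(axis=1)   # best textual match for each visual triad

    # Propagate triplet-level similarity to the instance level: an instance
    # inherits the score of its best-matching triad (instances with no triad
    # keep -inf).
    inst = np.full(n_instances, -np.inf)
    for m, i in enumerate(visual_subjects):
        inst[i] = max(inst[i], triad_scores[m])
    return inst  # argmax gives the grounded instance
```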
Continual Referring Expression Comprehension via Dual Modular Memorization
In this paper, we propose Continual Referring Expression Comprehension (CREC), a new setting for REC in which a model learns from a stream of incoming tasks.
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
More importantly, we present Griffon, a purely LVLM-based baseline, which does not require the introduction of any special tokens, expert models, or additional detection modules.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
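The core idea is straightforward to sketch: partition the image into regions (e.g., with an off-the-shelf segmenter), overlay a numeric mark on each region, and let the LMM answer in terms of the marks. Below is a minimal, hypothetical rendering step with Pillow; placing marks at mask centroids is an assumption, as the paper explores several mark types and placement rules.

```python
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image, masks):
    """Draw a numeric mark at the centroid of each region mask.

    image: PIL.Image; masks: list of HxW boolean arrays (e.g., from a segmenter).
    The marked image is then sent to the LMM with a prompt such as
    "Which mark is on 'the dog to the left of the chair'?"
    """
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        cx, cy = int(xs.mean()), int(ys.mean())  # region centroid
        # White disc with a black outline keeps the mark legible on any background.
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12], fill="white", outline="black")
        draw.text((cx - 6, cy - 6), str(idx), fill="black")
    return out
```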
InstructDET: Diversifying Referring Object Detection with Generalized Instructions
To encompass common detection expressions, we employ an emerging vision-language model (VLM) and a large language model (LLM) to generate instructions guided by text prompts and object bounding boxes, since these foundation models generalize well enough to produce human-like expressions (e.g., describing an object's properties, category, and relationships).
Collecting Visually-Grounded Dialogue with A Game Of Sorts
We address these concerns by introducing a collaborative image ranking task, a grounded agreement game we call "A Game Of Sorts".
GREC: Generalized Referring Expression Comprehension
This dataset encompasses a range of expressions: those referring to multiple targets, expressions with no specific target, and single-target expressions.