Referring Expression Comprehension

66 papers with code • 7 benchmarks • 7 datasets

Referring Expression Comprehension (REC) is the task of localizing a target object in an image given a natural-language referring expression that describes it.

Elysium: Exploring Object-level Perception in Videos via MLLM

hon-wong/elysium 25 Mar 2024

To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG).

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

open-mmlab/mmdetection 4 Jan 2024

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).

General Object Foundation Model for Images and Videos at Scale

FoundationVision/GLEE 14 Dec 2023

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos.

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

show-han/zeroshot_rec 28 Nov 2023

After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix.
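The triplet-matching step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are random stand-ins for the VLA model's triplet features, and the triplet-to-instance assignment is a hypothetical example.

```python
import numpy as np

# Hypothetical embeddings: 4 visual triplets, 3 textual triplets, dim 8.
# In the paper these would come from a vision-language alignment (VLA) model.
rng = np.random.default_rng(0)
vis = rng.normal(size=(4, 8))   # visual (subject, predicate, object) triplet features
txt = rng.normal(size=(3, 8))   # textual triplet features parsed from the caption

# Structural similarity matrix: cosine similarity between every triplet pair
vis_n = vis / np.linalg.norm(vis, axis=1, keepdims=True)
txt_n = txt / np.linalg.norm(txt, axis=1, keepdims=True)
S = vis_n @ txt_n.T             # shape (4, 3)

# Propagate to an instance-level score: each visual triplet's subject belongs
# to one of 2 candidate instances (hypothetical assignment)
triplet_to_instance = np.array([0, 0, 1, 1])
instance_scores = np.zeros(2)
for inst in range(2):
    mask = triplet_to_instance == inst
    instance_scores[inst] = S[mask].max()   # best match among that instance's triplets

pred = int(instance_scores.argmax())        # index of the grounded instance
```

The instance with the highest propagated score is returned as the grounding result; max-pooling over triplets is one simple propagation choice.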

Continual Referring Expression Comprehension via Dual Modular Memorization

zackschen/DMM 25 Nov 2023

In this paper, we propose Continual Referring Expression Comprehension (CREC), a new setting for REC, where a model is learning on a stream of incoming tasks.

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

jefferyzhan/griffon 24 Nov 2023

More importantly, we present $\textbf{Griffon}$, a purely LVLM-based baseline, which does not require the introduction of any special tokens, expert models, or additional detection modules.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

ux-decoder/segment-everything-everywhere-all-at-once 17 Oct 2023

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.

InstructDET: Diversifying Referring Object Detection with Generalized Instructions

jyfenggogo/instructdet 8 Oct 2023

In order to encompass common detection expressions, we employ an emerging vision-language model (VLM) and a large language model (LLM) to generate instructions guided by text prompts and object bounding boxes, as the generalization ability of foundation models makes them effective at producing human-like expressions (e.g., describing object property, category, and relationship).

Collecting Visually-Grounded Dialogue with A Game Of Sorts

willemsenbram/a-game-of-sorts LREC 2022

We address these concerns by introducing a collaborative image ranking task, a grounded agreement game we call "A Game Of Sorts".

GREC: Generalized Referring Expression Comprehension

henghuiding/grefcoco 30 Aug 2023

This dataset encompasses a range of expressions: those referring to multiple targets, expressions with no specific target, and single-target expressions.
