Generalized Referring Expression Segmentation
9 papers with code • 1 benchmark • 1 dataset
Generalized Referring Expression Segmentation (GRES), introduced by Liu et al. (CVPR 2023), allows expressions that refer to any number of target objects, including none. GRES takes an image and a referring expression as input, and requires predicting a mask for the target object(s).
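Concretely, a GRES model maps an (image, expression) pair to a binary mask, which may cover multiple objects or be empty. A minimal sketch of this interface, with a toy `segment` stub standing in for a real model (the function and its thresholding logic are hypothetical, not any paper's method):

```python
def segment(image, expression):
    """Toy GRES stub: returns a binary mask the same size as `image`.

    A real model would ground the expression in the image; here we just
    mark pixels brighter than a threshold when an expression is given,
    and return an all-zero (empty-target) mask otherwise.
    """
    if not expression.strip():  # no-target expression -> empty mask
        return [[0] * len(row) for row in image]
    return [[1 if px > 0.5 else 0 for px in row] for row in image]

image = [[0.9, 0.1],
         [0.2, 0.8]]  # 2x2 toy "image"
mask = segment(image, "the bright objects")
# the mask may cover several disjoint pixels at once (multi-target)
```

Unlike classic RES, the output is not guaranteed to be a single connected region, and an all-zero mask is a valid answer.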
Most implemented papers
GRES: Generalized Referring Expression Segmentation
Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object.
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation
By decoupling the intricate referring semantics into different granularities with a visual-linguistic hierarchy, and dynamically aggregating them with intra- and inter-selection, CoHD boosts multi-granularity comprehension through the reciprocal benefit of the hierarchical structure.
MAttNet: Modular Attention Network for Referring Expression Comprehension
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Vision-Language Transformer and Query Generation for Referring Segmentation
We introduce transformer and multi-head attention to build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
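The "querying" idea can be sketched as standard scaled dot-product cross-attention in which a language embedding acts as the query and pixel features serve as keys and values (a dependency-free toy sketch under those assumptions, not the actual VLT architecture):

```python
import math

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention: one language query
    vector attends over a list of pixel feature vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of value vectors -> language-conditioned image feature
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

lang_query = [1.0, 0.0]                # toy language embedding
pixel_feats = [[1.0, 0.0], [0.0, 1.0]] # toy per-pixel features
out = cross_attention(lang_query, pixel_feats, pixel_feats)
# attention weights the pixel aligned with the query more heavily
```

The attended output is dominated by pixel features that align with the language query, which is the sense in which the expression "queries" the image.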
CRIS: CLIP-Driven Referring Image Segmentation
In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to irrelevant ones.
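The text-to-pixel contrastive idea can be sketched as a per-pixel binary objective on text-pixel similarity: pull the text feature toward pixels inside the ground-truth mask and push it away from the rest (a simplified, dependency-free sketch of the general idea, not CRIS's exact loss):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def text_to_pixel_contrastive(text_feat, pixel_feats, gt_mask):
    """Mean binary cross-entropy on sigmoid(text . pixel):
    label 1 for pixels inside the ground-truth mask, 0 outside."""
    total = 0.0
    for feat, label in zip(pixel_feats, gt_mask):
        p = 1.0 / (1.0 + math.exp(-dot(text_feat, feat)))  # sigmoid similarity
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(pixel_feats)

text = [1.0, 0.0]
pixels = [[2.0, 0.0], [-2.0, 0.0]]  # one matching, one non-matching pixel
aligned = text_to_pixel_contrastive(text, pixels, [1, 0])
flipped = text_to_pixel_contrastive(text, pixels, [0, 1])
# loss is low when similarity agrees with the mask, high when it disagrees
```

Minimizing such a loss drives the text embedding toward target pixels and away from background, which is the "text-to-pixel" coupling the snippet describes.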
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image.
GSVA: Generalized Segmentation via Multimodal Large Language Models
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to expressions that refer to multiple objects, or to empty targets whose referents are absent from the image.
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges.
Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation
Referring Expression Segmentation (RES) has attracted rising attention, aiming to identify and segment objects based on natural language expressions.