Referring Video Object Segmentation

Referring video object segmentation aims to segment an object in a video given a language expression that describes it. Unlike conventional video object segmentation, the task uses a different type of supervision, a language expression, to identify and segment the referred object in the video.

End-to-End Referring Video Object Segmentation with Multimodal Transformers

mttr2021/MTTR CVPR 2022

Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it.

UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

foundationvision/uniref 25 Dec 2023

We evaluate our unified models on various benchmarks.

Cross-Modal Self-Attention Network for Referring Image Segmentation

lwye/CMSA-Net CVPR 2019

This module controls the information flow of features at different levels.

Language as Queries for Referring Video Object Segmentation

wjn922/referformer CVPR 2022

Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.

Local-Global Context Aware Transformer for Language-Guided Video Segmentation

leonnnop/locater 18 Mar 2022

We explore the task of language-guided video segmentation (LVS).

Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

dzh19990407/lbdt CVPR 2022

Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos.

Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus

lxa9867/R2VOS 4 Jul 2022

Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.

Multi-Attention Network for Compressed Video Referring Object Segmentation

dexianghong/manet 26 Jul 2022

To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module.

VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

henghuiding/Vision-Language-Transformer 28 Oct 2022

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to facilitate deep interactions among multi-modal information and enhance the holistic understanding of vision-language features.

1st Place Solution for YouTubeVOS Challenge 2022: Referring Video Object Segmentation

zhiweihhh/cvpr2022-rvos-challenge 27 Dec 2022

The task of referring video object segmentation aims to segment, in the frames of a given video, the object to which the referring expressions refer.