Referring Video Object Segmentation
41 papers with code • 4 benchmarks • 3 datasets
Referring video object segmentation aims to segment an object in a video given a natural language expression. Unlike conventional video object segmentation, the task exploits a different type of supervision: the language expression alone identifies which object should be segmented throughout the video.
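At the interface level, a referring-VOS system maps a clip and a referring expression to one mask per frame. The sketch below is a minimal, model-agnostic illustration of that contract; the function name and signature are assumptions for illustration, not an API from any of the papers listed here.

```python
# Hypothetical referring-VOS interface: T frames + one expression -> T masks.
from typing import List
import numpy as np

def segment_referred_object(frames: List[np.ndarray], expression: str) -> List[np.ndarray]:
    """Return one boolean mask of shape (H, W) per input frame.

    A concrete model would encode the expression with a text encoder, encode
    the frames with a visual backbone, fuse the two modalities (commonly with
    cross-attention), and decode a per-frame mask while keeping the referred
    object's identity consistent over time.
    """
    raise NotImplementedError  # placeholder: supplied by a concrete model

# Usage: masks = segment_referred_object(frames, "the dog jumping over the fence")
# len(masks) == len(frames); masks[t] selects the referred object's pixels in frame t.
```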
Most implemented papers
End-to-End Referring Video Object Segmentation with Multimodal Transformers
Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it.
LISA: Reasoning Segmentation via Large Language Model
In this work, we propose a new segmentation task -- reasoning segmentation.
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
We evaluate our unified models on various benchmarks.
VISA: Reasoning Video Object Segmentation via Large Language Models
In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
Cross-Modal Self-Attention Network for Referring Image Segmentation
A gated multi-level fusion module controls the information flow of features at different levels.
Language as Queries for Referring Video Object Segmentation
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
We explore the task of language-guided video segmentation (LVS).
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos.
Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.
Multi-Attention Network for Compressed Video Referring Object Segmentation
To segment referred objects directly in the compressed video domain, we propose a multi-attention network that consists of a dual-path dual-attention module and a query-based cross-modal Transformer module.
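Several of the papers above fuse the two modalities with cross-attention in which language tokens query visual features (e.g. the query-based cross-modal Transformer module mentioned in the last entry). The snippet below is a minimal PyTorch sketch of that general mechanism; the class name, dimensions, and fusion details are illustrative assumptions rather than any paper's exact design.

```python
# Illustrative cross-modal attention: language tokens as queries over video features.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L, dim) language embeddings used as queries
        # video_feats: (B, T*H*W, dim) flattened spatio-temporal visual features
        fused, _ = self.attn(query=text_tokens, key=video_feats, value=video_feats)
        return self.norm(text_tokens + fused)  # residual + norm, Transformer-style

# Toy usage: 2 clips, 10 text tokens, 4 frames of 16x16 feature maps.
fusion = CrossModalAttention()
text = torch.randn(2, 10, 256)
video = torch.randn(2, 4 * 16 * 16, 256)
out = fusion(text, video)  # (2, 10, 256): language queries enriched with video context
```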