Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation

Referring video object segmentation (RVOS) is a challenging language-guided video grounding task that requires a comprehensive understanding of the semantic information in both the video content and the language query in order to predict the referred object. However, existing methods perform multi-modal fusion only at a frame-based spatial granularity. This limited visual representation is prone to vision-language mismatches and poor segmentation results. To address this, we propose a novel multi-level representation learning approach that exploits the inherent structure of the video content to provide a set of discriminative visual embeddings, enabling more effective vision-language semantic alignment. Specifically, we embed different visual cues at different granularities: multi-frame long-temporal information at the video level, intra-frame spatial semantics at the frame level, and an enhanced object-aware feature prior at the object level. With this powerful multi-level visual embedding and a carefully designed dynamic alignment, our model generates a robust representation for accurate video object segmentation. Extensive experiments on Refer-DAVIS17 and Refer-YouTube-VOS demonstrate that our model achieves superior performance in both segmentation accuracy and inference speed.
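The abstract describes a three-granularity visual embedding (video, frame, object) aligned with the language query. The following is a minimal PyTorch-style sketch of that general idea only; every module name, shape, and the cosine-similarity alignment below are illustrative assumptions, not the authors' actual architecture.

```python
# Illustrative sketch of multi-level visual embedding plus a simple
# vision-language alignment. All names and shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelEmbedding(nn.Module):
    """Embeds visual cues at three granularities: video, frame, object."""

    def __init__(self, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)          # long-temporal context
        self.frame_proj = nn.Conv2d(dim, dim, 1)       # intra-frame semantics
        self.object_proj = nn.Linear(dim, dim)         # object-aware prior

    def forward(self, frame_feats, object_feats):
        # frame_feats: (T, C, H, W) per-frame feature maps
        # object_feats: (T, N, C) per-frame object embeddings
        video_emb = self.video_proj(frame_feats.mean(dim=(0, 2, 3)))  # (C,)
        frame_emb = self.frame_proj(frame_feats)                      # (T, C, H, W)
        object_emb = self.object_proj(object_feats)                   # (T, N, C)
        return video_emb, frame_emb, object_emb


def alignment_map(frame_emb, text_emb):
    """Score each spatial location against the sentence-level query."""
    t, c, h, w = frame_emb.shape
    visual = F.normalize(frame_emb.flatten(2), dim=1)   # (T, C, H*W)
    query = F.normalize(text_emb, dim=0).view(1, c, 1)  # (1, C, 1)
    sim = (visual * query).sum(dim=1).view(t, 1, h, w)  # cosine similarity
    return sim  # high response where vision and language agree


if __name__ == "__main__":
    T, N, C, H, W = 4, 5, 256, 32, 32
    model = MultiLevelEmbedding(dim=C)
    video_emb, frame_emb, object_emb = model(
        torch.randn(T, C, H, W), torch.randn(T, N, C))
    print(alignment_map(frame_emb, torch.randn(C)).shape)
    # torch.Size([4, 1, 32, 32])
```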


Datasets

Refer-DAVIS17, Refer-YouTube-VOS

Results

Task                               | Dataset                                    | Model  | Metric | Value | Global Rank
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | MLRLSA | J&F    | 49.70 | #23
                                   |                                            |        | J      | 50.96 | #21
                                   |                                            |        | F      | 48.43 | #24
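For context on the metrics: J is region similarity (Jaccard/IoU between predicted and ground-truth masks), F is contour accuracy (a boundary F-measure), and J&F is their mean, consistent with the table ((50.96 + 48.43) / 2 ≈ 49.70). Below is a minimal NumPy sketch; the boundary matching is a simplified exact-overlap version, whereas the official benchmark allows a small pixel tolerance.

```python
# Hedged sketch of the DAVIS-style metrics behind the table.
import numpy as np


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection over union of binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0


def boundary(mask: np.ndarray) -> np.ndarray:
    """Pixels on the mask border (4-neighbourhood erosion, simplified)."""
    padded = np.pad(mask, 1)
    eroded = (padded[:-2, 1:-1] & padded[2:, 1:-1]
              & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~eroded


def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Contour accuracy F: F-measure between predicted and GT boundaries."""
    bp, bg = boundary(pred), boundary(gt)
    precision = (bp & bg).sum() / max(bp.sum(), 1)
    recall = (bp & bg).sum() / max(bg.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy masks: two offset squares.
pred = np.zeros((16, 16), bool); pred[4:12, 4:12] = True
gt = np.zeros((16, 16), bool); gt[5:13, 5:13] = True
j, f = jaccard(pred, gt), boundary_f(pred, gt)
print(f"J={j:.4f}  F={f:.4f}  J&F={(j + f) / 2:.4f}")
```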
