Hierarchical interaction network for video object segmentation from referring expressions

In this paper, we investigate the problem of video object segmentation from referring expressions (VOSRE). Conventional methods typically perform multi-modal fusion between linguistic features and the visual features extracted from the top layer of the visual encoder, which limits these models' ability to represent multi-modal inputs at different levels of semantic and spatial granularity. To address this issue, we present an end-to-end hierarchical interaction network (HINet) for the VOSRE problem. Our model leverages the feature pyramid produced by the visual encoder to generate multiple levels of multi-modal features, allowing various linguistic concepts (e.g., object attributes and categories) to be represented flexibly at different levels of the multi-modal features. Moreover, we extract signals of moving objects from the optical flow input and use them as complementary cues: a motion gating mechanism highlights the referent and suppresses the background. In contrast to previous methods, this strategy allows our model to make online predictions without requiring the whole video as input. Despite its simplicity, the proposed HINet improves over the previous state of the art on the DAVIS-16, DAVIS-17, and J-HMDB datasets for the VOSRE task, demonstrating its effectiveness and generality.
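
The abstract describes an architecture rather than providing code, so below is a minimal PyTorch-style sketch of the two ideas it highlights: fusing a sentence embedding with every level of the visual feature pyramid, and gating the fused features with a motion cue derived from optical flow. The module layout, the dimensions, and the specific fusion and gating operators (1x1 projections plus a sigmoid gate) are illustrative assumptions, not the authors' HINet implementation.

```python
# Illustrative sketch only; layer choices and shapes are assumptions,
# not the authors' HINet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalFusionSketch(nn.Module):
    """Fuse a sentence embedding with every level of a visual feature
    pyramid, then gate the fused features with a motion cue from optical flow."""

    def __init__(self, pyramid_channels=(256, 512, 1024, 2048),
                 lang_dim=768, fused_dim=256, flow_dim=64):
        super().__init__()
        # One 1x1 projection per pyramid level, applied after concatenating
        # the (spatially broadcast) language vector with the visual features.
        self.fuse = nn.ModuleList(
            nn.Conv2d(c + lang_dim, fused_dim, kernel_size=1)
            for c in pyramid_channels
        )
        # Motion gating: map optical-flow features to a [0, 1] mask that
        # emphasizes moving regions and suppresses static background.
        self.motion_gate = nn.Conv2d(flow_dim, 1, kernel_size=1)
        # Simple prediction head on the merged multi-level features.
        self.head = nn.Conv2d(fused_dim, 1, kernel_size=1)

    def forward(self, pyramid, lang_feat, flow_feat):
        # pyramid:   list of (B, C_l, H_l, W_l) maps, finest level first
        # lang_feat: (B, lang_dim) sentence embedding
        # flow_feat: (B, flow_dim, H, W) features computed from optical flow
        fused_levels = []
        for level, proj in zip(pyramid, self.fuse):
            b, _, h, w = level.shape
            lang_map = lang_feat[:, :, None, None].expand(b, -1, h, w)
            fused = F.relu(proj(torch.cat([level, lang_map], dim=1)))
            # Resize every level to the finest level's resolution before merging.
            fused_levels.append(
                F.interpolate(fused, size=pyramid[0].shape[-2:],
                              mode="bilinear", align_corners=False)
            )
        merged = torch.stack(fused_levels, dim=0).sum(dim=0)

        gate = torch.sigmoid(self.motion_gate(
            F.interpolate(flow_feat, size=merged.shape[-2:],
                          mode="bilinear", align_corners=False)
        ))
        gated = merged * gate            # highlight the moving referent
        return self.head(gated)          # (B, 1, H0, W0) segmentation logits
```

In this sketch the pyramid would come from a standard backbone (e.g., a ResNet) and the flow features from any off-the-shelf optical-flow network; the gate simply rescales the fused features so that static background regions contribute less to the final prediction. Because each frame only needs its own flow, the model can run online without seeing the whole video.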

Results from the Paper


Ranked #1 on Referring Expression Segmentation on J-HMDB (Precision@0.9 metric).

Task: Referring Expression Segmentation. Each group below lists a dataset and model, followed by rows of metric name, metric value, and global leaderboard rank.

A2D Sentences, RefVOS
    Precision@0.5    0.578    # 19
    Precision@0.6    0.534    # 17
    Precision@0.7    0.456    # 17
    Precision@0.8    0.311    # 16
    Precision@0.9    0.093    # 14
    IoU overall      0.672    # 12
    IoU mean         0.497    # 20

A2D Sentences, HINet
    Precision@0.5    0.611    # 16
    Precision@0.6    0.559    # 16
    Precision@0.7    0.486    # 15
    Precision@0.8    0.342    # 12
    Precision@0.9    0.12     # 11
    IoU overall      0.679    # 10
    IoU mean         0.529    # 17

DAVIS 2017 (val), HINet
    J&F 1st frame    50.2     # 7
    J&F Full video   47.9     # 2

J-HMDB, RefVOS
    Precision@0.5    0.731    # 15
    Precision@0.6    0.62     # 14
    Precision@0.7    0.392    # 9
    Precision@0.8    0.088    # 10
    Precision@0.9    0.0      # 11
    IoU overall      0.606    # 11
    IoU mean         0.568    # 16

J-HMDB, HINet
    Precision@0.5    0.819    # 8
    Precision@0.6    0.736    # 8
    Precision@0.7    0.542    # 8
    Precision@0.8    0.168    # 5
    Precision@0.9    0.4      # 1
    IoU overall      0.652    # 7
    IoU mean         0.627    # 8
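
For reference, the metrics above are the standard ones for referring segmentation benchmarks: Precision@K is the fraction of test samples whose predicted mask has an IoU with the ground truth above the threshold K, IoU overall pools intersection and union over the whole test set, and IoU mean averages the per-sample IoU. The NumPy sketch below shows one way to compute them; the function and variable names are ours, not taken from the paper or any benchmark toolkit.

```python
# Hedged sketch of the standard metrics used above; names are illustrative.
import numpy as np


def segmentation_metrics(pred_masks, gt_masks,
                         thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """pred_masks, gt_masks: lists of boolean arrays with matching shapes."""
    inters, unions, per_sample_iou = [], [], []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inters.append(inter)
        unions.append(union)
        # Convention: an empty prediction of an empty ground truth counts as IoU 1.
        per_sample_iou.append(inter / union if union > 0 else 1.0)

    per_sample_iou = np.array(per_sample_iou)
    results = {
        # "IoU overall": total intersection over total union across the test set.
        "overall_iou": float(np.sum(inters) / np.sum(unions)),
        # "IoU mean": average of the per-sample IoU values.
        "mean_iou": float(per_sample_iou.mean()),
    }
    # Precision@K: fraction of samples whose IoU exceeds the threshold K.
    for k in thresholds:
        results[f"precision@{k}"] = float((per_sample_iou > k).mean())
    return results
```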
