Polar Relative Positional Encoding for Video-Language Segmentation

20 Jul 2020 · Ke Ning, Lingxi Xie, Fei Wu, Qi Tian

In this paper, we tackle a challenging task named video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in the video frames. To denote a target object accurately, the given sentence usually refers to multiple attributes, such as nearby objects and their spatial relations. We propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a "linguistic" way, i.e., in terms of direction and range. Sentence features can then interact with the positional embeddings more directly to extract the implied relative positional relations. We also propose parameterized functions for these positional embeddings so that they adapt to real-valued directions and ranges. With PRPE, we design a Polar Attention Module (PAM) as the basic module for vision-language fusion. Our method outperforms the previous best method by a large margin, an 11.4% absolute improvement in mAP, on the challenging A2D Sentences dataset. It also achieves competitive performance on the J-HMDB Sentences dataset.
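
As a rough illustration of what "direction and range" means for positions on a feature map, the sketch below converts the Cartesian offset between every pair of grid cells into a polar pair. It is not the authors' implementation: the function name `polar_relative_positions`, the dictionary layout, and the choice of radians and grid-cell units are assumptions made for this example, and the learned, parameterized embedding functions mentioned in the abstract are not modeled.

```python
import math


def polar_relative_positions(height, width):
    """Map every (query, key) cell pair on a height x width grid to the
    polar offset of the key as seen from the query.

    Hypothetical helper for illustration only.
    Returns a dict: ((qy, qx), (ky, kx)) -> (direction, range), where
      direction is math.atan2(dy, dx) in radians, in [-pi, pi];
      range is the Euclidean distance in grid cells.
    Image convention: y grows downward, so a positive direction means
    the key lies below the query.
    """
    cells = [(y, x) for y in range(height) for x in range(width)]
    table = {}
    for qy, qx in cells:
        for ky, kx in cells:
            dy, dx = ky - qy, kx - qx
            direction = math.atan2(dy, dx)  # atan2(0, 0) is 0.0 for a cell paired with itself
            distance = math.hypot(dx, dy)
            table[(qy, qx), (ky, kx)] = (direction, distance)
    return table


# Usage: on a 2 x 2 grid, the cell to the right of (0, 0) has
# direction 0 and range 1; the cell directly below has direction pi/2.
offsets = polar_relative_positions(2, 2)
right = offsets[(0, 0), (0, 1)]
below = offsets[(0, 0), (1, 0)]
```

A phrase such as "to the left of" or "near" corresponds more naturally to a direction or a range than to a raw (dx, dy) offset, which is the motivation the abstract gives for the polar form.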

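Task                               Dataset        Model  Metric Name    Metric Value  Global Rank
Referring Expression Segmentation  A2D Sentences  PRPE   Precision@0.5  0.634         # 9
Referring Expression Segmentation  A2D Sentences  PRPE   Precision@0.9  0.083         # 10
Referring Expression Segmentation  A2D Sentences  PRPE   IoU overall    0.661         # 8
Referring Expression Segmentation  A2D Sentences  PRPE   IoU mean       0.529         # 11
Referring Expression Segmentation  A2D Sentences  PRPE   Precision@0.6  0.579         # 9
Referring Expression Segmentation  A2D Sentences  PRPE   Precision@0.7  0.483         # 10
Referring Expression Segmentation  A2D Sentences  PRPE   Precision@0.8  0.322         # 9
Referring Expression Segmentation  A2D Sentences  PRPE   AP             0.388         # 8
Referring Expression Segmentation  J-HMDB         PRPE   Precision@0.5  0.572         # 18
Referring Expression Segmentation  J-HMDB         PRPE   Precision@0.6  0.690         # 6
Referring Expression Segmentation  J-HMDB         PRPE   Precision@0.7  0.319         # 11
Referring Expression Segmentation  J-HMDB         PRPE   Precision@0.8  0.06          # 10
Referring Expression Segmentation  J-HMDB         PRPE   Precision@0.9  0.001         # 4
Referring Expression Segmentation  J-HMDB         PRPE   AP             0.294         # 8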

