Polar Relative Positional Encoding for Video-Language Segmentation
In this paper, we tackle a challenging task named video-language segmentation: given a video and a natural-language sentence, the goal is to segment the object or actor described by the sentence in the video frames. To denote a target object unambiguously, the sentence usually refers to multiple attributes, such as nearby objects and their spatial relations to the target. We propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a "linguistic" way, i.e., in terms of direction and range. With this encoding, sentence features can interact with the positional embeddings more directly to extract the relative positional relations implied by the sentence. We also propose parameterized functions for these positional embeddings so that they adapt to real-valued directions and ranges. Building on PRPE, we design a Polar Attention Module (PAM) as the basic module for vision-language fusion. Our method outperforms the previous best method by a large margin of 11.4% absolute improvement in mAP on the challenging A2D Sentences dataset, and it also achieves competitive performance on the J-HMDB Sentences dataset.
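
To make the "direction and range" idea concrete, the following is a minimal sketch of how polar relative positions between a query location and every cell of a feature grid could be computed. It is an illustration only, not the authors' implementation: the function name `polar_relative_positions`, the grid-unit distance, and the use of image coordinates (y growing downward) are all assumptions, and the learned parameterized embedding functions mentioned in the abstract are not modeled here.

```python
import math


def polar_relative_positions(height, width, query):
    """Return (directions, ranges) of every grid cell relative to `query`.

    Hypothetical helper for illustration.
    - directions[y][x]: angle in radians from math.atan2(dy, dx); because
      y grows downward in image coordinates, a positive angle points down.
    - ranges[y][x]: Euclidean distance from the query cell, in grid units.
    """
    qy, qx = query
    directions, ranges = [], []
    for y in range(height):
        dir_row, rng_row = [], []
        for x in range(width):
            dy, dx = y - qy, x - qx
            dir_row.append(math.atan2(dy, dx))  # direction of the offset
            rng_row.append(math.hypot(dx, dy))  # length of the offset
        directions.append(dir_row)
        ranges.append(rng_row)
    return directions, ranges


# Relative positions of a 3x3 grid with respect to its center cell.
directions, ranges = polar_relative_positions(3, 3, query=(1, 1))
```

In a full model, real-valued directions and ranges like these would be mapped to embeddings that the sentence feature can attend to; that mapping is the learned part and is left out of this sketch.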
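| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | A2D Sentences | PRPE | Precision@0.5 | 0.634 | # 14 |
| Referring Expression Segmentation | A2D Sentences | PRPE | Precision@0.6 | 0.579 | # 14 |
| Referring Expression Segmentation | A2D Sentences | PRPE | Precision@0.7 | 0.483 | # 15 |
| Referring Expression Segmentation | A2D Sentences | PRPE | Precision@0.8 | 0.322 | # 14 |
| Referring Expression Segmentation | A2D Sentences | PRPE | Precision@0.9 | 0.083 | # 15 |
| Referring Expression Segmentation | A2D Sentences | PRPE | AP | 0.388 | # 13 |
| Referring Expression Segmentation | A2D Sentences | PRPE | IoU overall | 0.661 | # 13 |
| Referring Expression Segmentation | A2D Sentences | PRPE | IoU mean | 0.529 | # 16 |
| Referring Expression Segmentation | J-HMDB | PRPE | Precision@0.5 | 0.572 | # 20 |
| Referring Expression Segmentation | J-HMDB | PRPE | Precision@0.6 | 0.690 | # 8 |
| Referring Expression Segmentation | J-HMDB | PRPE | Precision@0.7 | 0.319 | # 13 |
| Referring Expression Segmentation | J-HMDB | PRPE | Precision@0.8 | 0.06 | # 12 |
| Referring Expression Segmentation | J-HMDB | PRPE | Precision@0.9 | 0.001 | # 4 |
| Referring Expression Segmentation | J-HMDB | PRPE | AP | 0.294 | # 10 |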
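
For readers unfamiliar with the metric names in the table, here is a hedged sketch of how they are commonly computed for referring segmentation. The definitions are assumptions about the usual protocol, not taken from this page: Precision@K is the fraction of samples whose IoU exceeds K, overall IoU is total intersection over total union across all samples, mean IoU averages the per-sample IoU, and AP averages precision over thresholds 0.50 to 0.95 in steps of 0.05. The function name and the input format are hypothetical.

```python
def segmentation_metrics(pairs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute common referring-segmentation metrics.

    `pairs` is a list of (intersection_pixels, union_pixels), one per sample
    (a hypothetical input format chosen for this sketch).
    """
    # Per-sample IoU; an empty union is treated as IoU 0 (assumption).
    ious = [i / u if u > 0 else 0.0 for i, u in pairs]
    n = len(ious)

    def precision_at(t):
        # Fraction of samples whose IoU strictly exceeds the threshold.
        return sum(iou > t for iou in ious) / n

    # Overall IoU pools pixels over the whole set; mean IoU averages samples.
    overall_iou = sum(i for i, _ in pairs) / sum(u for _, u in pairs)
    mean_iou = sum(ious) / n

    # AP: mean precision over thresholds 0.50, 0.55, ..., 0.95.
    ap_thresholds = [(50 + 5 * k) / 100 for k in range(10)]
    ap = sum(precision_at(t) for t in ap_thresholds) / len(ap_thresholds)

    return {
        "precision": {t: precision_at(t) for t in thresholds},
        "overall_iou": overall_iou,
        "mean_iou": mean_iou,
        "ap": ap,
    }


# Three toy samples given as (intersection, union) pixel counts.
metrics = segmentation_metrics([(96, 100), (62, 100), (10, 200)])
```

Because overall IoU pools pixels across samples, large objects dominate it, while mean IoU weights every sample equally; this is why the two values in the table can differ noticeably.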