PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

In this work, instead of directly predicting pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can later be converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.
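The abstract notes that predicted polygons can later be converted into segmentation masks. The following is a minimal sketch of one way to do that conversion, not the authors' implementation: it rasterizes a single polygon with the even-odd rule, testing each pixel center against the polygon edges. The function name `polygon_to_mask`, the flat `[x1, y1, x2, y2, ...]` input layout, and the pixel-center convention are assumptions made for illustration.

```python
import numpy as np


def polygon_to_mask(flat_coords, height, width):
    """Rasterize one polygon into a boolean mask (even-odd rule).

    flat_coords: hypothetical flat sequence [x1, y1, x2, y2, ...] of
    floating-point vertex coordinates in pixel units.
    A pixel counts as inside when its center (col + 0.5, row + 0.5) is inside.
    """
    poly = np.asarray(flat_coords, dtype=np.float64).reshape(-1, 2)
    rows, cols = np.mgrid[0:height, 0:width]
    px = cols + 0.5  # x coordinate of each pixel center
    py = rows + 0.5  # y coordinate of each pixel center

    inside = np.zeros((height, width), dtype=bool)
    x0, y0 = poly[:, 0], poly[:, 1]
    x1, y1 = np.roll(x0, -1), np.roll(y0, -1)  # next vertex, wrapping around

    for ax, ay, bx, by in zip(x0, y0, x1, y1):
        if ay == by:
            continue  # a horizontal edge never crosses a horizontal ray
        # Does the edge straddle the horizontal line through the pixel center?
        crosses = (ay > py) != (by > py)
        # x position where the edge meets that horizontal line
        x_at_y = ax + (py - ay) * (bx - ax) / (by - ay)
        # Toggle pixels whose rightward ray crosses this edge
        inside ^= crosses & (px < x_at_y)
    return inside


# Axis-aligned rectangle from (1, 1) to (5, 4) on an 8-wide, 6-high canvas
mask = polygon_to_mask([1.0, 1.0, 5.0, 1.0, 5.0, 4.0, 1.0, 4.0], height=6, width=8)
```

Because the vertices stay in floating point until this final step, any rounding happens only once, at rasterization time, rather than when the coordinates are predicted.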

CVPR 2023

Results from the Paper


 Ranked #1 on Referring Expression Segmentation on ReferIt (using extra training data)

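| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | DAVIS 2017 (val) | PolyFormer-B | J&F 1st frame | 60.9 | # 8 |
| Referring Expression Segmentation | DAVIS 2017 (val) | PolyFormer-B | Zero-Shot Transfer | true | # 1 |
| Referring Expression Comprehension | RefCoco+ | PolyFormer-L | Val | 84.98 | # 6 |
| Referring Expression Comprehension | RefCoco+ | PolyFormer-L | Test A | 89.77 | # 5 |
| Referring Expression Comprehension | RefCoco+ | PolyFormer-L | Test B | 77.97 | # 6 |
| Referring Expression Comprehension | RefCoco+ | PolyFormer-B | Val | 83.73 | # 7 |
| Referring Expression Comprehension | RefCoco+ | PolyFormer-B | Test A | 88.6 | # 7 |
| Referring Expression Comprehension | RefCoco+ | PolyFormer-B | Test B | 76.38 | # 7 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Val | 90.38 | # 8 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Test A | 92.89 | # 6 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Test B | 87.16 | # 6 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Val | 89.73 | # 9 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Test A | 91.73 | # 8 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Test B | 86.03 | # 7 |
| Referring Expression Comprehension | RefCOCOg-test | PolyFormer-L | Accuracy | 85.91 | # 7 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-B | Overall IoU | 69.05 | # 8 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-B | Mean IoU | 69.88 | # 2 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-L | Overall IoU | 70.19 | # 7 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-L | Mean IoU | 71.17 | # 1 |
| Referring Expression Comprehension | RefCOCOg-test | PolyFormer-B | Accuracy | 84.96 | # 8 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-L | Overall IoU | 69.2 | # 8 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-L | Mean IoU | 71.15 | # 1 |
| Referring Expression Comprehension | RefCOCOg-val | PolyFormer-B | Accuracy | 84.46 | # 9 |
| Referring Expression Comprehension | RefCOCOg-val | PolyFormer-L | Accuracy | 85.83 | # 8 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-B | Overall IoU | 67.76 | # 9 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-B | Mean IoU | 69.36 | # 2 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-B | Overall IoU | 72.89 | # 9 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-B | Mean IoU | 74.51 | # 2 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-L | Overall IoU | 74.56 | # 7 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-L | Mean IoU | 75.71 | # 1 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-L | Overall IoU | 61.87 | # 8 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-L | Mean IoU | 66.73 | # 1 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-B | Overall IoU | 59.33 | # 9 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-B | Mean IoU | 64.64 | # 2 |
| Referring Expression Segmentation | RefCoCo val | PolyFormer-B | Overall IoU | 74.82 | # 12 |
| Referring Expression Segmentation | RefCoCo val | PolyFormer-L | Overall IoU | 75.96 | # 10 |
| Referring Expression Segmentation | RefCoCo val | PolyFormer-L | Mean IoU | 76.94 | # 1 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-L | Overall IoU | 69.33 | # 10 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-L | Mean IoU | 72.15 | # 1 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-B | Overall IoU | 67.64 | # 11 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-B | Mean IoU | 70.65 | # 2 |
| Referring Expression Segmentation | ReferIt | PolyFormer-L | Overall IoU | 72.6 | # 1 |
| Referring Expression Segmentation | ReferIt | PolyFormer-L | Mean IoU | 67.22 | # 1 |
| Referring Expression Segmentation | ReferIt | PolyFormer-B | Overall IoU | 71.91 | # 2 |
| Referring Expression Segmentation | ReferIt | PolyFormer-B | Mean IoU | 65.98 | # 2 |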
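The segmentation rows above report both Overall IoU and Mean IoU. As a hedged illustration, the sketch below computes the two under the definitions commonly used for referring segmentation: Overall IoU divides the intersection summed over all samples by the union summed over all samples, while Mean IoU averages the per-sample IoU. That this benchmark follows exactly these definitions is an assumption, and the function name `overall_and_mean_iou` is hypothetical.

```python
import numpy as np


def overall_and_mean_iou(pred_masks, gt_masks):
    """Return (overall IoU, mean IoU) for paired lists of binary masks."""
    total_inter = 0
    total_union = 0
    per_sample = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        total_inter += inter
        total_union += union
        # Assumption: an empty union scores 0 rather than being skipped
        per_sample.append(inter / union if union > 0 else 0.0)
    if not per_sample:
        return 0.0, 0.0
    overall = total_inter / total_union if total_union > 0 else 0.0
    return overall, float(np.mean(per_sample))


# Small object predicted perfectly, large object predicted with half overlap
small = np.ones((2, 2), dtype=bool)
large_pred = np.zeros((10, 20), dtype=bool)
large_pred[:, :10] = True
large_gt = np.zeros((10, 20), dtype=bool)
large_gt[:, 5:15] = True

overall, mean = overall_and_mean_iou([small, large_pred], [small, large_gt])
```

In this toy case the large object dominates the overall figure (54 / 154) while the mean weights both samples equally ((1 + 1/3) / 2), which is one reason the two metrics can rank models differently.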

Methods