PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.
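
The abstract notes that predicted polygons can later be converted into segmentation masks. The following is a minimal sketch of one way to do that conversion, not the authors' implementation: it rasterizes a single polygon with the even-odd (ray casting) rule, testing pixel centers. The function name `polygon_to_mask`, the `(x, y)` vertex layout, and the pixel-center convention are assumptions made for illustration.

```python
import numpy as np


def polygon_to_mask(vertices, height, width):
    """Rasterize one polygon into a boolean mask (hypothetical helper).

    vertices: sequence of (x, y) floating-point pixel coordinates.
    A pixel is marked inside when its center lies inside the polygon
    according to the even-odd rule.
    """
    pts = np.asarray(vertices, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinate of each pixel center
    py = ys + 0.5  # y coordinate of each pixel center

    inside = np.zeros((height, width), dtype=bool)
    n = len(pts)
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]  # wrap around to close the polygon
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        # Does the edge span the horizontal line through the pixel center?
        crosses = (y1 > py) != (y2 > py)
        # x position where the edge meets that horizontal line
        x_at_y = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        # Toggle pixels whose rightward ray crosses this edge
        inside ^= crosses & (px < x_at_y)
    return inside


# Usage: a 3x3 square with floating-point vertices inside a 6x6 image
square = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]
mask = polygon_to_mask(square, height=6, width=6)
print(int(mask.sum()))  # 9 pixels have their centers inside the square
```

Because the vertices stay in floating point until this final step, any rounding happens only once, at rasterization. In practice a library rasterizer would likely replace the explicit loop.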

CVPR 2023

Results from the Paper


 Ranked #1 on Referring Expression Segmentation on ReferIt (using extra training data)

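| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | DAVIS 2017 (val) | PolyFormer | J&F 1st frame | 61.5 | #4 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Val | 90.38 | #4 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Test A | 92.89 | #4 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-L | Test B | 87.16 | #4 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Val | 89.73 | #5 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Test A | 91.73 | #5 |
| Referring Expression Comprehension | RefCOCO | PolyFormer-B | Test B | 86.03 | #5 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-B | Overall IoU | 69.05 | #4 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-B | Mean IoU | 69.88 | #2 |
| Referring Expression Comprehension | RefCOCOg-test | PolyFormer-B | Accuracy | 84.96 | #5 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-L | Overall IoU | 70.19 | #3 |
| Referring Expression Segmentation | RefCOCOg-test | PolyFormer-L | Mean IoU | 71.17 | #1 |
| Referring Expression Comprehension | RefCOCOg-test | PolyFormer-L | Accuracy | 85.91 | #4 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-L | Overall IoU | 69.2 | #3 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-L | Mean IoU | 71.15 | #1 |
| Referring Expression Comprehension | RefCOCOg-val | PolyFormer-B | Accuracy | 84.46 | #5 |
| Referring Expression Comprehension | RefCOCOg-val | PolyFormer-L | Accuracy | 85.83 | #4 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-B | Overall IoU | 67.76 | #4 |
| Referring Expression Segmentation | RefCOCOg-val | PolyFormer-B | Mean IoU | 69.36 | #2 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-B | Overall IoU | 72.89 | #5 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-B | Mean IoU | 74.51 | #2 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-L | Overall IoU | 74.56 | #4 |
| Referring Expression Segmentation | RefCOCO+ testA | PolyFormer-L | Mean IoU | 75.71 | #1 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-L | Overall IoU | 61.87 | #4 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-L | Mean IoU | 66.73 | #1 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-B | Overall IoU | 59.33 | #5 |
| Referring Expression Segmentation | RefCOCO+ test B | PolyFormer-B | Mean IoU | 64.64 | #2 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-B | Overall IoU | 67.64 | #5 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-B | Mean IoU | 70.65 | #2 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-L | Overall IoU | 69.33 | #4 |
| Referring Expression Segmentation | RefCOCO+ val | PolyFormer-L | Mean IoU | 72.15 | #1 |
| Referring Expression Segmentation | ReferIt | PolyFormer-L | Overall IoU | 72.6 | #1 |
| Referring Expression Segmentation | ReferIt | PolyFormer-L | Mean IoU | 67.22 | #1 |
| Referring Expression Segmentation | ReferIt | PolyFormer-B | Overall IoU | 71.91 | #2 |
| Referring Expression Segmentation | ReferIt | PolyFormer-B | Mean IoU | 65.98 | #2 |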
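
The segmentation results above report both Overall IoU and Mean IoU, and the two can differ noticeably for the same model. The sketch below shows how these metrics are commonly defined in referring segmentation work: Overall IoU divides the intersection summed over all samples by the union summed over all samples, while Mean IoU averages the per-sample IoU. The function name `overall_and_mean_iou` and the handling of empty unions are assumptions for illustration; the exact evaluation code behind the reported numbers is not shown in this page.

```python
import numpy as np


def overall_and_mean_iou(preds, gts):
    """Return (overall IoU, mean IoU) for paired boolean masks.

    preds, gts: sequences of same-shaped arrays, one pair per sample.
    Assumption: a sample whose union is empty counts as IoU 1.0.
    """
    total_inter = 0
    total_union = 0
    per_sample = []
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        total_inter += inter
        total_union += union
        per_sample.append(inter / union if union > 0 else 1.0)
    overall = total_inter / total_union if total_union > 0 else 1.0
    return overall, float(np.mean(per_sample))


# Usage: a small object segmented perfectly, a large one half covered
small_gt = np.zeros((5, 5), dtype=bool)
small_gt[:2, :2] = True            # 4 pixels
small_pred = small_gt.copy()       # perfect prediction, IoU = 1.0

large_gt = np.zeros((5, 5), dtype=bool)
large_gt[:4, :] = True             # 20 pixels
large_pred = np.zeros((5, 5), dtype=bool)
large_pred[:2, :] = True           # 10 pixels, all inside large_gt, IoU = 0.5

overall, mean = overall_and_mean_iou([small_pred, large_pred],
                                     [small_gt, large_gt])
print(round(overall, 4), mean)     # 0.5833 0.75
```

In this toy case Overall IoU is 14/24 while Mean IoU is 0.75: Overall IoU weights large objects more heavily, whereas Mean IoU treats every sample equally.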

Methods