Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation

In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. Our motivation is that the label of a pixel is the category of the object that the pixel belongs to. We present a simple yet effective approach, object-contextual representations, characterizing a pixel by exploiting the representation of the corresponding object class. First, we learn object regions under the supervision of ground-truth segmentation. Second, we compute the object region representation by aggregating the representations of the pixels lying in the object region. Last, we compute the relation between each pixel and each object region, and augment the representation of each pixel with the object-contextual representation, which is a weighted aggregation of all the object region representations according to their relations with the pixel. We empirically demonstrate that the proposed approach achieves competitive performance on various challenging semantic segmentation benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff. Our submission "HRNet + OCR + SegFix" achieved first place on the Cityscapes leaderboard at the time of submission. Code is available at: https://git.io/openseg and https://git.io/HRNet.OCR. We rephrase the object-contextual representation scheme using the Transformer encoder-decoder framework; the details are presented in Section 3.3.
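The three steps above can be sketched in a few lines of numpy. This is a rough illustration under simplifying assumptions, not the paper's implementation: the actual OCR module applies learned 1x1-conv transforms to the pixel, region, and context features, whereas here raw features and plain dot-product similarity stand in for them, and the toy shapes (6 pixels, 4 channels, 3 classes) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: N pixels with C-dim features, K object classes (all illustrative).
N, C, K = 6, 4, 3
rng = np.random.default_rng(0)
pixel_feats = rng.standard_normal((N, C))    # per-pixel representations
region_logits = rng.standard_normal((N, K))  # coarse soft object-region scores
                                             # (supervised by ground truth in training)

# Steps 1-2: soft object regions, then one representation per region
# as a weighted average of the pixel representations lying in it.
region_weights = softmax(region_logits, axis=0)  # (N, K); each column sums to 1
region_feats = region_weights.T @ pixel_feats    # (K, C)

# Step 3: pixel-region relation (softmax-normalized similarity), then the
# object-contextual representation as a relation-weighted sum of region features.
relation = softmax(pixel_feats @ region_feats.T, axis=1)  # (N, K)
ocr_feats = relation @ region_feats                       # (N, C)

# Augment each pixel representation with its object context.
augmented = np.concatenate([pixel_feats, ocr_feats], axis=1)  # (N, 2C)
```

The final concatenated features feed the classification head, so each pixel is labeled using both its own representation and the representation of the object classes it relates to.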

ECCV 2020

Results from the Paper


 Ranked #1 on Semantic Segmentation on Cityscapes test (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K val | OCR (ResNet-101) | mIoU | 45.28 | #54 |
| Semantic Segmentation | ADE20K val | OCR (HRNetV2-W48) | mIoU | 45.66 | #52 |
| Semantic Segmentation | ADE20K val | HRNetV2 + OCR + RMI (PaddleClas pretrained) | mIoU | 47.98 | #39 |
| Semantic Segmentation | Cityscapes test | HRNetV2 + OCR (w/ ASP) | Mean IoU (class) | 83.7% | #5 |
| Semantic Segmentation | Cityscapes test | OCR (ResNet-101, coarse) | Mean IoU (class) | 82.4% | #19 |
| Semantic Segmentation | Cityscapes test | OCR (ResNet-101) | Mean IoU (class) | 81.8% | #26 |
| Semantic Segmentation | Cityscapes test | HRNetV2 + OCR + SegFix | Mean IoU (class) | 84.5% | #1 |
| Semantic Segmentation | Cityscapes test | OCR (HRNetV2-W48, coarse) | Mean IoU (class) | 83.0% | #12 |
| Semantic Segmentation | Cityscapes val | HRNetV2 + OCR + RMI (PaddleClas pretrained) | mIoU | 83.6 | #6 |
| Semantic Segmentation | Cityscapes val | OCR (ResNet-101-FCN) | mIoU | 80.6 | #20 |
| Semantic Segmentation | COCO-Stuff test | OCR (ResNet-101) | mIoU | 39.5% | #10 |
| Semantic Segmentation | COCO-Stuff test | OCR (HRNetV2-W48) | mIoU | 40.5% | #6 |
| Semantic Segmentation | COCO-Stuff test | HRNetV2 + OCR + RMI (PaddleClas pretrained) | mIoU | 45.2% | #3 |
| Semantic Segmentation | LIP val | OCR (HRNetV2-W48) | mIoU | 56.65% | #3 |
| Semantic Segmentation | LIP val | OCR (ResNet-101) | mIoU | 55.6% | #5 |
| Semantic Segmentation | LIP val | HRNetV2 + OCR + RMI (PaddleClas pretrained) | mIoU | 58.2% | #2 |
| Semantic Segmentation | PASCAL Context | OCR (ResNet-101) | mIoU | 54.8 | #21 |
| Semantic Segmentation | PASCAL Context | OCR (HRNetV2-W48) | mIoU | 56.2 | #13 |
| Semantic Segmentation | PASCAL Context | HRNetV2 + OCR + RMI (PaddleClas pretrained) | mIoU | 59.6 | #6 |
| Semantic Segmentation | PASCAL VOC 2012 test | OCR (ResNet-101) | Mean IoU | 84.3% | #21 |
| Semantic Segmentation | PASCAL VOC 2012 test | OCR (HRNetV2-W48) | Mean IoU | 84.5% | #20 |

Methods