Beyond Appearance: a Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks

Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images that benefits downstream human-centric tasks as much as possible. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike existing self-supervised learning methods, SOLIDER uses prior knowledge from human images to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks require different ratios of semantic information and appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information to identify individuals. A single learned representation therefore cannot fit all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit the different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms state-of-the-art methods and builds new baselines for these tasks. The code is released at https://github.com/tinyvision/SOLIDER.
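The idea of a controller that takes a user-supplied value and conditions the representation can be illustrated with a toy sketch. The abstract does not specify how the controller is built, so everything below is an assumption made for illustration: the names `make_controller` and `condition`, the scalar `lam` in [0, 1], and the per-channel scale-and-shift (FiLM-style) modulation are all hypothetical and should not be read as the SOLIDER implementation.

```python
import random


def make_controller(dim, seed=0):
    """Build a hypothetical controller: per-channel scale and shift weights.

    In a real model these weights would be learned; here they are random
    placeholders so the sketch runs on its own.
    """
    rng = random.Random(seed)
    w_scale = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    w_shift = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return w_scale, w_shift


def condition(features, lam, controller):
    """Modulate a feature vector with the control value lam (assumed in [0, 1]).

    Assumed form: out_i = f_i * (1 + lam * w_scale_i) + lam * w_shift_i,
    so lam = 0 leaves the features unchanged and larger lam changes them more.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    w_scale, w_shift = controller
    return [
        f * (1.0 + lam * ws) + lam * wb
        for f, ws, wb in zip(features, w_scale, w_shift)
    ]


if __name__ == "__main__":
    controller = make_controller(dim=4)
    features = [1.0, 2.0, 3.0, 4.0]
    # The same backbone features yield different representations per lam.
    for lam in (0.0, 0.5, 1.0):
        print(lam, condition(features, lam, controller))
```

In this toy version, a downstream task would pick the value of `lam` that suits it, in the spirit of the abstract's example that human parsing wants more semantic information than person re-identification does.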

Venue: CVPR 2023

Results from the Paper


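| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| Pedestrian Detection | CityPersons | SOLIDER | Reasonable MR^-2 | 9.7 | # 9 |
| Pedestrian Detection | CityPersons | SOLIDER | Heavy MR^-2 | 39.4 | # 7 |
| Person Search | CUHK-SYSU | SOLIDER | mAP | 95.5 | # 4 |
| Person Search | CUHK-SYSU | SOLIDER | Top-1 | 95.8 | # 6 |
| Semantic Segmentation | LIP val | SOLIDER | mIoU | 60.50% | # 4 |
| Person Re-Identification | Market-1501 | SOLIDER | Rank-1 | 96.9 | # 5 |
| Person Re-Identification | Market-1501 | SOLIDER | mAP | 93.9 | # 19 |
| Person Re-Identification | Market-1501 | SOLIDER (RK) | Rank-1 | 96.7 | # 9 |
| Person Re-Identification | Market-1501 | SOLIDER (RK) | mAP | 95.6 | # 2 |
| Pose Estimation | MS COCO | SOLIDER (swin-B) | AP | 76.6 | # 8 |
| Pose Estimation | MS COCO | SOLIDER (swin-B) | AR | 81.5 | # 3 |
| Person Re-Identification | MSMT17 | SOLIDER (with re-ranking) | Rank-1 | 91.7 | # 1 |
| Person Re-Identification | MSMT17 | SOLIDER (with re-ranking) | mAP | 86.5 | # 2 |
| Person Re-Identification | MSMT17 | SOLIDER (without re-ranking) | Rank-1 | 90.7 | # 3 |
| Person Re-Identification | MSMT17 | SOLIDER (without re-ranking) | mAP | 77.1 | # 5 |
| Pedestrian Attribute Recognition | PA-100K | SOLIDER | Accuracy | 86.38 | # 5 |
| Person Search | PRW | SOLIDER | mAP | 59.8 | # 1 |
| Person Search | PRW | SOLIDER | Top-1 | 86.7 | # 5 |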
