ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation

7 Dec 2022 · Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs a plain, non-hierarchical vision transformer as an encoder to extract features and a lightweight decoder to predict body keypoints, in either a top-down or a bottom-up manner. It can be scaled from about 20M to 1B parameters by exploiting the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput versus performance. Moreover, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategies. Building on this flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories across different body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark in both the top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, and MPII for human keypoint detection, COCO-WholeBody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
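The knowledge-factorization idea described above (a task-agnostic feed-forward network shared across all pose datasets, plus a task-specific feed-forward network per dataset, with their outputs combined) can be sketched in a few lines of numpy. This is a minimal illustration under assumed shapes and initialization, not the paper's implementation; all names (`FactorizedFFN`, `ffn`, the ReLU in place of GELU) are illustrative.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Standard transformer feed-forward: linear -> activation -> linear.
    # ReLU is used here for simplicity; ViT-style blocks typically use GELU.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

class FactorizedFFN:
    """Illustrative knowledge-factorized FFN block: one task-agnostic FFN
    shared by all pose datasets, plus one task-specific FFN per dataset;
    the two outputs are summed for tokens of that dataset's images."""

    def __init__(self, dim, hidden, num_tasks, seed=0):
        rng = np.random.default_rng(seed)

        def params():
            return (rng.standard_normal((dim, hidden)) * 0.02, np.zeros(hidden),
                    rng.standard_normal((hidden, dim)) * 0.02, np.zeros(dim))

        self.shared = params()                                # task-agnostic knowledge
        self.specific = [params() for _ in range(num_tasks)]  # per-task knowledge

    def __call__(self, x, task_id):
        return ffn(x, *self.shared) + ffn(x, *self.specific[task_id])

# Patch tokens for one image: (num_patches, embed_dim).
x = np.random.default_rng(1).standard_normal((196, 32))
block = FactorizedFFN(dim=32, hidden=64, num_tasks=3)  # e.g. human / whole-body / animal
y_human = block(x, task_id=0)
y_animal = block(x, task_id=2)
assert y_human.shape == x.shape
assert not np.allclose(y_human, y_animal)  # task-specific branches differ
```

At inference on a single task, only the shared FFN and that task's FFN are evaluated, which is why the multi-task design adds essentially no per-task inference cost.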


Results from the Paper

 Ranked #1 on Animal Pose Estimation on AP-10K (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Animal Pose Estimation | AP-10K | SimpleBaseline-ResNet50 | AP | 68.1 | #9 |
| Animal Pose Estimation | AP-10K | ViTPose+-B | AP | 74.5 | #5 |
| Animal Pose Estimation | AP-10K | ViTPose+-L | AP | 80.4 | #2 |
| Animal Pose Estimation | AP-10K | ViTPose+-H | AP | 82.4 | #1 |
| Animal Pose Estimation | AP-10K | ViTPose+-S | AP | 71.4 | #8 |
| Animal Pose Estimation | AP-10K | HRNet-w48 | AP | 73.1 | #6 |
| Animal Pose Estimation | AP-10K | HRNet-w32 | AP | 72.2 | #7 |
| 2D Human Pose Estimation | COCO-WholeBody | ViTPose+-H | WB AP | 61.2 | #6 |
| 2D Human Pose Estimation | COCO-WholeBody | ViTPose+-H | body AP | 75.9 | #1 |
| 2D Human Pose Estimation | COCO-WholeBody | ViTPose+-H | foot AP | 77.9 | #2 |
| 2D Human Pose Estimation | COCO-WholeBody | ViTPose+-H | face AP | 63.3 | #8 |
| 2D Human Pose Estimation | COCO-WholeBody | ViTPose+-H | hand AP | 54.7 | #6 |