TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G)	AP	80.9	# 2
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G)	AP50	94.8	# 2
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G)	AP75	88.1	# 2
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G)	APL	85.9	# 2
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G)	APM	77.5	# 6
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G)	AR	85.4	# 3
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G, ensemble)	AP	81.1	# 1
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G, ensemble)	AP50	95.0	# 1
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G, ensemble)	AP75	88.2	# 1
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G, ensemble)	APL	86.0	# 1
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G, ensemble)	APM	77.8	# 5
Pose Estimation	COCO test-dev	ViTPose (ViTAE-G, ensemble)	AR	85.6	# 2
Pose Estimation	CrowdPose	ViTPose-G	AP	78.3	# 2
Pose Estimation	CrowdPose	ViTPose-G	AP50	85.3	# 5
Pose Estimation	CrowdPose	ViTPose-G	AP75	81.4	# 1
Pose Estimation	CrowdPose	ViTPose-G	APM	86.6	# 1
Pose Estimation	CrowdPose	ViTPose-G	AP Hard	67.9	# 2
2D Human Pose Estimation	Human-Art	ViTPose-h	AP	0.468	# 3
2D Human Pose Estimation	Human-Art	ViTPose-h	AP (gt bbox)	0.800	# 1
2D Human Pose Estimation	Human-Art	ViTPose-l	AP	0.459	# 4
2D Human Pose Estimation	Human-Art	ViTPose-l	AP (gt bbox)	0.789	# 2
2D Human Pose Estimation	Human-Art	ViTPose-b	AP	0.410	# 6
2D Human Pose Estimation	Human-Art	ViTPose-b	AP (gt bbox)	0.759	# 4
2D Human Pose Estimation	Human-Art	ViTPose-s	AP	0.381	# 8
2D Human Pose Estimation	Human-Art	ViTPose-s	AP (gt bbox)	0.738	# 7
Pose Estimation	OCHuman	ViTPose (ViTAE-G, GT bounding boxes)	Test AP	93.3	# 1
Pose Estimation	OCHuman	ViTPose (ViTAE-G, GT bounding boxes)	Validation AP	92.8	# 1
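
The COCO-style metric names in the table can be read as thresholds on object keypoint similarity (OKS): under the standard COCO keypoint protocol, AP50 and AP75 are average precision at OKS thresholds of 0.5 and 0.75, and AP averages over thresholds from 0.50 to 0.95 in steps of 0.05. The following is a minimal sketch of the OKS computation for a single prediction, not the evaluation code used to produce the numbers above. The function name `oks`, the sample coordinates, and the uniform `sigmas` values are illustrative assumptions; the official evaluation uses fixed per-keypoint constants.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt : lists of (x, y) keypoint coordinates
    visible  : list of bools, True where the ground-truth keypoint is labeled
    area     : ground-truth object area in pixels (the squared scale term)
    sigmas   : per-keypoint falloff constants (hypothetical values in this sketch)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if not v:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * area * k2))
        count += 1
    return total / count if count else 0.0


# Hypothetical three-keypoint pose, used only to exercise the function.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
visible = [True, True, True]
sigmas = [0.05, 0.05, 0.05]
area = 400.0

near = [(x + 1.0, y) for x, y in gt]  # every keypoint off by 1 pixel
far = [(x + 5.0, y) for x, y in gt]   # every keypoint off by 5 pixels

print(oks(near, gt, visible, area, sigmas))
print(oks(far, gt, visible, area, sigmas))
```

With these sample values, the near prediction clears the 0.75 threshold and the far prediction falls below 0.5, so only the former would count as a match when computing AP75 or AP50.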

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitpose-simple-vision-transformer-baselines/pose-estimation-on-coco-test-dev)](https://paperswithcode.com/sota/pose-estimation-on-coco-test-dev?p=vitpose-simple-vision-transformer-baselines)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitpose-simple-vision-transformer-baselines/pose-estimation-on-ochuman)](https://paperswithcode.com/sota/pose-estimation-on-ochuman?p=vitpose-simple-vision-transformer-baselines)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitpose-simple-vision-transformer-baselines/pose-estimation-on-crowdpose)](https://paperswithcode.com/sota/pose-estimation-on-crowdpose?p=vitpose-simple-vision-transformer-baselines)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitpose-simple-vision-transformer-baselines/2d-human-pose-estimation-on-human-art)](https://paperswithcode.com/sota/2d-human-pose-estimation-on-human-art?p=vitpose-simple-vision-transformer-baselines)`

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

26 Apr 2022 · Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance and a lightweight decoder for pose estimation. It can be scaled up from 100M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of transformers, setting a new Pareto front between throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, pre-training and fine-tuning strategy, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art. The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.
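
To give a feel for the scaling claim in the abstract, the sketch below estimates how many parameters the transformer blocks of a plain ViT encoder contain as depth and width grow. It is a back-of-the-envelope calculation, not code from the ViTPose repository. The function name `vit_encoder_params` is hypothetical, and the (depth, width) pairs are the standard ViT-Base, ViT-Large, and ViT-Huge configurations, assumed here to stand in for the backbone sizes. Patch embedding, position embeddings, and the decoder are left out.

```python
def vit_encoder_params(depth, dim, mlp_ratio=4):
    """Approximate parameter count of the transformer blocks in a plain ViT.

    Counts weights and biases of the attention projections, the two-layer MLP,
    and the two layer norms in each block. Embeddings and any task head are
    not included.
    """
    attn = 4 * dim * dim + 4 * dim                    # q, k, v and output projections
    hidden = mlp_ratio * dim
    mlp = dim * hidden + hidden + hidden * dim + dim  # two linear layers with biases
    norms = 2 * (2 * dim)                             # scale and shift of two layer norms
    return depth * (attn + mlp + norms)


# Standard ViT configurations, assumed for illustration.
configs = {
    "base": (12, 768),
    "large": (24, 1024),
    "huge": (32, 1280),
}

for name, (depth, dim) in configs.items():
    print(f"{name}: {vit_encoder_params(depth, dim) / 1e6:.1f}M block parameters")
```

Because each block holds roughly 12 × dim² weights, the count grows linearly with depth and quadratically with width, which is why widening the backbone drives most of the increase in model size.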

Code

vitae-transformer/vitpose (official)

vitae-transformer/qformer

JunkyByte/easy_ViTPose

jaehyunnn/ViTPose_pytorch

gpastal24/ViTPose-Pytorch

Tasks

2D Human Pose Estimation

Keypoint Detection

Pose Estimation

Datasets

ImageNet

MS COCO

MPII

CrowdPose

OCHuman

Human-Art

Results from the Paper

Ranked #1 on Pose Estimation on COCO test-dev

Methods

Dense Connections • Layer Normalization • Linear Layer • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Vision Transformer