Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Vision transformers have achieved great success in many computer vision tasks. Most methods generate vision tokens by splitting an image into a fixed, regular grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks: the human body needs a fine representation with many tokens, while the image background can be modeled with only a few. To address this problem, we propose a novel vision transformer, called the Token Clustering Transformer (TCFormer), which merges tokens via progressive clustering, allowing tokens to be merged across different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust their shapes to fit semantic concepts and adopt a fine resolution in regions containing critical details, which benefits the capture of detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git
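To make the clustering step concrete, below is a minimal PyTorch sketch of one way to merge tokens by clustering, using a density-peaks (DPC-kNN-style) scheme. The function name `dpc_knn_merge`, the choice of `k`, and the plain feature averaging are illustrative assumptions for this sketch, not the paper's exact implementation; see the linked repository for that.

```python
import torch


def dpc_knn_merge(tokens: torch.Tensor, num_clusters: int, k: int = 5) -> torch.Tensor:
    """Hypothetical simplified sketch: merge vision tokens by
    density-peaks clustering (DPC-kNN style).

    tokens: (B, N, C) token features; returns (B, num_clusters, C).
    """
    B, N, C = tokens.shape

    # Pairwise squared distances between all tokens: (B, N, N).
    dist = torch.cdist(tokens, tokens) ** 2

    # Local density: high when the k nearest neighbours are close.
    knn_dist, _ = dist.topk(k, dim=-1, largest=False)
    density = (-knn_dist.mean(dim=-1)).exp()                      # (B, N)

    # For each token, distance to the nearest token of higher density.
    higher = density[:, None, :] > density[:, :, None]            # (B, N, N)
    delta = dist.masked_fill(~higher, float("inf")).amin(dim=-1)  # (B, N)
    # The globally densest token has no denser neighbour: use the max distance.
    max_dist = dist.flatten(1).amax(dim=-1, keepdim=True)         # (B, 1)
    delta = torch.where(torch.isinf(delta), max_dist, delta)

    # Cluster centres: tokens that are both dense and far from denser tokens.
    centre_idx = (density * delta).topk(num_clusters, dim=-1).indices  # (B, K)

    # Assign every token to its nearest centre.
    centre_dist = dist.gather(1, centre_idx[..., None].expand(-1, -1, N))  # (B, K, N)
    assign = centre_dist.argmin(dim=1)                            # (B, N)

    # Merge each cluster by averaging its token features.
    merged = torch.zeros(B, num_clusters, C, device=tokens.device, dtype=tokens.dtype)
    counts = torch.zeros(B, num_clusters, 1, device=tokens.device, dtype=tokens.dtype)
    merged.scatter_add_(1, assign[..., None].expand(-1, -1, C), tokens)
    counts.scatter_add_(1, assign[..., None],
                        torch.ones(B, N, 1, device=tokens.device, dtype=tokens.dtype))
    return merged / counts.clamp(min=1.0)
```

For example, `dpc_knn_merge(x, num_clusters=x.shape[1] // 4)` reduces a set of N tokens to N/4 merged tokens of flexible shape, so that later stages can spend their capacity on the retained, detail-rich regions.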

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| 3D Human Pose Estimation | 3DPW | TCFormer | PA-MPJPE (mm) | 49.3 | #49 |
| 3D Human Pose Estimation | 3DPW | TCFormer | MPJPE (mm) | 80.6 | #57 |
| 2D Human Pose Estimation | COCO-WholeBody | TCFormer | WB (AP) | 64.2 | #4 |
| 2D Human Pose Estimation | COCO-WholeBody | TCFormer | body (AP) | 71.8 | #6 |
| 2D Human Pose Estimation | COCO-WholeBody | TCFormer | foot (AP) | 74.4 | #3 |
| 2D Human Pose Estimation | COCO-WholeBody | TCFormer | face (AP) | 79.0 | #6 |
| 2D Human Pose Estimation | COCO-WholeBody | TCFormer | hand (AP) | 61.4 | #3 |
| 3D Human Pose Estimation | Human3.6M | TCFormer | Average MPJPE (mm) | 62.9 | #259 |
| 3D Human Pose Estimation | Human3.6M | TCFormer | PA-MPJPE (mm) | 42.8 | #76 |
