Revealing the Dark Secrets of Masked Image Modeling

Masked image modeling (MIM) as pre-training has been shown to be effective for numerous vision downstream tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. With MIM, the model maintains large diversity across attention heads in all layers; for supervised models, the diversity across attention heads almost disappears in the last three layers, and less diversity harms the fine-tuning performance. From the experiments, we find that MIM models perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics and on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L achieves state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope our work can inspire new and solid research in this direction.
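The locality and attention-head-diversity observations mentioned in the abstract are commonly quantified via the average attention distance of each head. Below is a minimal, hypothetical sketch (not the paper's code) of how such a per-head score could be computed with NumPy; the function name, grid size, and patch size are illustrative assumptions.

```python
# A minimal sketch of the locality diagnostic: the average attention distance per head.
# Given an attention map of shape (num_heads, N, N) over a grid of patch tokens, each
# head's score is the attention-weighted mean pixel distance between query and key
# patches. Small scores indicate local heads, large scores global heads; the spread of
# scores across heads is one proxy for attention-head diversity.
import numpy as np

def average_attention_distance(attn, grid_size, patch_size=16):
    """attn: (num_heads, N, N) softmax attention over N = grid_size**2 patch tokens."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1) * patch_size  # (N, 2) in pixels
    # Pairwise Euclidean distances between patch centers, shape (N, N).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Attention-weighted distance per query token, then averaged over query tokens.
    return (attn * dists[None]).sum(-1).mean(-1)  # shape (num_heads,)

# Toy usage with random attention weights (assumed 12 heads, 14x14 patch grid).
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 196, 196))
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
per_head = average_attention_distance(attn, grid_size=14)
print(per_head.round(1), "std across heads:", per_head.std().round(1))
```

In this framing, the paper's visualization findings correspond to MIM models keeping a wide spread of per-head distances at every depth, while supervised models collapse toward large, similar distances in the last layers.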

CVPR 2023

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Pose Estimation | COCO test-dev | SwinV2-B 1K-MIM | AP | 76.7 | #13 |
| Pose Estimation | COCO test-dev | SwinV2-L 1K-MIM | AP | 77.2 | #10 |
| Pose Estimation | CrowdPose | SwinV2-L 1K-MIM | AP | 75.5 | #4 |
| Pose Estimation | CrowdPose | SwinV2-B 1K-MIM | AP | 74.9 | #5 |
| Visual Object Tracking | GOT-10k | SwinV2-B 1K-MIM | Average Overlap | 70.8 | #14 |
| Visual Object Tracking | GOT-10k | SwinV2-L 1K-MIM | Average Overlap | 72.9 | #12 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-B 1K-MIM | absolute relative error | 0.052 | #20 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-B 1K-MIM | RMSE | 2.050 | #17 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-B 1K-MIM | Sq Rel | 0.148 | #10 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-B 1K-MIM | RMSE log | 0.078 | #20 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-B 1K-MIM | Delta < 1.25 | 0.976 | #18 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-B 1K-MIM | Delta < 1.25^2 | 0.998 | #1 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-B 1K-MIM | Delta < 1.25^3 | 0.999 | #11 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-L 1K-MIM | absolute relative error | 0.050 | #13 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-L 1K-MIM | RMSE | 1.966 | #9 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-L 1K-MIM | Sq Rel | 0.139 | #17 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-L 1K-MIM | RMSE log | 0.075 | #12 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-L 1K-MIM | Delta < 1.25 | 0.977 | #13 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-L 1K-MIM | Delta < 1.25^2 | 0.998 | #1 |
| Monocular Depth Estimation | KITTI Eigen split | SwinV2-L 1K-MIM | Delta < 1.25^3 | 1.000 | #1 |
| Visual Object Tracking | LaSOT | SwinV2-B 1K-MIM | AUC | 70 | #18 |
| Visual Object Tracking | LaSOT | SwinV2-L 1K-MIM | AUC | 70.7 | #14 |
| Depth Estimation | NYU-Depth V2 | SwinV2-B 1K-MIM | RMS | 0.304 | #5 |
| Depth Estimation | NYU-Depth V2 | SwinV2-L 1K-MIM | RMS | 0.287 | #3 |
| Monocular Depth Estimation | NYU-Depth V2 | SwinV2-L 1K-MIM | RMSE | 0.287 | #17 |
| Monocular Depth Estimation | NYU-Depth V2 | SwinV2-L 1K-MIM | absolute relative error | 0.083 | #18 |
| Monocular Depth Estimation | NYU-Depth V2 | SwinV2-L 1K-MIM | Delta < 1.25 | 0.949 | #17 |
| Monocular Depth Estimation | NYU-Depth V2 | SwinV2-L 1K-MIM | Delta < 1.25^2 | 0.994 | #14 |
| Monocular Depth Estimation | NYU-Depth V2 | SwinV2-L 1K-MIM | Delta < 1.25^3 | 0.999 | #4 |
| Monocular Depth Estimation | NYU-Depth V2 | SwinV2-L 1K-MIM | log 10 | 0.035 | #18 |
