TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	EVA	Validation mIoU	62.3	# 5
Semantic Segmentation	ADE20K	EVA	Params (M)	1074	# 7
Semantic Segmentation	ADE20K val	EVA	mIoU	61.5	# 3
Instance Segmentation	COCO minival	EVA	mask AP	55.0	# 2
Instance Segmentation	COCO minival	EVA	AP50	79.4	# 2
Instance Segmentation	COCO minival	EVA	AP75	60.9	# 2
Instance Segmentation	COCO minival	EVA	APL	72.0	# 3
Instance Segmentation	COCO minival	EVA	APM	58.4	# 1
Instance Segmentation	COCO minival	EVA	APS	37.6	# 2
Object Detection	COCO minival	EVA	box AP	64.5	# 6
Object Detection	COCO minival	EVA	AP50	82.1	# 1
Object Detection	COCO minival	EVA	AP75	70.8	# 1
Object Detection	COCO minival	EVA	APS	49.4	# 1
Object Detection	COCO minival	EVA	APM	68.4	# 1
Object Detection	COCO minival	EVA	APL	78.5	# 1
Object Detection	COCO-O	EVA	Average mAP	57.8	# 1
Object Detection	COCO-O	EVA	Effective Robustness	28.86	# 1
Semantic Segmentation	COCO-Stuff test	EVA	mIoU	53.4%	# 1
Instance Segmentation	COCO test-dev	EVA	mask AP	55.5	# 1
Instance Segmentation	COCO test-dev	EVA	AP50	80.0	# 2
Instance Segmentation	COCO test-dev	EVA	APS	36.3	# 3
Instance Segmentation	COCO test-dev	EVA	APM	58.0	# 3
Instance Segmentation	COCO test-dev	EVA	APL	72.4	# 1
Object Detection	COCO test-dev	EVA	box mAP	64.7	# 7
Object Detection	COCO test-dev	EVA	AP50	81.9	# 1
Object Detection	COCO test-dev	EVA	AP75	71.7	# 1
Object Detection	COCO test-dev	EVA	APS	48.5	# 1
Object Detection	COCO test-dev	EVA	APM	67.7	# 1
Object Detection	COCO test-dev	EVA	APL	77.9	# 1
Image Classification	ImageNet	EVA	Top 1 Accuracy	89.7%	# 23
Image Classification	ImageNet	EVA	Number of params	1000M	# 956
Image Classification	ImageNet	EVA (EVA-CLIP)	Number of params	1B	# 2
Self-Supervised Image Classification (with CLIP)	ImageNet (zero-shot)	EVA (EVA-CLIP)	Top-1 Accuracy	78.5%	# 1
Action Classification	Kinetics-400	EVA	Acc@1	89.7	# 13
Action Classification	Kinetics-600	EVA	Top-1 Accuracy	89.8%	# 12
Action Classification	Kinetics-700	EVA	Top-1 Accuracy	82.9%	# 7
Object Detection	LVIS v1.0 val	EVA	box AP	62.2	# 3
Object Detection	LVIS v1.0 val	EVA	box APr	55.1	# 2
Instance Segmentation	LVIS v1.0 val	EVA	mask AP	55.0	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/object-detection-on-coco-o)](https://paperswithcode.com/sota/object-detection-on-coco-o?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/semantic-segmentation-on-coco-stuff-test)](https://paperswithcode.com/sota/semantic-segmentation-on-coco-stuff-test?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/self-supervised-image-classification-with)](https://paperswithcode.com/sota/self-supervised-image-classification-with?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/instance-segmentation-on-lvis-v1-0-val)](https://paperswithcode.com/sota/instance-segmentation-on-lvis-v1-0-val?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/action-classification-on-kinetics-700)](https://paperswithcode.com/sota/action-classification-on-kinetics-700?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=eva-exploring-the-limits-of-masked-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/eva-exploring-the-limits-of-masked-visual/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=eva-exploring-the-limits-of-masked-visual)`
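
The badge rows above all follow one URL pattern: a paper slug and a task slug are substituted into a shields.io endpoint URL and a leaderboard link. A minimal sketch of that pattern, inferred only from the rows listed here; the function name `pwc_badge` and its parameters are hypothetical and not part of any published API:

```python
PWC = "https://paperswithcode.com"
SHIELDS = "https://img.shields.io/endpoint.svg"


def pwc_badge(paper_slug: str, task_slug: str) -> str:
    """Return badge Markdown in the same shape as the rows listed above."""
    # Badge image: shields.io renders the JSON served by the badge endpoint.
    image = f"{SHIELDS}?url={PWC}/badge/{paper_slug}/{task_slug}"
    # Link target: the leaderboard for the task, with the paper highlighted.
    link = f"{PWC}/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({image})]({link})"


badge = pwc_badge(
    "eva-exploring-the-limits-of-masked-visual",
    "object-detection-on-coco-o",
)
print(badge)
```

Called with the slugs shown, the function reproduces the first badge row (without the surrounding backticks); other rows differ only in the task slug.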

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

CVPR 2023 · Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVIS v1.0 dataset, with over a thousand categories, and the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
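
The pretext task in the abstract, reconstructing target features at masked positions given the visible patches, can be illustrated with a toy loss that is evaluated only where patches were hidden. This is a sketch under stated assumptions: the abstract does not name the loss, so negative cosine similarity is assumed here, and the names `masked_feature_loss`, `pred`, `target`, and `mask` are hypothetical. Features are plain Python lists rather than real model outputs.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_feature_loss(pred, target, mask) -> float:
    """Mean negative cosine similarity over masked patches only.

    pred   -- per-patch features predicted by the model (hypothetical output)
    target -- per-patch image-text aligned features to reconstruct
    mask   -- True where the patch was hidden from the model
    The choice of negative cosine similarity is an assumption of this sketch.
    """
    masked = [i for i, hidden in enumerate(mask) if hidden]
    if not masked:
        raise ValueError("at least one patch must be masked")
    return -sum(cosine(pred[i], target[i]) for i in masked) / len(masked)


# Toy example: four patches with 2-D features; patches 1 and 3 are masked.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
pred = [[9.0, 9.0], [0.0, 3.0], [9.0, 9.0], [0.0, 5.0]]
mask = [False, True, False, True]

loss = masked_feature_loss(pred, target, mask)
perfect = masked_feature_loss(target, target, mask)
print(loss, perfect)
```

The predictions at visible positions (patches 0 and 2) are deliberately arbitrary: they do not affect the loss, which reflects that only masked positions are reconstructed. A perfect reconstruction drives the loss to its minimum of -1.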

Code

Repository	Official	Stars
rwightman/pytorch-image-models	yes	29,774
baaivision/eva	yes	1,962
open-mmlab/mmselfsup	no	3,083
leondgarse/keras_cv_attention_models	no	556
PaddlePaddle/PASSL	no	259

See all 6 implementations

Tasks

Action Classification

Action Recognition

Image Classification

Instance Segmentation

Object Detection

Representation Learning

Segmentation

Self-Supervised Image Classification

Self-Supervised Image Classification (with CLIP)

Semantic Segmentation

Temporal Action Localization

Transfer Learning

Datasets

CIFAR-10

ImageNet

MS COCO

CIFAR-100

UCF101

Kinetics

ADE20K

Kinetics 400

LVIS

COCO-Stuff

ImageNet-Sketch

Objects365

LAION-400M

Kinetics-600

CC12M

Kinetics-700

COCO-O

JFT-3B

Results from the Paper

Ranked #1 on Self-Supervised Image Classification (with CLIP) on ImageNet (zero-shot)

Methods

CLIP
