EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

We launch EVA, a vision-centric foundation model that explores the limits of visual representation learning at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct masked-out, image-text-aligned vision features, conditioned on the visible image patches. Via this pretext task we can efficiently scale EVA up to one billion parameters, setting new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA produce qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap on the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 (over a thousand categories) as on COCO (only eighty categories). Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot connecting images and text. We find that initializing the vision tower of a giant CLIP with EVA greatly stabilizes training and outperforms the from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
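To make the pretext task concrete, here is a minimal PyTorch sketch of masked feature prediction: a plain ViT student sees the visible patches (plus mask tokens) and regresses the frozen CLIP vision encoder's features at the masked positions. The `student` and `clip_teacher` interfaces and the cosine-similarity loss are illustrative assumptions for this sketch, not the exact released implementation.

```python
import torch
import torch.nn.functional as F

def eva_mim_loss(student, clip_teacher, images, mask):
    """One step of the masked-feature-prediction pretext task (sketch).

    student      -- a vanilla ViT that receives the visible patches plus
                    learnable [MASK] tokens and outputs one feature per patch
    clip_teacher -- a frozen CLIP vision encoder whose per-patch features
                    serve as image-text-aligned regression targets
    images       -- (B, 3, H, W) image batch
    mask         -- (B, N) boolean tensor, True where a patch is masked out
    """
    with torch.no_grad():                  # targets are never updated
        targets = clip_teacher(images)     # (B, N, D) CLIP patch features
    preds = student(images, mask)          # (B, N, D) student predictions
    # Regress only the masked positions; negative cosine similarity is an
    # assumed loss choice for this sketch.
    p, t = preds[mask], targets[mask]      # (K, D) each
    return 1.0 - F.cosine_similarity(p, t, dim=-1).mean()
```

Keeping the teacher frozen means the target space never drifts during pre-training, which helps make the recipe straightforward to scale.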

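The CLIP warm-start described in the abstract amounts to copying EVA's pre-trained weights into the vision tower before contrastive training begins, while the text tower still trains from scratch. Below is a runnable toy sketch of that weight transfer; `TinyCLIP` and its layer sizes are hypothetical stand-ins, whereas the real EVA-CLIP vision tower is the 1B-parameter ViT released in the repository above.

```python
import torch
import torch.nn as nn

class TinyCLIP(nn.Module):
    """Stand-in for a CLIP model: a vision tower plus a text tower.

    For illustration only; real EVA-CLIP uses a 1B-parameter ViT
    vision tower (weights at https://github.com/baaivision/EVA).
    """
    def __init__(self, dim=64):
        super().__init__()
        self.visual = nn.Sequential(nn.Linear(3 * 16 * 16, dim), nn.GELU(),
                                    nn.Linear(dim, dim))   # "vision tower"
        self.text = nn.Embedding(1000, dim)                # "text tower"

# Pretend this is the pre-trained EVA encoder and its checkpoint.
eva_vision = nn.Sequential(nn.Linear(3 * 16 * 16, 64), nn.GELU(),
                           nn.Linear(64, 64))
eva_state = eva_vision.state_dict()

clip = TinyCLIP()
# Warm-start only the vision tower; the text tower keeps its random init.
# strict=False tolerates keys present in one module but not the other.
missing, unexpected = clip.visual.load_state_dict(eva_state, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```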
CVPR 2023
| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Semantic Segmentation | ADE20K | EVA | Validation mIoU | 62.3 | #5 |
| Semantic Segmentation | ADE20K | EVA | Params (M) | 1074 | #7 |
| Semantic Segmentation | ADE20K val | EVA | mIoU | 61.5 | #3 |
| Instance Segmentation | COCO minival | EVA | mask AP | 55.0 | #2 |
| Instance Segmentation | COCO minival | EVA | AP50 | 79.4 | #2 |
| Instance Segmentation | COCO minival | EVA | AP75 | 60.9 | #2 |
| Instance Segmentation | COCO minival | EVA | APL | 72.0 | #3 |
| Instance Segmentation | COCO minival | EVA | APM | 58.4 | #1 |
| Instance Segmentation | COCO minival | EVA | APS | 37.6 | #2 |
| Object Detection | COCO minival | EVA | box AP | 64.5 | #6 |
| Object Detection | COCO minival | EVA | AP50 | 82.1 | #1 |
| Object Detection | COCO minival | EVA | AP75 | 70.8 | #1 |
| Object Detection | COCO minival | EVA | APS | 49.4 | #1 |
| Object Detection | COCO minival | EVA | APM | 68.4 | #1 |
| Object Detection | COCO minival | EVA | APL | 78.5 | #1 |
| Object Detection | COCO-O | EVA | Average mAP | 57.8 | #1 |
| Object Detection | COCO-O | EVA | Effective Robustness | 28.86 | #1 |
| Semantic Segmentation | COCO-Stuff test | EVA | mIoU | 53.4% | #1 |
| Instance Segmentation | COCO test-dev | EVA | mask AP | 55.5 | #1 |
| Instance Segmentation | COCO test-dev | EVA | AP50 | 80.0 | #2 |
| Instance Segmentation | COCO test-dev | EVA | APS | 36.3 | #3 |
| Instance Segmentation | COCO test-dev | EVA | APM | 58.0 | #3 |
| Instance Segmentation | COCO test-dev | EVA | APL | 72.4 | #1 |
| Object Detection | COCO test-dev | EVA | box mAP | 64.7 | #7 |
| Object Detection | COCO test-dev | EVA | AP50 | 81.9 | #1 |
| Object Detection | COCO test-dev | EVA | AP75 | 71.7 | #1 |
| Object Detection | COCO test-dev | EVA | APS | 48.5 | #1 |
| Object Detection | COCO test-dev | EVA | APM | 67.7 | #1 |
| Object Detection | COCO test-dev | EVA | APL | 77.9 | #1 |
| Image Classification | ImageNet | EVA | Top 1 Accuracy | 89.7% | #23 |
| Image Classification | ImageNet | EVA | Number of params | 1000M | #956 |
| Image Classification | ImageNet | EVA (EVA-CLIP) | Number of params | 1B | #2 |
| Self-Supervised Image Classification (with CLIP) | ImageNet (zero-shot) | EVA (EVA-CLIP) | Top-1 Accuracy | 78.5% | #1 |
| Action Classification | Kinetics-400 | EVA | Acc@1 | 89.7 | #13 |
| Action Classification | Kinetics-600 | EVA | Top-1 Accuracy | 89.8% | #12 |
| Action Classification | Kinetics-700 | EVA | Top-1 Accuracy | 82.9% | #7 |
| Object Detection | LVIS v1.0 val | EVA | box AP | 62.2 | #3 |
| Object Detection | LVIS v1.0 val | EVA | box APr | 55.1 | #2 |
| Instance Segmentation | LVIS v1.0 val | EVA | mask AP | 55.0 | #2 |
