We introduce a novel method for pre-training large-scale vision encoders. Building on recent advances in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
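The objective in the abstract (a vision encoder paired with a causal multimodal decoder that predicts raw image patches by regression and text tokens by classification, with the two losses combined) can be sketched in a few lines. This is a minimal NumPy sketch under heavy simplification: single linear maps stand in for the transformer encoder and decoder, and every dimension, weight, and the unweighted loss sum are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; all sizes here are illustrative, not the paper's.
num_patches, patch_dim = 16, 12   # flattened raw image patches
num_text, vocab = 8, 100          # caption tokens that follow the image
d_model = 32

# Randomly initialized stand-in weights.
W_enc = rng.normal(size=(patch_dim, d_model)) * 0.1   # "vision encoder"
E_txt = rng.normal(size=(vocab, d_model)) * 0.1       # text embedding table
W_img = rng.normal(size=(d_model, patch_dim)) * 0.1   # patch-regression head
W_txt = rng.normal(size=(d_model, vocab)) * 0.1       # token-prediction head

patches = rng.normal(size=(num_patches, patch_dim))   # "raw" image patches
text_ids = rng.integers(0, vocab, size=num_text)      # caption token ids

# Causal sequence: image patches first, then text (image -> text ordering).
feats = np.concatenate([patches @ W_enc, E_txt[text_ids]])

# Decoder heads. A real decoder would mix positions with causal attention;
# here each position's feature directly predicts the *next* element.
patch_pred = feats[:num_patches - 1] @ W_img      # predict patches 1..N-1
txt_logits = feats[num_patches - 1:-1] @ W_txt    # predict each text token

# Image loss: L2 regression on the next raw patch.
l2_loss = np.mean((patch_pred - patches[1:]) ** 2)

# Text loss: softmax cross-entropy on the next token id.
shifted = txt_logits - txt_logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
ce_loss = -np.mean(log_probs[np.arange(num_text), text_ids])

# One combined pre-training loss over both modalities (unweighted here).
total_loss = l2_loss + ce_loss
```

Because both targets come from the same ordered sequence, a single decoder pass supervises every position, unlike contrastive training, which only compares pooled image and text embeddings.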

Results from the Paper


Task                  Dataset      Model               Metric Name       Metric Value  Global Rank
Image Classification  ImageNet     AIMv2-2B            Number of params  2700M         # 1045
Image Classification  ImageNet     AIMv2-3B            Top 1 Accuracy    88.5%         # 49
Image Classification  ImageNet     AIMv2-1B            Top 1 Accuracy    88.1%         # 67
Image Classification  ImageNet     AIMv2-1B            Number of params  1200M         # 1031
Image Classification  ImageNet     AIMv2-H             Top 1 Accuracy    87.5%         # 89
Image Classification  ImageNet     AIMv2-H             Number of params  600M          # 1013
Image Classification  ImageNet     AIMv2-L             Top 1 Accuracy    86.6%         # 140
Image Classification  ImageNet     AIMv2-L             Number of params  300M          # 982
Image Classification  ImageNet     AIMv2-3B (448 res)  Top 1 Accuracy    89.5%         # 25
Image Classification  iNaturalist  AIMv2-1B            Top 1 Accuracy    79.7          # 7
Image Classification  iNaturalist  AIMv2-3B            Top 1 Accuracy    81.5          # 5
Image Classification  iNaturalist  AIMv2-3B (448 res)  Top 1 Accuracy    85.9          # 1
Image Classification  iNaturalist  AIMv2-H             Top 1 Accuracy    77.9          # 8
Image Classification  iNaturalist  AIMv2-L             Top 1 Accuracy    76.0          # 9

Methods