Automated visual understanding of our diverse and open world requires computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical to this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
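
To make the zero-shot classification numbers above more concrete, the following is a minimal sketch of how top-k accuracy can be computed once images and class-name prompts have been embedded in a shared space. It is not Florence's actual evaluation code: the function names (`l2_normalize`, `top_k_classes`, `top_k_accuracy`) and the toy 2-D embeddings are hypothetical, and the use of cosine similarity as the matching score is an assumption based on the common practice for CLIP-style models.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_classes(image_emb, class_text_embs, k):
    """Return the indices of the k classes whose text embeddings have the
    highest cosine similarity with the image embedding (best first)."""
    img = l2_normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(img, l2_normalize(text_emb)))
        for text_emb in class_text_embs
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def top_k_accuracy(image_embs, labels, class_text_embs, k):
    """Percentage of images whose true label is among the top-k predictions."""
    hits = sum(
        label in top_k_classes(img, class_text_embs, k)
        for img, label in zip(image_embs, labels)
    )
    return 100.0 * hits / len(labels)


# Toy, made-up embeddings (assumption: real embeddings come from the
# model's image and text encoders and have far more dimensions).
toy_class_text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.7]]
toy_labels = [0, 1, 0]

top1 = top_k_accuracy(toy_image_embs, toy_labels, toy_class_text_embs, k=1)
top2 = top_k_accuracy(toy_image_embs, toy_labels, toy_class_text_embs, k=2)
print(f"top-1: {top1:.2f}  top-2: {top2:.2f}")
```

No class-specific training is involved in this sketch: adding a class only requires embedding one more text prompt, which is what makes the transfer "zero-shot".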

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | Florence | Image-to-text R@1 | 64.7 | #10 |
| | | | Image-to-text R@5 | 85.9 | #10 |
| | | | Text-to-image R@1 | 47.2 | #10 |
| | | | Text-to-image R@5 | 71.4 | #10 |
| Cross-Modal Retrieval | COCO 2014 | Florence | Image-to-text R@1 | 81.8 | #6 |
| | | | Image-to-text R@5 | 95.2 | #10 |
| | | | Text-to-image R@1 | 63.2 | #11 |
| | | | Text-to-image R@5 | 85.7 | #10 |
| Object Detection | COCO minival | Florence-CoSwin-H | box AP | 62 | #14 |
| Object Detection | COCO test-dev | Florence-CoSwin-H | box mAP | 62.4 | #18 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | Florence | Image-to-text R@1 | 90.9 | #8 |
| | | | Image-to-text R@5 | 99.1 | #9 |
| | | | Image-to-text R@10 | - | #18 |
| | | | Text-to-image R@1 | 76.7 | #12 |
| | | | Text-to-image R@5 | 93.6 | #13 |
| | | | Text-to-image R@10 | - | #18 |
| Image Classification | ImageNet | Florence-CoSwin-H | Top 1 Accuracy | 90.05% | #18 |
| | | | Number of params | 893M | #954 |
| Zero-Shot Transfer Image Classification | ImageNet | Florence-CoSwin-H (@384pix) | Accuracy (Private) | 83.7 | #10 |
| Action Recognition In Videos | Kinetics-400 | Florence | Top-1 Accuracy | 86.5 | #1 |
| | | | Top-5 Accuracy | 97.3 | #1 |
| Action Classification | Kinetics-600 | Florence (curated FLD-900M pretrain) | Top-1 Accuracy | 87.8 | #25 |
| | | | Top-5 Accuracy | 97.9 | #10 |
| Action Recognition In Videos | Kinetics-600 | Florence | Top-1 Accuracy | 87.8 | #1 |
| | | | Top-5 Accuracy | 97.8 | #1 |
| Zero-Shot Video Retrieval | MSR-VTT | Florence | text-to-video R@1 | 37.6 | #11 |
| | | | text-to-video R@5 | 63.8 | #10 |
| | | | text-to-video R@10 | 72.6 | #9 |
| Video Retrieval | MSR-VTT-1kA | Florence | text-to-video R@1 | 37.6 | #40 |
| | | | text-to-video R@5 | 63.8 | #40 |
| | | | text-to-video R@10 | 72.6 | #45 |
| Visual Question Answering (VQA) | VQA v2 test-dev | Florence | Accuracy | 80.16 | #13 |
| Visual Question Answering (VQA) | VQA v2 test-std | Florence | overall | 80.36 | #6 |
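
The R@1, R@5, and R@10 retrieval metrics reported above are recall-at-K scores. As a rough illustration, the sketch below computes Recall@K from a query-by-candidate similarity matrix. The function name `recall_at_k`, the toy scores, and the convention that a query counts as a hit when any of its ground-truth candidates appears in the top K are assumptions for illustration. The exact protocol used for each benchmark may differ.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    similarity[q][c] is the score of candidate c for query q (higher is better).
    ground_truth[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy, made-up scores for 3 text queries against 3 candidate images.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct candidate 1 is ranked second
    [0.5, 0.6, 0.1],  # query 2: correct candidate 0 is ranked second
]
toy_ground_truth = [{0}, {1}, {0}]

r1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
r2 = recall_at_k(toy_similarity, toy_ground_truth, k=2)
print(f"R@1: {r1:.2f}  R@2: {r2:.2f}")
```

In this toy case only one of the three queries is answered correctly at rank 1, while all three are covered within the top 2, which is why R@K can only stay equal or increase as K grows.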

Methods