TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Cross-Modal Retrieval	COCO 2014	Florence	Image-to-text R@1	64.7	# 10
Zero-Shot Cross-Modal Retrieval	COCO 2014	Florence	Image-to-text R@5	85.9	# 10
Zero-Shot Cross-Modal Retrieval	COCO 2014	Florence	Text-to-image R@1	47.2	# 10
Zero-Shot Cross-Modal Retrieval	COCO 2014	Florence	Text-to-image R@5	71.4	# 10
Cross-Modal Retrieval	COCO 2014	Florence	Image-to-text R@1	81.8	# 6
Cross-Modal Retrieval	COCO 2014	Florence	Image-to-text R@5	95.2	# 10
Cross-Modal Retrieval	COCO 2014	Florence	Text-to-image R@1	63.2	# 11
Cross-Modal Retrieval	COCO 2014	Florence	Text-to-image R@5	85.7	# 10
Object Detection	COCO minival	Florence-CoSwin-H	box AP	62	# 14
Object Detection	COCO test-dev	Florence-CoSwin-H	box mAP	62.4	# 18
Zero-Shot Cross-Modal Retrieval	Flickr30k	Florence	Image-to-text R@1	90.9	# 8
Zero-Shot Cross-Modal Retrieval	Flickr30k	Florence	Image-to-text R@5	99.1	# 9
Zero-Shot Cross-Modal Retrieval	Flickr30k	Florence	Image-to-text R@10	-	# 18
Zero-Shot Cross-Modal Retrieval	Flickr30k	Florence	Text-to-image R@1	76.7	# 12
Zero-Shot Cross-Modal Retrieval	Flickr30k	Florence	Text-to-image R@5	93.6	# 13
Zero-Shot Cross-Modal Retrieval	Flickr30k	Florence	Text-to-image R@10	-	# 18
Image Classification	ImageNet	Florence-CoSwin-H	Top 1 Accuracy	90.05%	# 18
Image Classification	ImageNet	Florence-CoSwin-H	Number of params	893M	# 954
Zero-Shot Transfer Image Classification	ImageNet	Florence-CoSwin-H (@384pix)	Accuracy (Private)	83.7	# 10
Action Recognition In Videos	Kinetics-400	Florence	Top-1 Accuracy	86.5	# 1
Action Recognition In Videos	Kinetics-400	Florence	Top-5 Accuracy	97.3	# 1
Action Classification	Kinetics-600	Florence (curated FLD-900M pretrain)	Top-1 Accuracy	87.8	# 25
Action Classification	Kinetics-600	Florence (curated FLD-900M pretrain)	Top-5 Accuracy	97.9	# 10
Action Recognition In Videos	Kinetics-600	Florence	Top-1 Accuracy	87.8	# 1
Action Recognition In Videos	Kinetics-600	Florence	Top-5 Accuracy	97.8	# 1
Zero-Shot Video Retrieval	MSR-VTT	Florence	text-to-video R@1	37.6	# 11
Zero-Shot Video Retrieval	MSR-VTT	Florence	text-to-video R@5	63.8	# 10
Zero-Shot Video Retrieval	MSR-VTT	Florence	text-to-video R@10	72.6	# 9
Video Retrieval	MSR-VTT-1kA	Florence	text-to-video R@1	37.6	# 40
Video Retrieval	MSR-VTT-1kA	Florence	text-to-video R@5	63.8	# 40
Video Retrieval	MSR-VTT-1kA	Florence	text-to-video R@10	72.6	# 45
Visual Question Answering (VQA)	VQA v2 test-dev	Florence	Accuracy	80.16	# 13
Visual Question Answering (VQA)	VQA v2 test-std	Florence	overall	80.36	# 6
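
The retrieval rows above report R@K (recall at K): the percentage of queries for which at least one correct match appears among the K highest-scoring candidates. As a minimal sketch of how such a number is typically computed, assuming a precomputed query-by-candidate similarity matrix, the snippet below uses a hypothetical helper `recall_at_k` and made-up toy scores. It is not the evaluation code behind the figures in the table.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant candidate in the top k.

    similarity[i][j] is the score between query i and candidate j.
    relevant[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, targets in zip(similarity, relevant):
        # Rank candidate indices from most to least similar to the query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if targets.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 queries scored against 4 candidates (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # relevant candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.5],  # relevant candidate 1 is ranked third
    [0.1, 0.2, 0.3, 0.7],  # relevant candidate 3 is ranked first
]
toy_relevant = [{0}, {1}, {3}]

r1 = recall_at_k(toy_similarity, toy_relevant, k=1)
r3 = recall_at_k(toy_similarity, toy_relevant, k=3)
```

Using a set of relevant indices per query covers the common case where one image has several ground-truth captions: a hit on any of them counts once for that query.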

Florence: A New Foundation Model for Computer Vision

22 Nov 2021 · Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang

Automated visual understanding of our diverse and open world demands that computer vision models generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical to solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

Code

microsoft/unicl

368

Tasks

Action Classification

Action Recognition

Action Recognition In Videos

Cross-Modal Retrieval

Image Classification

Object Detection

Retrieval

Transfer Learning

Video Retrieval

Visual Question Answering (VQA)

Zero-Shot Cross-Modal Retrieval

Zero-Shot Learning

Zero-Shot Transfer Image Classification

Zero-Shot Video Retrieval

Datasets

ImageNet

MS COCO

Kinetics

Visual Genome

Flickr30k

Kinetics-400

AudioSet

MSR-VTT

EuroSAT

Visual Question Answering v2.0

HowTo100M

WebVid

Kinetics-600

Results from the Paper

Ranked #1 on Action Recognition In Videos on Kinetics-600

Methods

ALIGN • CLIP • Florence
