TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Retrieval	COCO-CN	R2D2 (ViT-L/14)	R@1	79.1	# 3
Image Retrieval	COCO-CN	R2D2 (ViT-L/14)	R@5	96.5	# 4
Image Retrieval	COCO-CN	R2D2 (ViT-L/14)	R@10	98.9	# 5
Image Retrieval	COCO-CN	R2D2 (ViT-B)	R@1	75.1	# 6
Image Retrieval	COCO-CN	R2D2 (ViT-B)	R@5	94.2	# 7
Image Retrieval	COCO-CN	R2D2 (ViT-B)	R@10	98.1	# 6
Zero-shot Image Retrieval	COCO-CN	R2D2 (ViT-L/14)	R@1	56.4	# 9
Zero-shot Image Retrieval	COCO-CN	R2D2 (ViT-L/14)	R@5	85.0	# 9
Zero-shot Image Retrieval	COCO-CN	R2D2 (ViT-L/14)	R@10	93.1	# 9
Zero-shot Image Retrieval	Flickr30k-CN	R2D2 (ViT-L/14)	R@1	60.9	# 10
Zero-shot Image Retrieval	Flickr30k-CN	R2D2 (ViT-L/14)	R@5	86.8	# 10
Zero-shot Image Retrieval	Flickr30k-CN	R2D2 (ViT-L/14)	R@10	92.7	# 10
Image Retrieval	Flickr30k-CN	R2D2 (ViT-L/14)	R@1	84.4	# 3
Image Retrieval	Flickr30k-CN	R2D2 (ViT-L/14)	R@5	96.7	# 5
Image Retrieval	Flickr30k-CN	R2D2 (ViT-L/14)	R@10	98.4	# 4
Image Retrieval	Flickr30k-CN	R2D2 (ViT-B)	R@1	78.3	# 8
Image Retrieval	Flickr30k-CN	R2D2 (ViT-B)	R@5	94.6	# 8
Image Retrieval	Flickr30k-CN	R2D2 (ViT-B)	R@10	97.0	# 7
Image Retrieval	MUGE Retrieval	R2D2 (ViT-B)	R@1	47.4	# 8
Image Retrieval	MUGE Retrieval	R2D2 (ViT-B)	R@5	75.1	# 7
Image Retrieval	MUGE Retrieval	R2D2 (ViT-B)	R@10	83.5	# 8
Image Retrieval	MUGE Retrieval	R2D2 (ViT-B)	Mean Recall	68.7	# 8
Zero-shot Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	R@1	49.5	# 5
Zero-shot Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	R@5	75.7	# 5
Zero-shot Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	R@10	83.2	# 5
Zero-shot Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	Mean Recall	69.5	# 5
Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	R@1	60.1	# 4
Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	R@5	82.9	# 5
Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	R@10	89.4	# 5
Image Retrieval	MUGE Retrieval	R2D2 (ViT-L/14)	Mean Recall	77.5	# 4
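
The Mean Recall rows above are consistent with the arithmetic mean of R@1, R@5, and R@10 for the same model and dataset. The following is a minimal sketch of that check. The helper names `recall_at_k` and `mean_recall` and the toy ranks are illustrative. The definition of R@K as the percentage of queries whose ground-truth image appears in the top K results is the conventional one and is assumed here; this page does not state it.

```python
from typing import Sequence


def recall_at_k(gt_ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground-truth item has 1-based rank <= k."""
    hits = sum(1 for rank in gt_ranks if rank <= k)
    return 100.0 * hits / len(gt_ranks)


def mean_recall(r1: float, r5: float, r10: float) -> float:
    """Arithmetic mean of R@1, R@5 and R@10, rounded to one decimal place."""
    return round((r1 + r5 + r10) / 3, 1)


# Hypothetical ranks of the correct image for four text queries.
toy_ranks = [1, 3, 7, 12]
toy_r1 = recall_at_k(toy_ranks, 1)
toy_r5 = recall_at_k(toy_ranks, 5)
toy_r10 = recall_at_k(toy_ranks, 10)
print(toy_r1, toy_r5, toy_r10, mean_recall(toy_r1, toy_r5, toy_r10))

# MUGE Retrieval rows copied from the table above.
muge_vit_b = mean_recall(47.4, 75.1, 83.5)           # R2D2 (ViT-B), fine-tuned
muge_vit_l_zero_shot = mean_recall(49.5, 75.7, 83.2)  # R2D2 (ViT-L/14), zero-shot
muge_vit_l = mean_recall(60.1, 82.9, 89.4)            # R2D2 (ViT-L/14), fine-tuned
print(muge_vit_b, muge_vit_l_zero_shot, muge_vit_l)
```

The three MUGE values reproduce the reported Mean Recall figures of 68.7, 69.5, and 77.5.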

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-and-r2d2-a-large-scale-chinese-cross/image-retrieval-on-coco-cn)](https://paperswithcode.com/sota/image-retrieval-on-coco-cn?p=zero-and-r2d2-a-large-scale-chinese-cross)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-and-r2d2-a-large-scale-chinese-cross/image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/image-retrieval-on-flickr30k-cn?p=zero-and-r2d2-a-large-scale-chinese-cross)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-and-r2d2-a-large-scale-chinese-cross/image-retrieval-on-muge-retrieval)](https://paperswithcode.com/sota/image-retrieval-on-muge-retrieval?p=zero-and-r2d2-a-large-scale-chinese-cross)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-and-r2d2-a-large-scale-chinese-cross/zero-shot-image-retrieval-on-muge-retrieval)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-muge-retrieval?p=zero-and-r2d2-a-large-scale-chinese-cross)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-and-r2d2-a-large-scale-chinese-cross/zero-shot-image-retrieval-on-coco-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-coco-cn?p=zero-and-r2d2-a-large-scale-chinese-cross)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/zero-and-r2d2-a-large-scale-chinese-cross/zero-shot-image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-flickr30k-cn?p=zero-and-r2d2-a-large-scale-chinese-cross)`

CCMB: A Large-scale Chinese Cross-modal Benchmark

8 May 2022 · Chunyu Xie, Heng Cai, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Jianfei Song, Henrique Morimitsu, Lin Yao, Dexin Wang, Xiangzheng Zhang, Dawei Leng, Baochang Zhang, Xiangyang Ji, Yafeng Deng

Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on various downstream tasks. In contrast to the many benchmarks available for English corpora, large-scale pre-training datasets and downstream datasets for Chinese corpora remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset, Zero, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest for Chinese cross-modal downstream tasks. Along with CCMB, we also develop a VLP framework named R2D2, which applies a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided Distillation and feature-guided Distillation) to further enhance the learning capability. With Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets from five broad categories of tasks: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
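
The abstract names a pre-Ranking + Ranking strategy but does not detail it on this page. One common reading of a two-stage design like this is a cheap embedding-similarity pass that shortlists candidates, followed by a more expensive scorer that reorders the shortlist. The sketch below illustrates that generic retrieve-then-rerank pattern under that assumption; it is not the authors' implementation. The function names, toy vectors, and `fine_scores` values are all hypothetical.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pre_rank(query_vec: Sequence[float],
             image_vecs: Sequence[Sequence[float]],
             top_k: int) -> List[int]:
    """Stage 1: shortlist the top_k image indices by embedding similarity."""
    scores = [cosine(query_vec, vec) for vec in image_vecs]
    order = sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def rank(candidates: Sequence[int],
         fine_score: Callable[[int], float]) -> List[int]:
    """Stage 2: reorder the shortlist with a finer (costlier) scorer."""
    return sorted(candidates, key=fine_score, reverse=True)


# Hypothetical image embeddings and one text-query embedding.
image_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.2]

# Hypothetical fine-grained matching scores, standing in for a joint
# image-text scorer that would be too slow to run on every image.
fine_scores = {0: 0.40, 1: 0.95, 2: 0.10, 3: 0.60}

shortlist = pre_rank(query_vec, image_vecs, top_k=3)
final_order = rank(shortlist, lambda i: fine_scores[i])
print(shortlist, final_order)
```

The design trade-off in this pattern is that the first stage scales to the whole gallery because embeddings can be precomputed, while the second stage spends its budget only on the few candidates that survive.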

Code

yuxie11/R2D2 official

150

Tasks

Image Classification

Image Generation

Image Retrieval

Image-text matching

Retrieval

Text Matching

Text Retrieval

Text-to-Image Generation

Zero-Shot Image Classification

Zero-shot Image Retrieval

Datasets

Introduced in the Paper:

Flickr30k-CNA

IQM

ICM

IQR

ICR

Used in the Paper:

ImageNet

Flickr30k

COCO-CN

Results from the Paper

Ranked #3 on Image Retrieval on Flickr30k-CN

Methods

R2D2
