CCMB: A Large-scale Chinese Cross-modal Benchmark

Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on various downstream tasks. In contrast to the many benchmarks available for English corpora, large-scale pre-training datasets and downstream datasets for Chinese remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset, Zero, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest ones for Chinese cross-modal downstream tasks. Along with CCMB, we also develop a VLP framework named R2D2, which applies a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided Distillation and feature-guided Distillation) to further enhance the learning capability. With Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets from five broad categories of tasks: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
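
The abstract names a pre-Ranking + Ranking strategy without giving details. One common reading of such a two-stage design is a cheap similarity pass that shortlists candidates, followed by a more expensive scorer that reorders only the shortlist. The sketch below illustrates that general pattern under this assumption; it is not the R2D2 implementation. The function names, the dot-product pre-ranker, and the distance-based `toy_fine_score` are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a cheap similarity."""
    return sum(x * y for x, y in zip(a, b))


def pre_rank(query: Vector, items: List[Vector], k: int) -> List[int]:
    """Stage 1 (assumed): shortlist the k items with the highest dot product."""
    order = sorted(range(len(items)), key=lambda i: dot(query, items[i]), reverse=True)
    return order[:k]


def rank(query: Vector, items: List[Vector], shortlist: List[int],
         fine_score: Callable[[Vector, Vector], float]) -> List[int]:
    """Stage 2 (assumed): reorder only the shortlist with a costlier scorer."""
    return sorted(shortlist, key=lambda i: fine_score(query, items[i]), reverse=True)


def toy_fine_score(q: Vector, v: Vector) -> float:
    """Hypothetical fine-grained scorer: negative squared Euclidean distance."""
    return -sum((x - y) ** 2 for x, y in zip(q, v))


query = [1.0, 0.0]
items = [[0.9, 0.1], [0.8, 0.9], [0.1, 1.0], [0.7, 0.0]]

shortlist = pre_rank(query, items, k=3)          # item 2 is dropped here
final = rank(query, items, shortlist, toy_fine_score)
print(shortlist, final)
```

The point of the split is cost: the expensive scorer runs on k candidates rather than on the whole collection, so its quality is bounded by what the first stage lets through.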

Results from the Paper


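| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Retrieval | COCO-CN | R2D2 (ViT-L/14) | R@1 | 79.1 | #3 |
| Image Retrieval | COCO-CN | R2D2 (ViT-L/14) | R@5 | 96.5 | #4 |
| Image Retrieval | COCO-CN | R2D2 (ViT-L/14) | R@10 | 98.9 | #5 |
| Image Retrieval | COCO-CN | R2D2 (ViT-B) | R@1 | 75.1 | #6 |
| Image Retrieval | COCO-CN | R2D2 (ViT-B) | R@5 | 94.2 | #7 |
| Image Retrieval | COCO-CN | R2D2 (ViT-B) | R@10 | 98.1 | #6 |
| Zero-shot Image Retrieval | COCO-CN | R2D2 (ViT-L/14) | R@1 | 56.4 | #9 |
| Zero-shot Image Retrieval | COCO-CN | R2D2 (ViT-L/14) | R@5 | 85.0 | #9 |
| Zero-shot Image Retrieval | COCO-CN | R2D2 (ViT-L/14) | R@10 | 93.1 | #9 |
| Zero-shot Image Retrieval | Flickr30k-CN | R2D2 (ViT-L/14) | R@1 | 60.9 | #10 |
| Zero-shot Image Retrieval | Flickr30k-CN | R2D2 (ViT-L/14) | R@5 | 86.8 | #10 |
| Zero-shot Image Retrieval | Flickr30k-CN | R2D2 (ViT-L/14) | R@10 | 92.7 | #10 |
| Image Retrieval | Flickr30k-CN | R2D2 (ViT-L/14) | R@1 | 84.4 | #3 |
| Image Retrieval | Flickr30k-CN | R2D2 (ViT-L/14) | R@5 | 96.7 | #5 |
| Image Retrieval | Flickr30k-CN | R2D2 (ViT-L/14) | R@10 | 98.4 | #4 |
| Image Retrieval | Flickr30k-CN | R2D2 (ViT-B) | R@1 | 78.3 | #8 |
| Image Retrieval | Flickr30k-CN | R2D2 (ViT-B) | R@5 | 94.6 | #8 |
| Image Retrieval | Flickr30k-CN | R2D2 (ViT-B) | R@10 | 97.0 | #7 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-B) | R@1 | 47.4 | #8 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-B) | R@5 | 75.1 | #7 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-B) | R@10 | 83.5 | #8 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-B) | Mean Recall | 68.7 | #8 |
| Zero-shot Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | R@1 | 49.5 | #5 |
| Zero-shot Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | R@5 | 75.7 | #5 |
| Zero-shot Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | R@10 | 83.2 | #5 |
| Zero-shot Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | Mean Recall | 69.5 | #5 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | R@1 | 60.1 | #4 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | R@5 | 82.9 | #5 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | R@10 | 89.4 | #5 |
| Image Retrieval | MUGE Retrieval | R2D2 (ViT-L/14) | Mean Recall | 77.5 | #4 |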
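
The R@K and Mean Recall metrics in the results above can be made concrete with a small sketch. It assumes the usual retrieval definition of R@K (the percentage of queries with at least one relevant item among the top K results) and assumes Mean Recall is the average of R@1, R@5, and R@10, which is consistent with the three MUGE rows reported here. The helper names and the toy similarity matrix are illustrative only and do not come from the paper's code.

```python
from typing import List, Sequence, Set


def recall_at_k(sim: Sequence[Sequence[float]], relevant: List[Set[int]], k: int) -> float:
    """Percentage of queries whose top-k results contain a relevant item.

    sim[q][i] is the similarity of query q to candidate i;
    relevant[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if any(i in relevant[q] for i in top_k):
            hits += 1
    return 100.0 * hits / len(sim)


def mean_recall(r1: float, r5: float, r10: float) -> float:
    """Assumed definition: arithmetic mean of R@1, R@5, and R@10."""
    return (r1 + r5 + r10) / 3


# Toy example: 3 text queries scored against 3 candidate images.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
relevant = [{0}, {1}, {0}]
r_at_1 = recall_at_k(sim, relevant, 1)
r_at_2 = recall_at_k(sim, relevant, 2)

# Check the assumed Mean Recall definition against the reported MUGE rows.
muge_vit_b = round(mean_recall(47.4, 75.1, 83.5), 1)
muge_zero_shot = round(mean_recall(49.5, 75.7, 83.2), 1)
muge_vit_l = round(mean_recall(60.1, 82.9, 89.4), 1)
print(r_at_1, r_at_2, muge_vit_b, muge_zero_shot, muge_vit_l)
```

Under this definition the three recomputed MUGE averages round to the reported Mean Recall values of 68.7, 69.5, and 77.5.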

Methods