Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
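CC12M is distributed as a list of image-URL/caption pairs rather than the images themselves. The sketch below shows one way to stream a few pairs from such a file; it assumes a tab-separated file (e.g., the released cc12m.tsv) with the image URL in the first column and the caption in the second, which is not prescribed by the paper itself.

```python
import csv
import io
import urllib.request

from PIL import Image


def iter_image_text_pairs(tsv_path, limit=5):
    """Yield (PIL image, caption) pairs from a CC12M-style TSV file.

    Assumed row format: <image_url>\t<caption>. Rows whose image
    cannot be fetched are simply skipped.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        yielded = 0
        for row in reader:
            if yielded >= limit:
                break
            url, caption = row[0], row[1]
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    image = Image.open(io.BytesIO(resp.read())).convert("RGB")
            except Exception:
                continue  # dead link or decode error; move on
            yielded += 1
            yield image, caption


if __name__ == "__main__":
    # "cc12m.tsv" is a placeholder path for the downloaded pair list.
    for image, caption in iter_image_text_pairs("cc12m.tsv"):
        print(image.size, caption)
```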

CVPR 2021
| Task             | Dataset                 | Model   | Metric               | Value | Global Rank |
|------------------|-------------------------|---------|----------------------|-------|-------------|
| Image Captioning | nocaps-val-in-domain    | Enc-Dec | CIDEr                | 92.6  | #11         |
| Image Captioning | nocaps-val-in-domain    | Enc-Dec | SPICE                | 12.5  | #10         |
| Image Captioning | nocaps-val-in-domain    | Enc-Dec | Pre-train (#images)  | 15M   | #7          |
| Image Captioning | nocaps-val-near-domain  | Enc-Dec | CIDEr                | 88.3  | #10         |
| Image Captioning | nocaps-val-near-domain  | Enc-Dec | SPICE                | 12.1  | #9          |
| Image Captioning | nocaps-val-out-domain   | Enc-Dec | CIDEr                | 94.5  | #9          |
| Image Captioning | nocaps-val-out-domain   | Enc-Dec | SPICE                | 11.9  | #9          |
| Image Captioning | nocaps-val-overall      | Enc-Dec | CIDEr                | 90.2  | #10         |
| Image Captioning | nocaps-val-overall      | Enc-Dec | SPICE                | 12.1  | #9          |
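The CIDEr and SPICE values above follow the standard COCO-caption metrics; official nocaps numbers come from its evaluation server. As an illustration only (not the authors' evaluation code), a toy CIDEr computation with the pycocoevalcap package could look like the sketch below; the image ids and captions are made up, and PTBTokenizer requires a local Java runtime.

```python
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer

# Toy references and predictions keyed by image id (hypothetical data).
references = {
    "img1": [{"caption": "a dog runs across a grassy field"}],
    "img2": [{"caption": "two people ride bicycles down a street"}],
}
predictions = {
    "img1": [{"caption": "a dog running on grass"}],
    "img2": [{"caption": "people riding bikes on a road"}],
}

# Tokenize into {image_id: ["tokenized caption", ...]} as the scorers expect.
tokenizer = PTBTokenizer()
gts = tokenizer.tokenize(references)
res = tokenizer.tokenize(predictions)

# CIDEr computes tf-idf weighted n-gram similarity over the whole corpus,
# so scores on a two-image toy set are not meaningful, only illustrative.
corpus_score, per_image_scores = Cider().compute_score(gts, res)
print(f"CIDEr: {corpus_score:.3f}")
```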
