ClipCap: CLIP Prefix for Image Captioning

18 Nov 2021 · Ron Mokady, Amir Hertz, Amit H. Bermano

Image captioning is a fundamental task in vision-language understanding, where a model produces an informative textual caption for a given input image. In this paper, we present a simple approach to this task. We use a CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features that were trained with textual context, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
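The core idea above — a small mapping network that turns a single CLIP image embedding into a sequence of prefix embeddings for GPT-2 — can be sketched as follows. This is a minimal, hypothetical illustration of the MLP variant, not the authors' implementation: the dimensions assume CLIP ViT-B/32 (512-d embeddings) and GPT-2 small (768-d token embeddings), and the prefix length, hidden size, and initialization are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: CLIP ViT-B/32 embedding (512), GPT-2 small
# token-embedding size (768); prefix_length is a hyperparameter.
clip_dim, gpt2_dim, prefix_length = 512, 768, 10
hidden = (gpt2_dim * prefix_length) // 2

# Two-layer MLP mapping network: clip_dim -> hidden -> prefix_length * gpt2_dim.
# In ClipCap this is the only component that must be trained when CLIP and
# GPT-2 are kept frozen.
W1 = rng.standard_normal((clip_dim, hidden)) * 0.02
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, prefix_length * gpt2_dim)) * 0.02
b2 = np.zeros(prefix_length * gpt2_dim)

def map_clip_to_prefix(clip_embedding: np.ndarray) -> np.ndarray:
    """Project one (clip_dim,) CLIP embedding to a (prefix_length, gpt2_dim)
    block of prefix embeddings, which would be concatenated in front of the
    caption's token embeddings before being fed to the language model."""
    h = np.tanh(clip_embedding @ W1 + b1)
    return (h @ W2 + b2).reshape(prefix_length, gpt2_dim)

# Example: a stand-in for a CLIP image embedding.
prefix = map_clip_to_prefix(rng.standard_normal(clip_dim))
print(prefix.shape)  # (10, 768)
```

At caption time, the language model attends to these prefix embeddings exactly as it would to ordinary word embeddings, so no architectural change to GPT-2 is needed.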

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image Captioning | COCO Captions | ClipCap (Transformer) | BLEU-4 | 33.53 | #29 |
| Image Captioning | COCO Captions | ClipCap (Transformer) | METEOR | 27.45 | #24 |
| Image Captioning | COCO Captions | ClipCap (Transformer) | CIDEr | 113.08 | #29 |
| Image Captioning | COCO Captions | ClipCap (Transformer) | SPICE | 21.05 | #24 |
| Image Captioning | COCO Captions | ClipCap (MLP + GPT2 tuning) | BLEU-4 | 32.15 | #30 |
| Image Captioning | COCO Captions | ClipCap (MLP + GPT2 tuning) | METEOR | 27.1 | #25 |
| Image Captioning | COCO Captions | ClipCap (MLP + GPT2 tuning) | CIDEr | 108.35 | #30 |
| Image Captioning | COCO Captions | ClipCap (MLP + GPT2 tuning) | SPICE | 20.12 | #25 |
| Image Captioning | Conceptual Captions | ClipCap (MLP + GPT2 tuning) | ROUGE-L | 26.71 | #1 |
| Image Captioning | Conceptual Captions | ClipCap (MLP + GPT2 tuning) | CIDEr | 87.26 | #1 |
| Image Captioning | Conceptual Captions | ClipCap (MLP + GPT2 tuning) | SPICE | 18.5 | #1 |
| Image Captioning | Conceptual Captions | ClipCap (Transformer) | ROUGE-L | 25.12 | #2 |
| Image Captioning | Conceptual Captions | ClipCap (Transformer) | CIDEr | 71.82 | #2 |
| Image Captioning | Conceptual Captions | ClipCap (Transformer) | SPICE | 16.07 | #2 |
| Image Captioning | nocaps entire | ClipCap (Transformer) | CIDEr | 65.83 | #26 |
| Image Captioning | nocaps entire | ClipCap (Transformer) | SPICE | 10.86 | #27 |
| Image Captioning | nocaps entire | ClipCap (MLP + GPT2 tuning) | CIDEr | 65.7 | #27 |
| Image Captioning | nocaps entire | ClipCap (MLP + GPT2 tuning) | SPICE | 11.1 | #25 |
| Image Captioning | nocaps in-domain | ClipCap (Transformer) | CIDEr | 84.85 | #23 |
| Image Captioning | nocaps in-domain | ClipCap (Transformer) | SPICE | 12.14 | #26 |
| Image Captioning | nocaps in-domain | ClipCap (MLP + GPT2 tuning) | CIDEr | 79.73 | #28 |
| Image Captioning | nocaps in-domain | ClipCap (MLP + GPT2 tuning) | SPICE | 12.2 | #25 |
| Image Captioning | nocaps near-domain | ClipCap (Transformer) | CIDEr | 66.82 | #27 |
| Image Captioning | nocaps near-domain | ClipCap (Transformer) | SPICE | 10.92 | #29 |
| Image Captioning | nocaps near-domain | ClipCap (MLP + GPT2 tuning) | CIDEr | 67.69 | #26 |
| Image Captioning | nocaps near-domain | ClipCap (MLP + GPT2 tuning) | SPICE | 11.26 | #27 |
| Image Captioning | nocaps out-of-domain | ClipCap (MLP + GPT2 tuning) | CIDEr | 49.35 | #28 |
| Image Captioning | nocaps out-of-domain | ClipCap (MLP + GPT2 tuning) | SPICE | 9.7 | #27 |
| Image Captioning | nocaps out-of-domain | ClipCap (Transformer) | CIDEr | 49.14 | #29 |
| Image Captioning | nocaps out-of-domain | ClipCap (Transformer) | SPICE | 9.57 | #28 |