CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
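
The abstract does not spell out how the two scores are computed. The following is a minimal sketch, assuming the form described in the body of the paper: CLIPScore as a rescaled, clipped cosine similarity between the image and caption embeddings, and RefCLIPScore as the harmonic mean of CLIPScore and the caption's best clipped cosine similarity to any reference. The weight w = 2.5, the function names, and the toy vectors are assumptions for illustration; real embeddings would come from CLIP's image and text encoders, which are not loaded here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free score: w * max(cos(caption, image), 0). w = 2.5 is assumed."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Reference-augmented score: harmonic mean of the CLIPScore and the
    best clipped caption-to-reference cosine similarity."""
    clip_s = clip_score(image_emb, caption_emb, w)
    ref_sim = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_sim == 0.0:
        return 0.0  # harmonic mean is zero when either term is zero
    return 2.0 * clip_s * ref_sim / (clip_s + ref_sim)


# Hypothetical 2-d embeddings, purely for illustration.
image = [1.0, 0.0]
caption = [1.0, 0.0]
references = [[1.0, 0.0], [0.0, 1.0]]

print(clip_score(image, caption))
print(ref_clip_score(image, caption, references))
```

Clipping at zero keeps the score non-negative, and the harmonic mean only rewards captions that agree with both the image and at least one reference.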

EMNLP 2021

Task                                       Dataset          Model      Metric Name      Metric Value  Global Rank
Human Judgment Correlation                 Flickr8k-CF      RefCLIP-S  Kendall's Tau-b  36.4          #2
Human Judgment Correlation                 Flickr8k-CF      CLIP-S     Kendall's Tau-b  34.4          #3
Human Judgment Correlation                 Flickr8k-Expert  RefCLIP-S  Kendall's Tau-c  53.0          #2
Human Judgment Correlation                 Flickr8k-Expert  CLIP-S     Kendall's Tau-c  51.2          #3
Hallucination Pair-wise Detection (4-ref)  FOIL             RefCLIP-S  Mean Accuracy    92.6          #1
Hallucination Pair-wise Detection (1-ref)  FOIL             CLIP-S     Mean Accuracy    91            #1
Hallucination Pair-wise Detection (4-ref)  FOIL             CLIP-S     Mean Accuracy    87.2          #3
Human Judgment Classification              Pascal-50S       CLIP-S     Mean Accuracy    80.7          #3
Human Judgment Classification              Pascal-50S       RefCLIP-S  Mean Accuracy    83.1          #2
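
The human-judgment correlation rows report Kendall's tau, a rank correlation between metric scores and human ratings. As a rough illustration of what is being measured, here is a minimal pure-Python sketch of the tau-b variant, which corrects for ties. The function name and the toy scores are hypothetical, and the values listed for the benchmark presumably correspond to the coefficient multiplied by 100.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pair scan)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in neither denominator term
            if dx == 0:
                ties_x += 1  # tied only in xs
            elif dy == 0:
                ties_y += 1  # tied only in ys
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    if denom == 0.0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Hypothetical metric scores and human ratings for four captions.
metric_scores = [0.1, 0.4, 0.4, 0.9]
human_ratings = [1, 2, 3, 3]
print(kendall_tau_b(metric_scores, human_ratings))
```

In practice a library routine such as `scipy.stats.kendalltau`, which accepts a `variant` argument for tau-b and tau-c, would normally be used instead of a hand-written loop.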
