TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Human Judgment Correlation	Flickr8k-CF	RefCLIP-S	Kendall's Tau-b	36.4	# 2
Human Judgment Correlation	Flickr8k-CF	CLIP-S	Kendall's Tau-b	34.4	# 3
Human Judgment Correlation	Flickr8k-Expert	RefCLIP-S	Kendall's Tau-c	53.0	# 3
Human Judgment Correlation	Flickr8k-Expert	CLIP-S	Kendall's Tau-c	51.2	# 4
Hallucination Pair-wise Detection (4-ref)	FOIL	RefCLIP-S	Mean Accuracy	92.6	# 1
Hallucination Pair-wise Detection (1-ref)	FOIL	CLIP-S	Mean Accuracy	91	# 1
Hallucination Pair-wise Detection (4-ref)	FOIL	CLIP-S	Mean Accuracy	87.2	# 3
Human Judgment Classification	Pascal-50S	CLIP-S	Mean Accuracy	80.7	# 3
Human Judgment Classification	Pascal-50S	RefCLIP-S	Mean Accuracy	83.1	# 2

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

EMNLP 2021 · Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
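The abstract describes CLIPScore as a measure of image-text compatibility and RefCLIPScore as its reference-augmented variant, but does not spell out the computation. The sketch below is one plausible reading, assuming the definitions as commonly cited from the paper: CLIP-S is a rescaled, clipped cosine similarity between the image and caption embeddings, and RefCLIP-S is the harmonic mean of CLIP-S and the best clipped caption-to-reference similarity. The weight 2.5, the function names, and the plain-list embeddings are assumptions made for illustration; in practice the embeddings would come from CLIP's image and text encoders, which are not shown here.

```python
import math
from typing import Sequence

# Rescaling weight. Assumption: 2.5 is the value commonly cited for CLIPScore;
# it does not appear in the text above.
W = 2.5


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_s(image_emb: Sequence[float], caption_emb: Sequence[float], w: float = W) -> float:
    """Reference-free score: w * max(cos(caption, image), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_s(
    image_emb: Sequence[float],
    caption_emb: Sequence[float],
    ref_embs: Sequence[Sequence[float]],
    w: float = W,
) -> float:
    """Reference-augmented score: harmonic mean of CLIP-S and the
    best clipped caption-to-reference cosine similarity."""
    s = clip_s(image_emb, caption_emb, w)
    r = max(max(cosine(caption_emb, ref) for ref in ref_embs), 0.0)
    if s == 0.0 or r == 0.0:
        return 0.0  # harmonic mean is zero when either term is zero
    return 2.0 * s * r / (s + r)


if __name__ == "__main__":
    # Toy 2-d vectors standing in for CLIP embeddings (hypothetical values).
    image = [1.0, 0.0]
    caption = [1.0, 0.0]
    references = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_s(image, caption))
    print(ref_clip_s(image, caption, references))
```

Under these assumptions the reference-free score needs only the image and the candidate caption, which is what lets it be applied where no human-written references exist.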

Code

jmhessel/clipscore official

showlab/loveu-tgve-2023

jmhessel/pycocoevalcap

Tasks

Hallucination Pair-wise Detection (1-ref)

Hallucination Pair-wise Detection (4-ref)

Human Judgment Classification

Human Judgment Correlation

Image Captioning

Results from the Paper

Ranked #1 on Hallucination Pair-wise Detection (4-ref) on FOIL

