Simple Token-Level Confidence Improves Caption Correctness

The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a relative improvement in accuracy by 10% on verb understanding in SVO-Probes and outperforms prior state-of-the-art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator provides further improved performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state-of-the-art.
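The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes that the algebraic confidence of a token is the softmax probability the captioning model assigns to it, and that word-level or sequence-level scores are obtained with a simple reduction (minimum, mean, or product). The abstract does not specify these choices, so the function names, the `how` options, and the span format below are all hypothetical and are not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(step_logits, token_ids):
    """Probability assigned to each caption token at its own decoding step.

    step_logits: one list of vocabulary logits per caption token (assumed input).
    token_ids:   the vocabulary index of each caption token.
    """
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


def aggregate(confidences, how="min"):
    """Reduce a list of token confidences to one score (hypothetical options)."""
    if not confidences:
        raise ValueError("need at least one token confidence")
    if how == "min":
        return min(confidences)
    if how == "mean":
        return sum(confidences) / len(confidences)
    if how == "prod":
        return math.prod(confidences)
    raise ValueError(f"unknown aggregation: {how}")


def word_confidences(confidences, word_spans, how="min"):
    """Aggregate token confidences per word; spans are (start, end) token indices."""
    return [aggregate(confidences[start:end], how) for start, end in word_spans]


# Toy example: a two-token caption over a two-entry vocabulary.
conf = token_confidences([[0.0, 0.0], [math.log(3.0), 0.0]], [0, 0])
sequence_score = aggregate(conf, how="min")
```

Under these assumptions, a caption with one low-confidence token gets a low score under `min` even when every other token is confident, which is one way a single fine-grained error could be surfaced.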


Results from the Paper


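Task              Dataset     Model              Metric Name  Metric Value  Global Rank
Visual Reasoning  Winoground  OFA tiny (ITM)     Text Score   22.75         #92
Visual Reasoning  Winoground  OFA tiny (ITM)     Image Score  7.75          #99
Visual Reasoning  Winoground  OFA tiny (ITM)     Group Score  4.50          #91
Visual Reasoning  Winoground  OFA base (ITM)     Text Score   26.75         #81
Visual Reasoning  Winoground  OFA base (ITM)     Image Score  10.75         #89
Visual Reasoning  Winoground  OFA base (ITM)     Group Score  6.50          #87
Visual Reasoning  Winoground  OFA large (ITM)    Text Score   30.75         #62
Visual Reasoning  Winoground  OFA large (ITM)    Image Score  10.25         #93
Visual Reasoning  Winoground  OFA large (ITM)    Group Score  7.25          #84
Visual Reasoning  Winoground  OFA tiny (TLC-A)   Text Score   16.50         #108
Visual Reasoning  Winoground  OFA tiny (TLC-A)   Image Score  15.75         #64
Visual Reasoning  Winoground  OFA tiny (TLC-A)   Group Score  6.75          #86
Visual Reasoning  Winoground  OFA base (TLC-A)   Text Score   24.50         #87
Visual Reasoning  Winoground  OFA base (TLC-A)   Image Score  23.50         #41
Visual Reasoning  Winoground  OFA base (TLC-A)   Group Score  13.75         #50
Visual Reasoning  Winoground  OFA large (TLC-A)  Text Score   29.25         #72
Visual Reasoning  Winoground  OFA large (TLC-A)  Image Score  27.00         #26
Visual Reasoning  Winoground  OFA large (TLC-A)  Group Score  17.50         #37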
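
For readers unfamiliar with the three Winoground metrics reported above, the sketch below shows how they are commonly computed from image-caption match scores. Each Winoground example pairs two captions (C0, C1) with two images (I0, I1), where C0 matches I0 and C1 matches I1. The dictionary keys and the function name are hypothetical, and the match scores could come from any scorer (for example an ITM head or an aggregated token confidence); this is an illustration of the metric definitions, not the evaluation code used for the numbers in the table.

```python
def winoground_scores(examples):
    """Compute text, image, and group scores as percentages.

    examples: list of dicts with match scores under the hypothetical keys
      "c0_i0", "c0_i1", "c1_i0", "c1_i1" (caption index, image index).
    """
    text = image = group = 0
    for s in examples:
        # Text score: for each image, the correct caption must score higher.
        text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        # Image score: for each caption, the correct image must score higher.
        image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text += text_ok
        image += image_ok
        # Group score: both conditions must hold for the same example.
        group += text_ok and image_ok
    n = len(examples)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }


# Toy data: the first example passes both tests, the second only the text test.
toy = [
    {"c0_i0": 0.9, "c0_i1": 0.2, "c1_i0": 0.3, "c1_i1": 0.8},
    {"c0_i0": 0.6, "c0_i1": 0.7, "c1_i0": 0.5, "c1_i1": 0.9},
]
scores = winoground_scores(toy)
```

Because the group score requires the text and image conditions to hold together, it can never exceed either of the other two scores, which is consistent with every row group in the table.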

Methods