What You See is What You Read? Improving Text-Image Alignment Evaluation

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
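The question-generation plus VQA pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_qa_pairs` and `vqa_answer` are hypothetical callables standing in for the actual QG and VQA models, and the exact-match scoring is a simplification of the paper's answer validation.

```python
# Hedged sketch of a QG/VQA-based alignment score and candidate re-ranking.
# generate_qa_pairs and vqa_answer are placeholder callables, not real APIs:
# the first turns a caption into (question, expected answer) pairs, the
# second answers a question about an image.

from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, expected answer derived from the text)


def alignment_score(
    text: str,
    image: object,
    generate_qa_pairs: Callable[[str], List[QAPair]],
    vqa_answer: Callable[[object, str], str],
) -> float:
    """Fraction of text-derived questions the VQA model answers as expected."""
    pairs = generate_qa_pairs(text)
    if not pairs:
        return 0.0
    hits = sum(
        vqa_answer(image, question).strip().lower() == answer.strip().lower()
        for question, answer in pairs
    )
    return hits / len(pairs)


def rerank(text, candidate_images, generate_qa_pairs, vqa_answer):
    """Order generated candidate images by descending alignment score."""
    return sorted(
        candidate_images,
        key=lambda img: alignment_score(text, img, generate_qa_pairs, vqa_answer),
        reverse=True,
    )
```

Because each question targets one piece of the text, a wrong VQA answer also localizes *which* part of the text the image fails to depict, which is the misalignment-localization use mentioned above.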

NeurIPS 2023

Results from the Paper


Task: Visual Reasoning | Dataset: Winoground (global rank in parentheses)

Model                                  Text Score     Image Score    Group Score
COCA ViT-L14 (ft. on COCO)             28.25 (#76)    11.50 (#86)     8.25 (#78)
OFA large (ft. SNLI-VE)                27.70 (#80)    14.30 (#72)     9.00 (#75)
CLIP RN50x64                           26.50 (#82)    13.75 (#77)    10.25 (#68)
TIFA                                   19.00 (#102)   12.50 (#83)    11.30 (#63)
BLIP2 (ft. COCO)                       44.00 (#20)    26.00 (#29)    23.50 (#20)
PaLI (ft. SNLI-VE)                     45.00 (#17)    41.50 (#12)    28.70 (#15)
PaLI (ft. SNLI-VE + Synthetic Data)    46.50 (#14)    38.00 (#14)    28.75 (#14)
VQ2                                    47.00 (#11)    42.20 (#11)    30.50 (#13)
