What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
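The question-generation plus visual-question-answering pipeline described in the abstract can be illustrated with a short sketch. This is a minimal, hedged example: the specific Hugging Face models (`google/flan-t5-base`, `dandelin/vilt-b32-finetuned-vqa`), the prompt format, and the yes/no answer-matching heuristic are illustrative assumptions, not the authors' exact VQ2 implementation.

```python
# Minimal sketch of a question-generation + VQA alignment scorer,
# in the spirit of a QG->VQA pipeline. Model choices, the prompt, and the
# "count the yes answers" heuristic are assumptions for illustration only.
from PIL import Image
from transformers import pipeline

# Any seq2seq QG model and any VQA model could be plugged in here.
question_gen = pipeline("text2text-generation", model="google/flan-t5-base")
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")


def alignment_score(image_path: str, caption: str, num_questions: int = 5) -> float:
    """Estimate text-image alignment as the fraction of caption-derived
    yes/no questions that a VQA model answers with "yes" on the image."""
    image = Image.open(image_path)

    # 1) Generate yes/no questions that should all be true if the caption
    #    holds for the image (simple prompt-based QG, an assumption).
    prompt = (
        f"Write {num_questions} yes/no questions that are all true "
        f"for this description: {caption}"
    )
    generated = question_gen(prompt, max_new_tokens=128)[0]["generated_text"]
    questions = [
        q.strip() + "?"
        for q in generated.replace("\n", " ").split("?")
        if q.strip()
    ][:num_questions]
    if not questions:
        return 0.0

    # 2) Ask each question against the image and check for a "yes" answer.
    hits = 0
    for question in questions:
        answers = vqa(image=image, question=question)
        if answers and answers[0]["answer"].lower().startswith("yes"):
            hits += 1

    # 3) Alignment score = fraction of questions answered "yes".
    return hits / len(questions)


print(alignment_score("generated.png", "a red bicycle leaning against a green door"))
```

The same score can then be used for the re-ranking application mentioned in the abstract, e.g. `sorted(candidates, key=lambda p: alignment_score(p, caption), reverse=True)` to pick the text-to-image candidate with the highest estimated alignment.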
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Reasoning | Winoground | COCA ViT-L14 (f.t. on COCO) | Text Score | 28.25 | #76 |
| Visual Reasoning | Winoground | COCA ViT-L14 (f.t. on COCO) | Image Score | 11.50 | #86 |
| Visual Reasoning | Winoground | COCA ViT-L14 (f.t. on COCO) | Group Score | 8.25 | #78 |
| Visual Reasoning | Winoground | OFA large (ft SNLI-VE) | Text Score | 27.70 | #80 |
| Visual Reasoning | Winoground | OFA large (ft SNLI-VE) | Image Score | 14.30 | #72 |
| Visual Reasoning | Winoground | OFA large (ft SNLI-VE) | Group Score | 9.00 | #75 |
| Visual Reasoning | Winoground | CLIP RN50x64 | Text Score | 26.50 | #82 |
| Visual Reasoning | Winoground | CLIP RN50x64 | Image Score | 13.75 | #77 |
| Visual Reasoning | Winoground | CLIP RN50x64 | Group Score | 10.25 | #68 |
| Visual Reasoning | Winoground | TIFA | Text Score | 19.00 | #102 |
| Visual Reasoning | Winoground | TIFA | Image Score | 12.50 | #83 |
| Visual Reasoning | Winoground | TIFA | Group Score | 11.30 | #63 |
| Visual Reasoning | Winoground | BLIP2 (ft COCO) | Text Score | 44.00 | #20 |
| Visual Reasoning | Winoground | BLIP2 (ft COCO) | Image Score | 26.00 | #29 |
| Visual Reasoning | Winoground | BLIP2 (ft COCO) | Group Score | 23.50 | #20 |
| Visual Reasoning | Winoground | PaLI (ft SNLI-VE) | Text Score | 45.00 | #17 |
| Visual Reasoning | Winoground | PaLI (ft SNLI-VE) | Image Score | 41.50 | #12 |
| Visual Reasoning | Winoground | PaLI (ft SNLI-VE) | Group Score | 28.70 | #15 |
| Visual Reasoning | Winoground | PaLI (ft SNLI-VE + Synthetic Data) | Text Score | 46.5 | #14 |
| Visual Reasoning | Winoground | PaLI (ft SNLI-VE + Synthetic Data) | Image Score | 38 | #14 |
| Visual Reasoning | Winoground | PaLI (ft SNLI-VE + Synthetic Data) | Group Score | 28.75 | #14 |
| Visual Reasoning | Winoground | VQ2 | Text Score | 47 | #11 |
| Visual Reasoning | Winoground | VQ2 | Image Score | 42.2 | #11 |
| Visual Reasoning | Winoground | VQ2 | Group Score | 30.5 | #13 |