Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense reasoning. The dataset comprises purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io
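The cross-modal matching task asks a model to pick the text that actually describes an image over a close distractor. As a minimal illustration of how dual-encoder scoring of this kind works (this is not the benchmark's evaluation code, and the toy vectors below merely stand in for real image/text encoder features such as CLIP embeddings):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_captions(image_emb: np.ndarray, caption_embs: list) -> list:
    # Score each candidate caption against the image embedding and
    # return candidate indices sorted from best to worst match.
    scores = [cosine_sim(image_emb, c) for c in caption_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy 4-d embeddings standing in for real encoder outputs.
image = np.array([1.0, 0.2, 0.0, 0.5])
captions = [
    np.array([0.9, 0.1, 0.1, 0.6]),    # aligned with the image
    np.array([-0.5, 1.0, 0.8, -0.2]),  # distractor
]
print(rank_captions(image, captions))  # -> [0, 1]: the aligned caption ranks first
```

A matching score above the distractor's is what the retrieval-style metrics in the results below reward; real systems differ only in how the embeddings are produced.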

Published at ICCV 2023.
Results

All results below are reported on the WHOOPS! benchmark. Values in parentheses give each entry's global rank for that metric.

Image Captioning

| Model | BLEU-4 | CIDEr |
|---|---|---|
| BLIP2 FlanT5-XXL (Fine-tuned) | 42 (#1) | 177 (#1) |
| BLIP2 FlanT5-XL (Fine-tuned) | 41 (#2) | 174 (#2) |
| BLIP2 FlanT5-XXL (Zero-shot) | 31 (#3) | 120 (#3) |
| CoCa ViT-L-14 MSCOCO | 25 (#4) | 102 (#4) |
| BLIP Large | 13 (#5) | 65 (#5) |
| OFA Large | 0 (#6) | 0 (#6) |

Image-to-Text Retrieval

| Model | Specificity |
|---|---|
| BLIP2 FlanT5-XXL (Text-only FT) | 94 (#1) |
| BLIP2 FlanT5-XXL (Fine-tuned) | 84 (#2) |
| BLIP2 FlanT5-XL (Fine-tuned) | 81 (#3) |
| BLIP Large | 77 (#4) |
| CoCa ViT-L-14 MSCOCO | 72 (#5) |
| BLIP2 FlanT5-XXL (Zero-shot) | 71 (#6) |
| CLIP ViT-L/14 | 70 (#7) |

Visual Question Answering (VQA)

| Model | Exact Match | BEM |
|---|---|---|
| BLIP2 FlanT5-XXL (Fine-tuned) | 21 (#1) | 57 (#1) |
| BLIP2 FlanT5-XL (Fine-tuned) | 20 (#2) | 55 (#2) |
| BLIP2 FlanT5-XXL (Zero-shot) | 15 (#3) | 55 (#2) |
| OFA Large | 8 (#4) | 38 (#5) |
| BLIP Large | 6 (#5) | 39 (#4) |
| BLIP2 FlanT5-XXL (Text-only FT) | 4 (#6) | 24 (#6) |

Explanation Generation

| Model | Human (%) |
|---|---|
| Ground-truth Caption -> GPT3 (Oracle) | 68 (#1) |
| Predicted Caption -> GPT3 | 33 (#2) |
| BLIP2 FlanT5-XXL (Fine-tuned) | 27 (#3) |
| BLIP2 FlanT5-XL (Fine-tuned) | 15 (#4) |
| BLIP2 FlanT5-XXL (Zero-shot) | 0 (#5) |
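Exact Match scores for VQA are typically computed by normalizing both the predicted and gold answers before comparing them. The sketch below shows one common normalization scheme (lowercasing, stripping punctuation and articles); the exact rules used for the WHOOPS! numbers above are an assumption here, not taken from the benchmark's code:

```python
import re
import string

def normalize(ans: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace --
    # the normalization conventions commonly used for VQA-style EM.
    ans = ans.lower()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())

def exact_match(predictions: list, references: list) -> float:
    # Percentage of predictions that equal their reference after normalization.
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)

preds = ["A chess board", "two players", "soccer ball"]
refs  = ["chess board", "Two players.", "a football"]
print(exact_match(preds, refs))  # -> 66.66666666666667
```

Soft metrics such as BEM instead score semantic equivalence with a learned model, which is why BEM values in the table run well above the strict Exact Match values for the same systems.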
