Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense reasoning. The dataset comprises purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io
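The cross-modal matching task asks a model to pick the text that actually describes an image over a close distractor. As a minimal illustration of how dual-encoder scoring of this kind works (this is not the benchmark's evaluation code, and the toy vectors below merely stand in for real image/text encoder features such as CLIP embeddings):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_captions(image_emb: np.ndarray, caption_embs: list) -> list:
    # Score each candidate caption against the image embedding and
    # return candidate indices sorted from best to worst match.
    scores = [cosine_sim(image_emb, c) for c in caption_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy 4-d embeddings standing in for real encoder outputs.
image = np.array([1.0, 0.2, 0.0, 0.5])
captions = [
    np.array([0.9, 0.1, 0.1, 0.6]),    # aligned with the image
    np.array([-0.5, 1.0, 0.8, -0.2]),  # distractor
]
print(rank_captions(image, captions))  # -> [0, 1]: the aligned caption ranks first
```

A matching score above the distractor's is what the retrieval-style metrics in the results below reward; real systems differ only in how the embeddings are produced.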

Published at ICCV 2023.
Results

All results below are reported on the WHOOPS! benchmark. Values in parentheses give each entry's global rank for that metric.

Image Captioning

| Model | BLEU-4 | CIDEr |
|---|---|---|
| BLIP2 FlanT5-XXL (Fine-tuned) | 42 (#1) | 177 (#1) |
| BLIP2 FlanT5-XL (Fine-tuned) | 41 (#2) | 174 (#2) |
| BLIP2 FlanT5-XXL (Zero-shot) | 31 (#3) | 120 (#3) |
| CoCa ViT-L-14 MSCOCO | 25 (#4) | 102 (#4) |
| BLIP Large | 13 (#5) | 65 (#5) |
| OFA Large | 0 (#6) | 0 (#6) |

Image-to-Text Retrieval

| Model | Specificity |
|---|---|
| BLIP2 FlanT5-XXL (Text-only FT) | 94 (#1) |
| BLIP2 FlanT5-XXL (Fine-tuned) | 84 (#2) |
| BLIP2 FlanT5-XL (Fine-tuned) | 81 (#3) |
| BLIP Large | 77 (#4) |
| CoCa ViT-L-14 MSCOCO | 72 (#5) |
| BLIP2 FlanT5-XXL (Zero-shot) | 71 (#6) |
| CLIP ViT-L/14 | 70 (#7) |

Visual Question Answering (VQA)

| Model | Exact Match | BEM |
|---|---|---|
| BLIP2 FlanT5-XXL (Fine-tuned) | 21 (#1) | 57 (#1) |
| BLIP2 FlanT5-XL (Fine-tuned) | 20 (#2) | 55 (#2) |
| BLIP2 FlanT5-XXL (Zero-shot) | 15 (#3) | 55 (#2) |
| OFA Large | 8 (#4) | 38 (#5) |
| BLIP Large | 6 (#5) | 39 (#4) |
| BLIP2 FlanT5-XXL (Text-only FT) | 4 (#6) | 24 (#6) |

Explanation Generation

| Model | Human (%) |
|---|---|
| Ground-truth Caption -> GPT3 (Oracle) | 68 (#1) |
| Predicted Caption -> GPT3 | 33 (#2) |
| BLIP2 FlanT5-XXL (Fine-tuned) | 27 (#3) |
| BLIP2 FlanT5-XL (Fine-tuned) | 15 (#4) |
| BLIP2 FlanT5-XXL (Zero-shot) | 0 (#5) |
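Exact Match scores for VQA are typically computed by normalizing both the predicted and gold answers before comparing them. The sketch below shows one common normalization scheme (lowercasing, stripping punctuation and articles); the exact rules used for the WHOOPS! numbers above are an assumption here, not taken from the benchmark's code:

```python
import re
import string

def normalize(ans: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace --
    # the normalization conventions commonly used for VQA-style EM.
    ans = ans.lower()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())

def exact_match(predictions: list, references: list) -> float:
    # Percentage of predictions that equal their reference after normalization.
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)

preds = ["A chess board", "two players", "soccer ball"]
refs  = ["chess board", "Two players.", "a football"]
print(exact_match(preds, refs))  # -> 66.66666666666667
```

Soft metrics such as BEM instead score semantic equivalence with a learned model, which is why BEM values in the table run well above the strict Exact Match values for the same systems.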
