CREPE: Can Vision-Language Foundation Models Reason Compositionally?

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs spanning nine levels of complexity, plus $183K$ hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
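For concreteness, below is a minimal sketch of how image-to-text Recall@1 with hard negative captions can be computed for a CLIP-style dual encoder. This is not the official CREPE evaluation code: the function name, the embedding dimensionality, and the retrieval-set size (one ground-truth caption plus hard negatives) are illustrative assumptions.

```python
# Minimal sketch of Recall@1 over per-image retrieval sets that contain the
# ground-truth caption plus hard negative captions. Embeddings are assumed to
# come from any vision-language dual encoder (e.g. a CLIP-style model);
# random embeddings are used here purely to make the example runnable.
import numpy as np

def recall_at_1(image_embs: np.ndarray, candidate_embs: np.ndarray) -> float:
    """image_embs: (N, D); candidate_embs: (N, K, D), where index 0 along K
    is the ground-truth caption and indices 1..K-1 are hard negatives."""
    # Normalize so the dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    candidate_embs = candidate_embs / np.linalg.norm(candidate_embs, axis=-1, keepdims=True)
    # Similarity of each image to its own K candidate captions: shape (N, K).
    sims = np.einsum('nd,nkd->nk', image_embs, candidate_embs)
    # Recall@1 = fraction of images whose top-ranked candidate is the ground truth.
    return float(np.mean(sims.argmax(axis=1) == 0))

# Toy usage: 100 images, 5 candidate captions each (1 positive + 4 hard negatives).
rng = np.random.default_rng(0)
images = rng.normal(size=(100, 512))
candidates = rng.normal(size=(100, 5, 512))
print(f"Recall@1: {recall_at_1(images, candidates):.2%}")  # ~20% for random embeddings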

Task: Image Retrieval · Dataset: CREPE (Compositional REPresentation Evaluation) · Metric: Recall@1 (%), global rank in parentheses

| Model | HN-Atom + HN-Comp (SC) | HN-Atom + HN-Comp (UC) | HN-Atom (UC) | HN-Comp (UC) |
|---|---|---|---|---|
| ViT-L-14 (LAION400M) | 39.44 (#1) | 33.81 (#1) | 47.86 (#1) | 60.78 (#6) |
| ViT-B-16 (LAION400M) | 37.01 (#3) | 30.81 (#3) | 44.93 (#3) | 59.00 (#8) |
| ViT-B-16+240 (LAION400M) | 37.32 (#2) | 32.26 (#2) | 46.53 (#2) | 60.19 (#7) |
| Random | 9.09 (#8) | 9.09 (#8) | 20.00 (#22) | 14.29 (#22) |
| ViT-B-32 (LAION400M) | 34.28 (#4) | 28.00 (#4) | 42.75 (#6) | 54.80 (#9) |
| RN101 (YFCC15M) | 22.74 (#7) | 20.50 (#5) | 39.50 (#12) | 39.56 (#19) |
| RN50 (YFCC15M) | 23.38 (#5) | 20.08 (#6) | 39.85 (#10) | 39.83 (#17) |
| RN50 (CC12M) | 23.26 (#6) | 19.96 (#7) | 34.88 (#21) | 45.27 (#13) |
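The Random row is consistent with uniform guessing over the candidate captions: assuming retrieval sets of 11, 5, and 7 candidates for the respective settings, chance Recall@1 is $1/11 \approx 9.09\%$, $1/5 = 20.00\%$, and $1/7 \approx 14.29\%$.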
