TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Reasoning	Winoground	NegCLIP	Text Score	29.5	# 71
Visual Reasoning	Winoground	NegCLIP	Image Score	10.5	# 91
Visual Reasoning	Winoground	NegCLIP	Group Score	8.0	# 80
Visual Reasoning	Winoground	MiniGPT-4	Text Score	23.3	# 90
Visual Reasoning	Winoground	MiniGPT-4	Image Score	18.0	# 56
Visual Reasoning	Winoground	MiniGPT-4	Group Score	9.5	# 71
Visual Reasoning	Winoground	LLaVA	Text Score	24.8	# 86
Visual Reasoning	Winoground	LLaVA	Image Score	25.0	# 34
Visual Reasoning	Winoground	LLaVA	Group Score	13.0	# 52
Visual Reasoning	Winoground	BLIP	Text Score	39.0	# 39
Visual Reasoning	Winoground	BLIP	Image Score	19.2	# 54
Visual Reasoning	Winoground	BLIP	Group Score	15.0	# 44
Visual Reasoning	Winoground	BLIP2	Text Score	42.0	# 30
Visual Reasoning	Winoground	BLIP2	Image Score	23.8	# 40
Visual Reasoning	Winoground	BLIP2	Group Score	19.0	# 32
Visual Reasoning	Winoground	NegBLIP2	Text Score	41.5	# 33
Visual Reasoning	Winoground	NegBLIP2	Image Score	26.0	# 29
Visual Reasoning	Winoground	NegBLIP2	Group Score	20.5	# 29
Visual Reasoning	Winoground	NegBLIP	Text Score	42.5	# 27
Visual Reasoning	Winoground	NegBLIP	Image Score	24.0	# 39
Visual Reasoning	Winoground	NegBLIP	Group Score	18.5	# 35
Visual Reasoning	Winoground	BLIP2 (SGVL)	Text Score	42.8	# 24
Visual Reasoning	Winoground	BLIP2 (SGVL)	Image Score	28.5	# 22
Visual Reasoning	Winoground	BLIP2 (SGVL)	Group Score	23.3	# 21
Visual Reasoning	Winoground	CLIP (SGVL)	Text Score	32.0	# 60
Visual Reasoning	Winoground	CLIP (SGVL)	Image Score	14.0	# 73
Visual Reasoning	Winoground	CLIP (SGVL)	Group Score	9.8	# 70
Visual Reasoning	Winoground	BLIP (SGVL)	Text Score	42.8	# 24
Visual Reasoning	Winoground	BLIP (SGVL)	Image Score	27.3	# 25
Visual Reasoning	Winoground	BLIP (SGVL)	Group Score	21.5	# 25
Visual Reasoning	Winoground	BLIP (+Graph Text, +Graph Neg)	Text Score	40.5	# 34
Visual Reasoning	Winoground	BLIP (+Graph Text, +Graph Neg)	Image Score	25.5	# 33
Visual Reasoning	Winoground	BLIP (+Graph Text, +Graph Neg)	Group Score	19.0	# 32
Visual Reasoning	Winoground	BLIP (+Graph Text)	Text Score	40.3	# 35
Visual Reasoning	Winoground	BLIP (+Graph Text)	Image Score	20.5	# 50
Visual Reasoning	Winoground	BLIP (+Graph Text)	Group Score	16.5	# 41
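
The three metrics in the table can be read more easily with their definitions in hand. This page does not define them, so the sketch below follows the definitions published with the Winoground benchmark as an assumption: each example has two captions (c0, c1) and two images (i0, i1); the text score counts examples where each image prefers its own caption, the image score counts examples where each caption prefers its own image, and the group score requires both. The dictionary keys and the function name `winoground_scores` are hypothetical names chosen for illustration, not part of any released code.

```python
from typing import Dict, List

# One example = four similarity scores between captions (c0, c1) and images (i0, i1).
# Key naming ("c0_i0" = score of caption 0 with image 0) is an illustrative assumption.
Scores = Dict[str, float]


def text_correct(s: Scores) -> bool:
    # For each image, the matching caption must score higher than the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Scores) -> bool:
    # For each caption, the matching image must score higher than the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Scores) -> bool:
    # The group score requires both conditions on the same example.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Scores]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages over all examples."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }


if __name__ == "__main__":
    # Made-up similarity values, used only to exercise the functions.
    toy = [
        {"c0_i0": 0.9, "c1_i0": 0.1, "c0_i1": 0.2, "c1_i1": 0.8},
        {"c0_i0": 0.9, "c1_i0": 0.1, "c0_i1": 0.95, "c1_i1": 0.97},
    ]
    print(winoground_scores(toy))
```

Because the group score needs both conditions to hold on the same example, it can never exceed the text or image score, which matches every row of the table above.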


Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

10 May 2023 · Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson

Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS) performance in a variety of tasks. However, recent works have shown that even the best VLMs struggle to capture aspects of compositional scene understanding, such as object attributes, relations, and action states. In contrast, obtaining structured annotations, such as scene graphs (SGs), that could improve these models is time-consuming and costly, and thus cannot be used on a large scale. Here we ask whether small SG datasets can provide sufficient information for enhancing structured understanding of pretrained VLMs. We show that it is indeed possible to improve VLMs when learning from SGs by integrating components that incorporate structured information into both visual and textual representations. For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions that highlight different compositional aspects of the scene. Our method improves the performance of several popular VLMs on multiple VL datasets with only a mild degradation in ZS capabilities.
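
The abstract's idea of using scene graphs to generate fine-grained captions can be pictured with a toy example. The abstract does not specify how captions are generated, so everything below is an assumption made for illustration: the `SceneObject` class, the caption template, and the subject/object swap used to build a contrasting caption are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    """A scene graph node: an object name plus its attributes (hypothetical schema)."""
    name: str
    attributes: List[str] = field(default_factory=list)


# A relation is (subject index, predicate, object index) into the object list.
Relation = Tuple[int, str, int]


def describe(obj: SceneObject) -> str:
    # "brown" + "dog" -> "brown dog"; attributes come before the noun.
    return " ".join([*obj.attributes, obj.name])


def graph_captions(objects: List[SceneObject], relations: List[Relation]) -> List[str]:
    # One caption per relation, spelling out attributes on both ends.
    return [
        f"a {describe(objects[s])} {pred} a {describe(objects[o])}"
        for s, pred, o in relations
    ]


def swapped_caption(objects: List[SceneObject], relation: Relation) -> str:
    # Illustrative contrast caption: same words, subject and object exchanged.
    s, pred, o = relation
    return f"a {describe(objects[o])} {pred} a {describe(objects[s])}"


if __name__ == "__main__":
    objs = [SceneObject("dog", ["brown"]), SceneObject("sofa", ["red"])]
    rels = [(0, "sitting on", 1)]
    print(graph_captions(objs, rels))
    print(swapped_caption(objs, rels[0]))
```

The swapped caption reuses exactly the same words in a different arrangement, which is the kind of compositional distinction the abstract says current VLMs struggle with.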



Tasks


Scene Understanding

Visual Reasoning

Datasets

MS COCO

Visual Genome

Winoground

VSR

ELEVATER

ARO

Results from the Paper


Ranked #24 on Visual Reasoning on Winoground


