VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed to test the visio-linguistic grounding capabilities of general-purpose pretrained vision and language (V&L) models on specific linguistic phenomena. VALSE offers a suite of six tests covering a range of linguistic constructs. Solving them requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. We therefore expect VALSE to serve as an important benchmark for measuring future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
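Each VALSE instance pairs an image with a correct caption and a minimally edited "foil" that alters one linguistic phenomenon. A model that scores image-sentence alignment is then judged on how often it prefers the caption over the foil. The sketch below illustrates this pairwise comparison; `pairwise_accuracy` is a hypothetical helper and the toy scores stand in for real model outputs, so this is an illustration of the general scheme rather than the paper's exact evaluation code.

```python
# Illustrative sketch of VALSE-style foil evaluation (not the official code).
# Each instance holds the model's alignment score for the correct caption
# and for its foil, both computed against the same image.

def pairwise_accuracy(instances):
    """Fraction of instances where the caption outscores its foil."""
    correct = sum(1 for caption_score, foil_score in instances
                  if caption_score > foil_score)
    return correct / len(instances)

# Toy (caption_score, foil_score) pairs standing in for model outputs.
toy_instances = [(0.91, 0.40), (0.55, 0.60), (0.72, 0.31), (0.80, 0.79)]
print(pairwise_accuracy(toy_instances))  # 3 of 4 captions win -> 0.75
```

Because both sentences are scored against the same image, this metric isolates whether the model is sensitive to the single phenomenon the foil changes, rather than to overall caption plausibility.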

PDF Abstract (ACL 2022)

Datasets


Introduced in the Paper:

VALSE

Used in the Paper:

COCO, Visual Question Answering, VisDial, Visual7W

Results from the Paper


All results below are for the task of image-sentence alignment. Each VALSE leaderboard reports pairwise accuracy (average pairwise accuracy for the overall benchmark) and, where available, Accuracy (%); "#" gives the model's global rank for that metric, and "–" marks metrics not reported for that model.

VALSE (overall)

| Model | Avg. pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 75.1 | #1 | 63.2 | #1 |
| CLIP | 64.0 | #2 | – | – |
| ViLBERT | 63.7 | #3 | 51.3 | #3 |
| GPT1 | 60.7 | #4 | – | – |
| GPT2 | 60.1 | #5 | – | – |
| LXMERT | 59.6 | #6 | 53.5 | #2 |
| VisualBERT | 46.4 | #7 | 48.8 | #4 |

VALSE actant swap

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| GPT2 | 76.9 | #1 | – | – |
| GPT1 | 72.2 | #2 | – | – |
| CLIP | 68.6 | #3 | – | – |
| ViLBERT | 68.3 | #4 | 50.4 | #2 |
| ViLBERT 12-in-1 | 58.9 | #5 | 52.2 | #1 |
| LXMERT | 45.8 | #6 | 48.5 | #4 |
| VisualBERT | 44.4 | #7 | 49.7 | #3 |

VALSE action replacement

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| CLIP | 75.6 | #1 | – | – |
| ViLBERT | 70.7 | #2 | 52.6 | #2 |
| GPT2 | 66.8 | #3 | – | – |
| ViLBERT 12-in-1 | 65.9 | #4 | 57.3 | #1 |
| GPT1 | 65.4 | #5 | – | – |
| LXMERT | 54.8 | #6 | 51.1 | #3 |
| VisualBERT | 49.2 | #7 | 48.8 | #4 |

VALSE coreference clean

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 69.2 | #1 | 54.3 | #1 |
| GPT2 | 50.0 | #2 | – | – |
| CLIP | 49.7 | #3 | – | – |
| ViLBERT | 48.1 | #4 | 50.0 | #2 |
| VisualBERT | 47.6 | #5 | 50.0 | #2 |
| GPT1 | 45.2 | #6 | – | – |
| LXMERT | 44.2 | #7 | 49.0 | #4 |

VALSE coreference standard

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 75.7 | #1 | 54.4 | #1 |
| GPT2 | 54.5 | #2 | – | – |
| CLIP | 52.1 | #3 | – | – |
| VisualBERT | 49.5 | #4 | 50.0 | #2 |
| ViLBERT | 47.2 | #5 | 50.0 | #2 |
| LXMERT | 46.8 | #6 | 49.8 | #4 |
| GPT1 | 45.6 | #7 | – | – |

VALSE counting adversarial

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 77.3 | #1 | 66.7 | #1 |
| ViLBERT | 73.7 | #2 | 51.8 | #2 |
| GPT1 | 69.5 | #3 | – | – |
| CLIP | 57.5 | #4 | – | – |
| VisualBERT | 50.0 | #5 | 50.0 | #3 |
| GPT2 | 45.3 | #6 | – | – |
| LXMERT | 42.6 | #7 | 49.9 | #4 |

VALSE counting balanced

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 76.7 | #1 | 64.9 | #1 |
| LXMERT | 62.2 | #2 | 52.0 | #2 |
| CLIP | 62.1 | #3 | – | – |
| ViLBERT | 58.6 | #4 | 50.7 | #3 |
| GPT2 | 51.6 | #5 | – | – |
| GPT1 | 51.2 | #6 | – | – |
| VisualBERT | 48.2 | #7 | 48.3 | #4 |

VALSE counting small numbers

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 80.2 | #1 | 69.2 | #1 |
| LXMERT | 69.2 | #2 | 55.4 | #2 |
| ViLBERT | 62.9 | #3 | 50.6 | #3 |
| CLIP | 62.5 | #4 | – | – |
| GPT2 | 49.8 | #5 | – | – |
| GPT1 | 48.7 | #6 | – | – |
| VisualBERT | 48.2 | #7 | 47.8 | #4 |

VALSE existence

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 95.6 | #1 | 89.0 | #1 |
| LXMERT | 78.6 | #2 | 55.8 | #2 |
| CLIP | 66.9 | #3 | – | – |
| ViLBERT | 66.5 | #4 | 2.4 | #4 |
| GPT1 | 61.8 | #5 | – | – |
| GPT2 | 58.0 | #6 | – | – |
| VisualBERT | 39.7 | #7 | 49.3 | #3 |

VALSE foil-it (noun phrases)

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| CLIP | 88.8 | #1 | – | – |
| LXMERT | 87.1 | #2 | 70.8 | #2 |
| ViLBERT 12-in-1 | 86.9 | #3 | 71.5 | #1 |
| ViLBERT | 86.9 | #3 | 55.9 | #3 |
| GPT2 | 80.7 | #5 | – | – |
| GPT1 | 77.5 | #6 | – | – |
| VisualBERT | 48.5 | #7 | 46.6 | #4 |

VALSE plurality

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| ViLBERT 12-in-1 | 72.4 | #1 | 62.0 | #1 |
| LXMERT | 64.4 | #2 | 55.1 | #2 |
| ViLBERT | 61.2 | #3 | 50.3 | #3 |
| CLIP | 56.2 | #4 | – | – |
| GPT1 | 53.1 | #5 | – | – |
| GPT2 | 51.9 | #6 | – | – |
| VisualBERT | 45.7 | #7 | 46.5 | #4 |

VALSE spatial relations

| Model | Pairwise accuracy | Rank | Accuracy (%) | Rank |
|---|---|---|---|---|
| GPT1 | 77.2 | #1 | – | – |
| GPT2 | 75.0 | #2 | – | – |
| ViLBERT 12-in-1 | 67.7 | #3 | 53.4 | #1 |
| CLIP | 64.3 | #4 | – | – |
| LXMERT | 60.2 | #5 | 50.8 | #2 |
| ViLBERT | 57.2 | #6 | 49.9 | #3 |
| VisualBERT | 39.7 | #7 | 49.3 | #4 |

Methods


No methods listed for this paper.