VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
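The results below report two kinds of scores per VALSE instrument: an accuracy over individual caption/foil judgements, and a pairwise accuracy over (image, caption, foil) triples. As a minimal illustrative sketch (not the paper's implementation), assuming a hypothetical `score(image, sentence)` function that returns an image-sentence alignment score, the two metrics could be computed like this:

```python
# Illustrative sketch only. `score(image, sentence)` is a hypothetical
# alignment scorer (higher = better match), not an API from the paper.

def pairwise_accuracy(examples, score):
    """Percentage of (image, caption, foil) triples where the model
    scores the correct caption strictly above its foil."""
    hits = sum(score(img, cap) > score(img, foil)
               for img, cap, foil in examples)
    return 100 * hits / len(examples)

def accuracy(examples, score, threshold=0.5):
    """Percentage of correct independent judgements: each caption
    should score above the threshold, each foil at or below it."""
    correct = 0
    for img, cap, foil in examples:
        correct += score(img, cap) > threshold    # caption accepted
        correct += score(img, foil) <= threshold  # foil rejected
    return 100 * correct / (2 * len(examples))
```

Note that the pairwise formulation needs no calibrated decision threshold, which is why it can also be applied to models like CLIP, GPT1, and GPT2 that only yield relative scores; this is one plausible reason the table reports only pairwise accuracy for those models.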
ACL 2022
Datasets
Introduced in the Paper: VALSE
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
image-sentence alignment | VALSE | ViLBERT 12-in-1 | Average Accuracy | 63.2 | # 1
image-sentence alignment | VALSE | ViLBERT 12-in-1 | average pairwise accuracy | 75.1 | # 1
image-sentence alignment | VALSE | VisualBERT | Average Accuracy | 48.8 | # 4
image-sentence alignment | VALSE | VisualBERT | average pairwise accuracy | 46.4 | # 7
image-sentence alignment | VALSE | ViLBERT | Average Accuracy | 51.3 | # 3
image-sentence alignment | VALSE | ViLBERT | average pairwise accuracy | 63.7 | # 3
image-sentence alignment | VALSE | LXMERT | Average Accuracy | 53.5 | # 2
image-sentence alignment | VALSE | LXMERT | average pairwise accuracy | 59.6 | # 6
image-sentence alignment | VALSE | CLIP | average pairwise accuracy | 64.0 | # 2
image-sentence alignment | VALSE | GPT2 | average pairwise accuracy | 60.1 | # 5
image-sentence alignment | VALSE | GPT1 | average pairwise accuracy | 60.7 | # 4
image-sentence alignment | VALSE actant swap | CLIP | pairwise accuracy | 68.6 | # 3
image-sentence alignment | VALSE actant swap | GPT1 | pairwise accuracy | 72.2 | # 2
image-sentence alignment | VALSE actant swap | ViLBERT | Accuracy (%) | 50.4 | # 2
image-sentence alignment | VALSE actant swap | ViLBERT | pairwise accuracy | 68.3 | # 4
image-sentence alignment | VALSE actant swap | LXMERT | Accuracy (%) | 48.5 | # 4
image-sentence alignment | VALSE actant swap | LXMERT | pairwise accuracy | 45.8 | # 6
image-sentence alignment | VALSE actant swap | ViLBERT 12-in-1 | Accuracy (%) | 52.2 | # 1
image-sentence alignment | VALSE actant swap | ViLBERT 12-in-1 | pairwise accuracy | 58.9 | # 5
image-sentence alignment | VALSE actant swap | VisualBERT | Accuracy (%) | 49.7 | # 3
image-sentence alignment | VALSE actant swap | VisualBERT | pairwise accuracy | 44.4 | # 7
image-sentence alignment | VALSE actant swap | GPT2 | pairwise accuracy | 76.9 | # 1
image-sentence alignment | VALSE action replacement | ViLBERT 12-in-1 | pairwise accuracy | 65.9 | # 4
image-sentence alignment | VALSE action replacement | ViLBERT 12-in-1 | Accuracy (%) | 57.3 | # 1
image-sentence alignment | VALSE action replacement | LXMERT | pairwise accuracy | 54.8 | # 6
image-sentence alignment | VALSE action replacement | LXMERT | Accuracy (%) | 51.1 | # 3
image-sentence alignment | VALSE action replacement | GPT1 | pairwise accuracy | 65.4 | # 5
image-sentence alignment | VALSE action replacement | GPT2 | pairwise accuracy | 66.8 | # 3
image-sentence alignment | VALSE action replacement | CLIP | pairwise accuracy | 75.6 | # 1
image-sentence alignment | VALSE action replacement | VisualBERT | pairwise accuracy | 49.2 | # 7
image-sentence alignment | VALSE action replacement | VisualBERT | Accuracy (%) | 48.8 | # 4
image-sentence alignment | VALSE action replacement | ViLBERT | pairwise accuracy | 70.7 | # 2
image-sentence alignment | VALSE action replacement | ViLBERT | Accuracy (%) | 52.6 | # 2
image-sentence alignment | VALSE coreference clean | ViLBERT | Accuracy (%) | 50.0 | # 2
image-sentence alignment | VALSE coreference clean | ViLBERT | pairwise accuracy | 48.1 | # 4
image-sentence alignment | VALSE coreference clean | LXMERT | Accuracy (%) | 49.0 | # 4
image-sentence alignment | VALSE coreference clean | LXMERT | pairwise accuracy | 44.2 | # 7
image-sentence alignment | VALSE coreference clean | GPT2 | pairwise accuracy | 50.0 | # 2
image-sentence alignment | VALSE coreference clean | ViLBERT 12-in-1 | Accuracy (%) | 54.3 | # 1
image-sentence alignment | VALSE coreference clean | ViLBERT 12-in-1 | pairwise accuracy | 69.2 | # 1
image-sentence alignment | VALSE coreference clean | VisualBERT | Accuracy (%) | 50.0 | # 2
image-sentence alignment | VALSE coreference clean | VisualBERT | pairwise accuracy | 47.6 | # 5
image-sentence alignment | VALSE coreference clean | CLIP | pairwise accuracy | 49.7 | # 3
image-sentence alignment | VALSE coreference clean | GPT1 | pairwise accuracy | 45.2 | # 6
image-sentence alignment | VALSE coreference standard | GPT2 | pairwise accuracy | 54.5 | # 2
image-sentence alignment | VALSE coreference standard | CLIP | pairwise accuracy | 52.1 | # 3
image-sentence alignment | VALSE coreference standard | VisualBERT | pairwise accuracy | 49.5 | # 4
image-sentence alignment | VALSE coreference standard | VisualBERT | Accuracy (%) | 50.0 | # 2
image-sentence alignment | VALSE coreference standard | ViLBERT 12-in-1 | pairwise accuracy | 75.7 | # 1
image-sentence alignment | VALSE coreference standard | ViLBERT 12-in-1 | Accuracy (%) | 54.4 | # 1
image-sentence alignment | VALSE coreference standard | ViLBERT | pairwise accuracy | 47.2 | # 5
image-sentence alignment | VALSE coreference standard | ViLBERT | Accuracy (%) | 50.0 | # 2
image-sentence alignment | VALSE coreference standard | LXMERT | pairwise accuracy | 46.8 | # 6
image-sentence alignment | VALSE coreference standard | LXMERT | Accuracy (%) | 49.8 | # 4
image-sentence alignment | VALSE coreference standard | GPT1 | pairwise accuracy | 45.6 | # 7
image-sentence alignment | VALSE counting adversarial | CLIP | pairwise accuracy | 57.5 | # 4
image-sentence alignment | VALSE counting adversarial | GPT1 | pairwise accuracy | 69.5 | # 3
image-sentence alignment | VALSE counting adversarial | ViLBERT 12-in-1 | pairwise accuracy | 77.3 | # 1
image-sentence alignment | VALSE counting adversarial | ViLBERT 12-in-1 | Accuracy (%) | 66.7 | # 1
image-sentence alignment | VALSE counting adversarial | GPT2 | pairwise accuracy | 45.3 | # 6
image-sentence alignment | VALSE counting adversarial | LXMERT | pairwise accuracy | 42.6 | # 7
image-sentence alignment | VALSE counting adversarial | LXMERT | Accuracy (%) | 49.9 | # 4
image-sentence alignment | VALSE counting adversarial | ViLBERT | pairwise accuracy | 73.7 | # 2
image-sentence alignment | VALSE counting adversarial | ViLBERT | Accuracy (%) | 51.8 | # 2
image-sentence alignment | VALSE counting adversarial | VisualBERT | pairwise accuracy | 50.0 | # 5
image-sentence alignment | VALSE counting adversarial | VisualBERT | Accuracy (%) | 50.0 | # 3
image-sentence alignment | VALSE counting balanced | GPT2 | pairwise accuracy | 51.6 | # 5
image-sentence alignment | VALSE counting balanced | VisualBERT | pairwise accuracy | 48.2 | # 7
image-sentence alignment | VALSE counting balanced | VisualBERT | Accuracy (%) | 48.3 | # 4
image-sentence alignment | VALSE counting balanced | GPT1 | pairwise accuracy | 51.2 | # 6
image-sentence alignment | VALSE counting balanced | CLIP | pairwise accuracy | 62.1 | # 3
image-sentence alignment | VALSE counting balanced | ViLBERT 12-in-1 | pairwise accuracy | 76.7 | # 1
image-sentence alignment | VALSE counting balanced | ViLBERT 12-in-1 | Accuracy (%) | 64.9 | # 1
image-sentence alignment | VALSE counting balanced | ViLBERT | pairwise accuracy | 58.6 | # 4
image-sentence alignment | VALSE counting balanced | ViLBERT | Accuracy (%) | 50.7 | # 3
image-sentence alignment | VALSE counting balanced | LXMERT | pairwise accuracy | 62.2 | # 2
image-sentence alignment | VALSE counting balanced | LXMERT | Accuracy (%) | 52.0 | # 2
image-sentence alignment | VALSE counting small numbers | ViLBERT | Accuracy (%) | 50.6 | # 3
image-sentence alignment | VALSE counting small numbers | ViLBERT | pairwise accuracy | 62.9 | # 3
image-sentence alignment | VALSE counting small numbers | ViLBERT 12-in-1 | Accuracy (%) | 69.2 | # 1
image-sentence alignment | VALSE counting small numbers | ViLBERT 12-in-1 | pairwise accuracy | 80.2 | # 1
image-sentence alignment | VALSE counting small numbers | CLIP | pairwise accuracy | 62.5 | # 4
image-sentence alignment | VALSE counting small numbers | GPT2 | pairwise accuracy | 49.8 | # 5
image-sentence alignment | VALSE counting small numbers | GPT1 | pairwise accuracy | 48.7 | # 6
image-sentence alignment | VALSE counting small numbers | LXMERT | Accuracy (%) | 55.4 | # 2
image-sentence alignment | VALSE counting small numbers | LXMERT | pairwise accuracy | 69.2 | # 2
image-sentence alignment | VALSE counting small numbers | VisualBERT | Accuracy (%) | 47.8 | # 4
image-sentence alignment | VALSE counting small numbers | VisualBERT | pairwise accuracy | 48.2 | # 7
image-sentence alignment | VALSE existence | GPT2 | pairwise accuracy | 58.0 | # 6
image-sentence alignment | VALSE existence | ViLBERT | pairwise accuracy | 66.5 | # 4
image-sentence alignment | VALSE existence | ViLBERT | Accuracy (%) | 2.4 | # 4
image-sentence alignment | VALSE existence | LXMERT | pairwise accuracy | 78.6 | # 2
image-sentence alignment | VALSE existence | LXMERT | Accuracy (%) | 55.8 | # 2
image-sentence alignment | VALSE existence | ViLBERT 12-in-1 | pairwise accuracy | 95.6 | # 1
image-sentence alignment | VALSE existence | ViLBERT 12-in-1 | Accuracy (%) | 89.0 | # 1
image-sentence alignment | VALSE existence | GPT1 | pairwise accuracy | 61.8 | # 5
image-sentence alignment | VALSE existence | VisualBERT | pairwise accuracy | 39.7 | # 7
image-sentence alignment | VALSE existence | VisualBERT | Accuracy (%) | 49.3 | # 3
image-sentence alignment | VALSE existence | CLIP | pairwise accuracy | 66.9 | # 3
image-sentence alignment | VALSE foil-it (noun phrases) | ViLBERT | pairwise accuracy | 86.9 | # 3
image-sentence alignment | VALSE foil-it (noun phrases) | ViLBERT | Accuracy (%) | 55.9 | # 3
image-sentence alignment | VALSE foil-it (noun phrases) | LXMERT | pairwise accuracy | 87.1 | # 2
image-sentence alignment | VALSE foil-it (noun phrases) | LXMERT | Accuracy (%) | 70.8 | # 2
image-sentence alignment | VALSE foil-it (noun phrases) | ViLBERT 12-in-1 | pairwise accuracy | 86.9 | # 3
image-sentence alignment | VALSE foil-it (noun phrases) | ViLBERT 12-in-1 | Accuracy (%) | 71.5 | # 1
image-sentence alignment | VALSE foil-it (noun phrases) | VisualBERT | pairwise accuracy | 48.5 | # 7
image-sentence alignment | VALSE foil-it (noun phrases) | VisualBERT | Accuracy (%) | 46.6 | # 4
image-sentence alignment | VALSE foil-it (noun phrases) | CLIP | pairwise accuracy | 88.8 | # 1
image-sentence alignment | VALSE foil-it (noun phrases) | GPT2 | pairwise accuracy | 80.7 | # 5
image-sentence alignment | VALSE foil-it (noun phrases) | GPT1 | pairwise accuracy | 77.5 | # 6
image-sentence alignment | VALSE plurality | LXMERT | Accuracy (%) | 55.1 | # 2
image-sentence alignment | VALSE plurality | LXMERT | pairwise accuracy | 64.4 | # 2
image-sentence alignment | VALSE plurality | ViLBERT | Accuracy (%) | 50.3 | # 3
image-sentence alignment | VALSE plurality | ViLBERT | pairwise accuracy | 61.2 | # 3
image-sentence alignment | VALSE plurality | ViLBERT 12-in-1 | Accuracy (%) | 62.0 | # 1
image-sentence alignment | VALSE plurality | ViLBERT 12-in-1 | pairwise accuracy | 72.4 | # 1
image-sentence alignment | VALSE plurality | VisualBERT | Accuracy (%) | 46.5 | # 4
image-sentence alignment | VALSE plurality | VisualBERT | pairwise accuracy | 45.7 | # 7
image-sentence alignment | VALSE plurality | CLIP | pairwise accuracy | 56.2 | # 4
image-sentence alignment | VALSE plurality | GPT2 | pairwise accuracy | 51.9 | # 6
image-sentence alignment | VALSE plurality | GPT1 | pairwise accuracy | 53.1 | # 5
image-sentence alignment | VALSE spatial relations | GPT1 | pairwise accuracy | 77.2 | # 1
image-sentence alignment | VALSE spatial relations | LXMERT | Accuracy (%) | 50.8 | # 2
image-sentence alignment | VALSE spatial relations | LXMERT | pairwise accuracy | 60.2 | # 5
image-sentence alignment | VALSE spatial relations | ViLBERT | Accuracy (%) | 49.9 | # 3
image-sentence alignment | VALSE spatial relations | ViLBERT | pairwise accuracy | 57.2 | # 6
image-sentence alignment | VALSE spatial relations | ViLBERT 12-in-1 | Accuracy (%) | 53.4 | # 1
image-sentence alignment | VALSE spatial relations | ViLBERT 12-in-1 | pairwise accuracy | 67.7 | # 3
image-sentence alignment | VALSE spatial relations | VisualBERT | Accuracy (%) | 49.3 | # 4
image-sentence alignment | VALSE spatial relations | VisualBERT | pairwise accuracy | 39.7 | # 7
image-sentence alignment | VALSE spatial relations | CLIP | pairwise accuracy | 64.3 | # 4
image-sentence alignment | VALSE spatial relations | GPT2 | pairwise accuracy | 75.0 | # 2