Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
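The three leaderboard metrics below follow from the matching task just described. For an example with captions (C0, C1) and images (I0, I1), let s(c, i) be a model's caption–image matching score, where caption k correctly describes image k. The text score counts an example as correct when the model prefers the right caption for each image, the image score when it prefers the right image for each caption, and the group score when both hold. A minimal sketch of these definitions (the `winoground_metrics` helper and the 2x2 `s` array are illustrative names, not from the paper's released code):

```python
import numpy as np

def winoground_metrics(s):
    """Per-example Winoground correctness flags.

    s is a 2x2 array with s[c, i] = matching score for caption c and image i;
    caption k is the correct description of image k.
    """
    text_ok = s[0, 0] > s[1, 0] and s[1, 1] > s[0, 1]   # right caption chosen for each image
    image_ok = s[0, 0] > s[0, 1] and s[1, 1] > s[1, 0]  # right image chosen for each caption
    group_ok = text_ok and image_ok                     # both directions must succeed
    return text_ok, image_ok, group_ok

# A model that scores the correct pairs highest gets credit on all three metrics:
print(winoground_metrics(np.array([[0.9, 0.2],
                                   [0.1, 0.8]])))  # (True, True, True)
```

Since both captions use the same words in a different order, a bag-of-words model scores each caption identically against both images and fails all three checks.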

PDF Abstract (CVPR 2022)

Results from the Paper


All results are for the Visual Reasoning task on the Winoground dataset; the number in parentheses after each metric value is that model's global leaderboard rank.

| Model | Text Score | Image Score | Group Score |
|---|---|---|---|
| Random chance | 25.00 (#85) | 25.00 (#34) | 16.67 (#40) |
| VSRN (Flickr30k) | 20.00 (#97) | 5.00 (#108) | 3.50 (#98) |
| VisualBERT base | 15.50 (#109) | 2.50 (#110) | 1.50 (#104) |
| VSRN (COCO) | 17.50 (#105) | 7.00 (#102) | 3.75 (#97) |
| VSE++ (COCO, VGG) | 18.75 (#103) | 5.50 (#106) | 3.50 (#98) |
| LXMERT | 19.25 (#101) | 7.00 (#102) | 4.00 (#93) |
| UniT (ITM finetuned) | 19.50 (#100) | 6.25 (#104) | 4.00 (#93) |
| VSE++ (Flickr30k, VGG) | 19.75 (#99) | 6.25 (#104) | 4.50 (#91) |
| VSE++ (Flickr30k, ResNet) | 20.00 (#97) | 5.00 (#108) | 2.75 (#101) |
| VSE++ (COCO, ResNet) | 22.75 (#92) | 8.00 (#96) | 4.00 (#93) |
| ViLBERT base | 23.75 (#89) | 7.25 (#100) | 4.75 (#90) |
| FLAVA (contrastive) | 25.25 (#84) | 13.50 (#78) | 9.00 (#75) |
| ViLLA base | 30.00 (#68) | 12.00 (#84) | 8.00 (#80) |
| CLIP (ViT-B/32) | 30.75 (#62) | 10.50 (#91) | 8.00 (#80) |
| UNITER base | 32.25 (#58) | 13.25 (#80) | 10.00 (#69) |
| FLAVA (ITM) | 32.25 (#58) | 20.50 (#50) | 14.25 (#48) |
| ViLT (ViT-B/32) | 34.75 (#52) | 14.00 (#73) | 9.25 (#74) |
| ViLLA large | 37.00 (#44) | 13.25 (#80) | 11.00 (#64) |
| VinVL | 37.75 (#43) | 17.75 (#58) | 14.50 (#46) |
| UNITER large | 38.00 (#41) | 14.00 (#73) | 10.50 (#66) |

Methods


No methods listed for this paper.