The WinoGAViL dataset is collected via an online game designed to gather challenging vision-and-language associations. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.
We use the game to collect 3.5K instances and find that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models: the best model (ViLT) achieves a score of 52%, succeeding mostly when the cue is visually salient.
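For concreteness, the Jaccard index compares a model's predicted image set against the gold associations. A minimal sketch (the function name and example sets are illustrative, not taken from the official evaluation code):

```python
def jaccard(predicted: set, gold: set) -> float:
    """Jaccard index: |intersection| / |union| of the two image sets."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

# Hypothetical example: the model selects 3 images, 2 of which are gold.
print(jaccard({"img1", "img2", "img3"}, {"img2", "img3", "img4"}))  # 0.5
```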
Researchers are welcome to evaluate models on this dataset. A simple intended use is zero-shot prediction: run a vision-and-language model to produce a score for each (cue, image) pair, then take the K pairs with the highest scores, as sketched below.
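A minimal zero-shot sketch using CLIP as the vision-and-language model; the checkpoint, image paths, cue, and K are placeholders, and the official experiment code lives in the repository linked below:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; substitute any vision-and-language model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

cue = "werewolf"  # hypothetical cue
candidates = ["wolf.jpg", "moon.jpg", "car.jpg", "tree.jpg"]  # hypothetical files
k = 2  # number of images to associate with the cue

images = [Image.open(path) for path in candidates]
inputs = processor(text=[cue], images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one cue-image similarity score per candidate.
scores = outputs.logits_per_image.squeeze(-1)
top_k = scores.topk(k).indices.tolist()
predicted = {candidates[i] for i in top_k}
print(predicted)
```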
A supervised setting is also possible; code for re-running the experiments is available in the GitHub repository: https://github.com/WinoGAViL/WinoGAViL-experiments