2 papers with code • 12 benchmarks • 1 datasets
Predict the alignment (score) between an image and a sentence.
Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning
Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs, but only with images and sentences drawn from different sources and object labels detected from the images.
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena.