MMCOMPOSITION is a high-quality benchmark designed to comprehensively evaluate the compositionality of pre-trained Vision-Language Models (VLMs) across three main dimensions: VL compositional perception, reasoning, and probing, which are further divided into 13 distinct question categories. Whereas previous benchmarks have focused mainly on text-to-image retrieval, single-choice questions, and open-ended text generation, MMCOMPOSITION introduces a more diverse and challenging set of 4,342 tasks covering both single-image and multi-image scenarios, as well as single-choice and indefinite-choice formats. This expanded range of tasks captures the complex interplay between vision and language more effectively, providing a more comprehensive and in-depth assessment of models' cross-modal compositional capabilities than earlier benchmarks such as ARO and Winoground.
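
The mix of single-choice and indefinite-choice (multiple correct answers) formats affects how predictions are scored. The sketch below illustrates one plausible way to represent and score such items; the field names (question, images, options, answers) and the strict exact-match rule are assumptions for illustration, not the benchmark's official evaluation code.

```python
# Hypothetical sketch of scoring MMCOMPOSITION-style items. Field names and the
# exact-match rule are assumptions, not the official evaluation protocol.
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CompositionItem:
    question: str
    images: List[str]   # one path for single-image items, several for multi-image items
    options: List[str]  # candidate answers, e.g. ["A. ...", "B. ...", ...]
    answers: Set[str]   # one label for single-choice, possibly several for indefinite-choice


def score_item(item: CompositionItem, predicted: Set[str]) -> float:
    """Exact-match scoring: the predicted label set must equal the full answer set.

    For single-choice questions this reduces to ordinary accuracy; for
    indefinite-choice questions a partially correct selection scores 0.
    """
    return 1.0 if predicted == item.answers else 0.0


def accuracy(items: List[CompositionItem], predictions: List[Set[str]]) -> float:
    """Mean exact-match score over a list of benchmark items."""
    scores = [score_item(it, pred) for it, pred in zip(items, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    demo = CompositionItem(
        question="Which attributes describe the object to the left of the red cup?",
        images=["example.jpg"],
        options=["A. metallic", "B. striped", "C. wooden", "D. round"],
        answers={"A", "D"},  # indefinite-choice: more than one correct option
    )
    print(accuracy([demo], [{"A", "D"}]))  # 1.0
    print(accuracy([demo], [{"A"}]))       # 0.0 under strict exact match
```

Strict exact matching is one common convention for multi-answer questions; a benchmark could instead award partial credit, so treat this choice as illustrative.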
