MMCOMPOSITION is a high-quality benchmark designed to comprehensively evaluate the compositionality of pre-trained Vision-Language Models (VLMs) across three main dimensions (VL compositional perception, reasoning, and probing), which are further divided into 13 distinct question categories. Whereas previous benchmarks have focused mainly on text-to-image retrieval, single-choice questions, and open-ended text generation, MMCOMPOSITION introduces a more diverse and challenging set of 4,342 tasks covering both single-image and multi-image scenarios, in both single-choice and indefinite-choice formats. This expanded task range aims to capture the complex interplay between vision and language more effectively, going beyond earlier benchmarks such as ARO and Winoground with a more comprehensive and in-depth assessment of models' cross-modal compositional capabilities.
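To illustrate the difference between the two answer formats, below is a minimal scoring sketch. The function name and data layout are hypothetical, not MMCOMPOSITION's actual evaluation code; it assumes the common convention that an indefinite-choice item is correct only when the predicted option set exactly matches the gold set (no partial credit).

```python
def score_item(predicted: set, gold: set) -> bool:
    """Hypothetical per-item scorer (not the official MMCOMPOSITION API).

    Single-choice: both sets contain exactly one option letter.
    Indefinite-choice: the model must select all correct options and
    no extras, so strict set equality is required.
    """
    return predicted == gold

# Single-choice item: one gold option, one predicted option.
print(score_item({"B"}, {"B"}))        # True

# Indefinite-choice item: partial selections do not count.
print(score_item({"A"}, {"A", "C"}))   # False
print(score_item({"A", "C"}, {"A", "C"}))  # True
```

Strict set equality makes indefinite-choice questions considerably harder than single-choice ones, since a model cannot score by guessing one plausible option.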
Paper | Code | Results | Date | Stars
---|---|---|---|---