BLINK is a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not covered by other evaluations¹².

Most BLINK tasks can be solved by humans “within a blink” (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning)¹². However, these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language¹².

BLINK reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, each paired with one or more images and visual prompting¹². While humans reach 95.70% accuracy on average, BLINK is surprisingly challenging for existing multimodal LLMs: even the best-performing models, GPT-4V and Gemini, achieve accuracies of only 51.26% and 45.72%, just 13.17 and 7.63 percentage points above random guessing (an implied baseline of roughly 38%), indicating that these perception abilities have not yet “emerged” in recent multimodal LLMs¹².

The BLINK benchmark is designed to stimulate the community to help multimodal LLMs catch up with human-level visual perception¹². It covers diverse visual prompting, perception beyond recognition, and visual commonsense¹. The benchmark is available on GitHub¹.
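For readers who want to experiment with the benchmark, here is a minimal sketch of an evaluation loop. It assumes the dataset is mirrored on the Hugging Face Hub under the ID `BLINK-Benchmark/BLINK` with one config per task, a public `val` split, and an `answer` field holding the option letter; these details are not confirmed by the sources above, so treat them as placeholders and check the GitHub repository for the authoritative format.

```python
# Hypothetical sketch: load one BLINK task with the Hugging Face `datasets` library
# and score a trivial fixed-answer baseline. The hub ID, config name, split, and
# field names below are assumptions; consult the GitHub repo for the real layout.
from datasets import load_dataset

# Assumed: each BLINK task (e.g. relative depth) is exposed as its own config,
# with a validation split whose answers are public.
subset = load_dataset("BLINK-Benchmark/BLINK", "Relative_Depth", split="val")

correct = 0
for example in subset:
    # A real model would consume the example's image(s), question, and choices here;
    # we always answer "(A)" to illustrate the loop, assuming answers are stored
    # as option letters such as "(A)".
    prediction = "(A)"
    correct += prediction == example["answer"]

print(f"Fixed-guess accuracy: {correct / len(subset):.2%}")
```

Scoring a real model only requires replacing the fixed `prediction` with the model's chosen option for each question.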

(1) zeyofu/BLINK_Benchmark - GitHub. https://github.com/zeyofu/BLINK_Benchmark.
(2) BLINK: Multimodal Large Language Models Can See but Not Perceive. https://arxiv.org/abs/2404.12390. https://doi.org/10.48550/arXiv.2404.12390.
(3) Releases · zeyofu/BLINK_Benchmark - GitHub. https://github.com/zeyofu/BLINK_Benchmark/releases.
