BLINK is a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not covered by other evaluations¹².
Most BLINK tasks can be solved by humans “within a blink” (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning)¹². However, these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language¹².
BLINK reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting¹². While humans get 95.70% accuracy on average, BLINK is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not “emerged” yet in recent multimodal LLMs¹².
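Because every question is multiple-choice, scoring reduces to plain accuracy, and the random-guess baseline follows directly from the number of options per question (the tasks mix 2-way and 4-way choices, which is why the reported chance level is not a flat 25%). The sketch below is not the official evaluation script; the field names are hypothetical and only illustrate the protocol.

```python
import random

def evaluate(questions, predict):
    """Score multiple-choice predictions and the matching random-guess baseline.

    `questions` is a list of dicts with hypothetical fields:
      'prompt'  - question text plus any visual-prompt description
      'choices' - list of answer options (tasks mix 2-way and 4-way choices)
      'answer'  - index of the correct choice
    `predict` is any callable mapping a question dict to a choice index.
    """
    correct = 0
    chance = 0.0
    for q in questions:
        if predict(q) == q["answer"]:
            correct += 1
        # Expected accuracy of uniform random guessing on this question.
        chance += 1.0 / len(q["choices"])
    n = len(questions)
    return correct / n, chance / n

# Toy usage with a random "model": accuracy should land near the chance baseline.
toy = [
    {"prompt": "Which point is closer to the camera?", "choices": ["A", "B"], "answer": 0},
    {"prompt": "Which crop matches the reference patch?", "choices": ["A", "B", "C", "D"], "answer": 2},
]
acc, baseline = evaluate(toy, lambda q: random.randrange(len(q["choices"])))
print(f"accuracy={acc:.2%}, random-guess baseline={baseline:.2%}")
```

The reported GPT-4V and Gemini scores of 51.26% and 45.72% sit only 13.17 and 7.63 points above this kind of per-question chance baseline¹².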
The BLINK benchmark is designed to stimulate the community to help multimodal LLMs catch up with human-level visual perception¹². Its tasks combine diverse visual prompting, perception beyond recognition, and visual commonsense¹. The benchmark is available on GitHub¹.
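As a rough sketch of getting started, a single BLINK task could be loaded with the Hugging Face `datasets` library. The Hub identifier, config name, and split below are assumptions, not confirmed by the sources; the GitHub repository documents the authoritative download and evaluation instructions.

```python
from datasets import load_dataset

# Assumed Hub ID, task config, and split; verify against the BLINK repository
# (https://github.com/zeyofu/BLINK_Benchmark) before relying on these names.
ds = load_dataset("BLINK-Benchmark/BLINK", "Relative_Depth", split="val")

example = ds[0]
print(example.keys())  # inspect the fields (images, question, choices, answer, ...)
```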
(1) zeyofu/BLINK_Benchmark - GitHub. https://github.com/zeyofu/BLINK_Benchmark
(2) BLINK: Multimodal Large Language Models Can See but Not Perceive. https://arxiv.org/abs/2404.12390
(3) Releases · zeyofu/BLINK_Benchmark - GitHub. https://github.com/zeyofu/BLINK_Benchmark/releases
(4) DOI record for the BLINK paper. https://doi.org/10.48550/arXiv.2404.12390