Measuring CLEVRness: Black-box Testing of Visual Reasoning Models

How can we measure the reasoning capabilities of intelligent systems? Visual question answering provides a convenient framework for testing a model's abilities by interrogating it with questions about a scene. However, despite the abundance of visual QA datasets and architectures, some of which achieve even super-human performance, the question of whether those architectures can actually reason remains open. To answer it, we extend the visual question answering framework and propose the following behavioral test in the form of a two-player game. We consider black-box neural models trained on CLEVR, a diagnostic dataset that benchmarks reasoning. We then train an adversarial player that re-configures the scene to fool the CLEVR model. We show that CLEVR models, which may otherwise perform at a human level, can easily be fooled by our agent. Our results question once more whether data-driven approaches can reason without exploiting the numerous biases that are often present in such datasets. Finally, we propose a controlled experiment that measures how efficiently such models learn and perform reasoning.
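For illustration, a minimal sketch of such a two-player behavioral test loop is given below: a frozen, black-box VQA model is queried while an adversarial player repeatedly re-configures the scene. The interfaces used here (vqa_model, scene_editor, renderer) are assumptions for the sketch, not the authors' actual implementation.

```python
# Hypothetical sketch of the adversarial behavioral test; the model, editor,
# and renderer interfaces are placeholders, not the paper's code.

def adversarial_episode(vqa_model, scene_editor, renderer,
                        scene, question, answer, max_steps=10):
    """Try to find a re-configured scene that fools the black-box VQA model.

    Returns (edited_scene, wrong_prediction) on success, or (None, answer)
    if the adversarial player fails within max_steps edits.
    """
    for _ in range(max_steps):
        # The adversarial player proposes an answer-preserving edit
        # (e.g. moving an object), so the ground-truth answer is unchanged.
        scene = scene_editor.propose_edit(scene, question, answer)
        image = renderer(scene)                        # re-render edited scene
        prediction = vqa_model.predict(image, question)
        if prediction != answer:                       # model fooled
            return scene, prediction
    return None, answer
```

A model that truly reasons about the scene should be invariant to such answer-preserving edits, so the fraction of episodes in which the adversarial player succeeds serves as the behavioral measure of reasoning robustness.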
