We propose a novel multimodal benchmark, the Perception Test, that aims to comprehensively evaluate the perception and reasoning skills of multimodal models. The Perception Test introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics across the visual, audio, and text modalities. The benchmark consists of 11.6k videos with an average length of 23 seconds, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object tracks, point tracks, temporal action segments, temporal sound segments, multiple-choice video question-answers, and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities in zero-shot, few-shot, or fine-tuning regimes. Evaluation results are provided as a multi-dimensional diagnostic report detailing models' strengths and weaknesses across perception skills, computational tasks, and types of reasoning. A preliminary comparison between a human baseline (91.4%) and state-of-the-art video question-answering models (36%) shows a significant performance gap, suggesting that perception is far from being solved. The training and validation splits of the benchmark are publicly available for download at https://github.com/deepmind/perception_test, under a CC-BY license, together with per-task baseline results. We hope that the Perception Test will inspire and contribute to progress towards more general perception models.
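Below is a minimal sketch of how one might score the multiple-choice video question-answering task on this benchmark, assuming the annotations ship as a per-video JSON dictionary. The field names used here (`mc_question`, `options`, `answer_id`) and the filename are assumptions for illustration, not the official schema; consult the repository above for the exact annotation format and evaluation scripts.

```python
import json

# Sketch: compute multiple-choice video QA accuracy from a Perception
# Test-style annotation file. The JSON layout assumed here is hypothetical;
# see https://github.com/deepmind/perception_test for the real format.

def load_annotations(path):
    """Load the per-video annotation dictionary from a JSON file."""
    with open(path, "r") as f:
        return json.load(f)

def mc_vqa_accuracy(annotations, predictions):
    """Compare predicted option indices against ground-truth answers.

    annotations: {video_id: {"mc_question": [{"id": ..., "options": [...],
                                              "answer_id": int}, ...]}}
    predictions: {(video_id, question_id): predicted option index}
    """
    correct, total = 0, 0
    for video_id, video in annotations.items():
        for question in video.get("mc_question", []):
            key = (video_id, question["id"])
            if key in predictions:
                correct += int(predictions[key] == question["answer_id"])
                total += 1
    return correct / max(total, 1)

if __name__ == "__main__":
    anns = load_annotations("valid_annotations.json")  # hypothetical filename
    preds = {}  # fill with model outputs: {(video_id, question_id): option_idx}
    print(f"multiple-choice video QA accuracy: {mc_vqa_accuracy(anns, preds):.3f}")
```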
