We propose the Perception Test, a novel multimodal benchmark that aims to extensively evaluate the perception and reasoning skills of multimodal models. The Perception Test introduces real-world videos designed to show perceptually interesting situations, and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics across the visual, audio, and text modalities. The benchmark consists of 11.6k videos, with an average length of 23 seconds, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object and point tracks, temporal action and sound segments, multiple-choice video question-answers, and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or finetuning regime. Evaluation results are provided as a multi-dimensional diagnostic report detailing models' strengths and weaknesses across perception skills, computational tasks, and types of reasoning. Preliminary results comparing a human baseline to state-of-the-art video question-answering models show a significant performance gap (91.4% vs 36%), suggesting that perception is far from being solved. The training and validation splits of the benchmark are publicly available for download at https://github.com/deepmind/perception_test, under a CC-BY license, together with per-task baseline results. We hope that the Perception Test will inspire and contribute to progress towards more general perception models.
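The headline comparison above (91.4% vs 36%) is accuracy on the multiple-choice video question-answering task. A minimal sketch of how such an accuracy score is computed is shown below; the function and field names (`evaluate_multiple_choice`, `answer_id`, etc.) are hypothetical placeholders, not the benchmark's actual API, for which see the linked repository.

```python
# Sketch of multiple-choice video QA accuracy scoring. All identifiers here
# are illustrative assumptions; the real data loaders and evaluation code
# live in the Perception Test repository.

def evaluate_multiple_choice(examples, predict):
    """Return accuracy of `predict` over multiple-choice questions.

    examples: list of dicts with keys 'question', 'options', 'answer_id'.
    predict:  callable mapping (question, options) -> chosen option index.
    """
    correct = sum(
        predict(ex["question"], ex["options"]) == ex["answer_id"]
        for ex in examples
    )
    return correct / len(examples)

# Toy data (invented for illustration): a trivial model that always picks
# the first option, i.e. a chance-level baseline.
toy_examples = [
    {"question": "What was occluded?", "options": ["cup", "ball", "key"], "answer_id": 1},
    {"question": "Which sound came first?", "options": ["knock", "clap"], "answer_id": 0},
]
accuracy = evaluate_multiple_choice(toy_examples, lambda q, opts: 0)
print(accuracy)  # 0.5
```

A stronger model would replace the lambda with a function that encodes the video, audio, and question before choosing an option; the scoring itself stays the same.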


