MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5%, respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
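The "choose the correct masked-out snippet" step can be illustrated with a small selection loss. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the function name `masked_snippet_loss`, the use of cosine similarity, and the `temperature` value are all hypothetical choices made for illustration. It scores a vector predicted at one MASK position against a set of candidate snippet encodings and applies cross-entropy toward the true snippet.

```python
import numpy as np

def masked_snippet_loss(mask_pred, candidates, target_idx, temperature=0.05):
    """Cross-entropy over candidate snippets for a single MASK position.

    mask_pred:   (d,) vector predicted at the MASK position (hypothetical).
    candidates:  (n, d) encodings of candidate snippets, one of them the true one.
    target_idx:  index of the true masked-out snippet in `candidates`.
    temperature: softmax temperature (assumed value, not from the source).
    Returns (loss, index of the snippet the model would choose).
    """
    # L2-normalize so the dot product is a cosine similarity (an assumption).
    q = mask_pred / np.linalg.norm(mask_pred)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ q / temperature
    # Numerically stable log-softmax over the candidates.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx], int(np.argmax(logits))

# Toy usage: the prediction is a noisy copy of candidate 3.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(8, 16))
mask_pred = candidates[3] + 0.1 * rng.normal(size=16)
loss, choice = masked_snippet_loss(mask_pred, candidates, target_idx=3)
print(f"loss={loss:.4f}, chosen snippet={choice}")
```

Minimizing a loss of this kind pushes the MASK prediction toward the encoding of the true snippet and away from the other candidates, which is one common way to train a model to pick among alternatives.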

Published at CVPR 2022.

Results from the Paper


 Ranked #1 on Action Classification on Kinetics-600 (using extra training data)

Task                   Dataset       Model                           Metric Name     Metric Value  Global Rank
Action Classification  Kinetics-600  MerlotReserve-Base (no Audio)   Top-1 Accuracy  88.1          #9
Action Classification  Kinetics-600  MerlotReserve-Base (no Audio)   Top-5 Accuracy  95.8          #24
Action Classification  Kinetics-600  MerlotReserve-Base (+Audio)     Top-1 Accuracy  89.7          #3
Action Classification  Kinetics-600  MerlotReserve-Base (+Audio)     Top-5 Accuracy  96.6          #13
Action Classification  Kinetics-600  MerlotReserve-Large (no Audio)  Top-1 Accuracy  89.4          #4
Action Classification  Kinetics-600  MerlotReserve-Large (no Audio)  Top-5 Accuracy  96.3          #19
Action Classification  Kinetics-600  MerlotReserve-Large (+Audio)    Top-1 Accuracy  91.1          #1
Action Classification  Kinetics-600  MerlotReserve-Large (+Audio)    Top-5 Accuracy  97.1          #10
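
The results above are reported as Top-1 and Top-5 accuracy. As a reminder of what those metrics measure, here is a minimal sketch of the standard top-k accuracy computation. The names `top_k_accuracy`, `scores`, and `labels` are hypothetical, and this is not the benchmark's own evaluation code.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Percentage of examples whose true label is among the k highest scores.

    scores: (n_examples, n_classes) array of per-class scores.
    labels: (n_examples,) array of true class indices.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row; order within the top k is irrelevant.
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy usage with three examples and three classes.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

Top-5 accuracy is always at least as high as Top-1 accuracy, since any example counted as correct at k=1 is also correct at k=5.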
