SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

22 Jul 2024  ·  Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as much spatial detail as possible (e.g., with 12x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for detailed video understanding. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets. Code has been made available at: https://github.com/apple/ml-slowfast-llava.
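
The two-stream aggregation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation. Only the 12x24 Slow token grid and the 6x Fast downsampling come from the abstract; the 24x24 per-frame feature grid, the 50 input frames, the 10 Slow frames, the use of average pooling, and the function names `avg_pool_spatial` and `slowfast_tokens` are assumptions made for illustration.

```python
import numpy as np


def avg_pool_spatial(feats, stride_h, stride_w):
    """Average-pool (T, H, W, C) features over non-overlapping spatial windows."""
    t, h, w, c = feats.shape
    assert h % stride_h == 0 and w % stride_w == 0, "strides must divide H and W"
    windows = feats.reshape(t, h // stride_h, stride_h, w // stride_w, stride_w, c)
    return windows.mean(axis=(2, 4))


def slowfast_tokens(feats, num_slow_frames=10, slow_stride=(2, 1), fast_stride=(6, 6)):
    """Build one token sequence from a Slow and a Fast pathway (hypothetical helper).

    feats: per-frame visual features of shape (T, H, W, C).
    Slow pathway: few uniformly sampled frames, light spatial pooling.
    Fast pathway: all frames, aggressive spatial pooling.
    The default frame count and strides are illustrative assumptions.
    """
    t, _, _, c = feats.shape
    # Uniformly pick a low-frame-rate subset for the Slow pathway.
    slow_idx = np.linspace(0, t - 1, num_slow_frames).round().astype(int)
    slow = avg_pool_spatial(feats[slow_idx], *slow_stride)  # e.g. (10, 12, 24, C)
    fast = avg_pool_spatial(feats, *fast_stride)            # e.g. (50, 4, 4, C)
    # Flatten both pathways and concatenate them into one sequence for the LLM.
    return np.concatenate([slow.reshape(-1, c), fast.reshape(-1, c)], axis=0)


if __name__ == "__main__":
    frames = np.random.default_rng(0).standard_normal((50, 24, 24, 8))
    print(slowfast_tokens(frames).shape)
```

Under these assumed settings, the Slow pathway contributes 10 x 12 x 24 = 2880 tokens and the Fast pathway 50 x 4 x 4 = 800 tokens, for 3680 tokens in total. Feeding all 50 frames at the full 24x24 resolution would instead require 28800 tokens, which is the budget problem the two-stream design is meant to avoid.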

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
--- | --- | --- | --- | --- | ---
Zero-Shot Video Question Answer | ActivityNet-QA | SlowFast-LLaVA-34B | Confidence Score | 3.5 | # 7
Zero-Shot Video Question Answer | ActivityNet-QA | SlowFast-LLaVA-34B | Accuracy | 59.2 | # 5
Zero-Shot Video Question Answer | EgoSchema (subset) | SlowFast-LLaVA-34B | Accuracy | 47.2 | # 11
Zero-Shot Video Question Answer | IntentQA | SlowFast-LLaVA-34B | Accuracy | 60.1 | # 7
Zero-Shot Video Question Answer | MSRVTT-QA | SlowFast-LLaVA-34B | Accuracy | 67.4 | # 4
Zero-Shot Video Question Answer | MSRVTT-QA | SlowFast-LLaVA-34B | Confidence Score | 3.7 | # 1
Zero-Shot Video Question Answer | MSVD-QA | SlowFast-LLaVA-34B | Accuracy | 79.9 | # 4
Zero-Shot Video Question Answer | MSVD-QA | SlowFast-LLaVA-34B | Confidence Score | 4.1 | # 3
Zero-Shot Video Question Answer | NExT-QA | SlowFast-LLaVA-34B | Accuracy | 64.2 | # 14
Zero-Shot Video Question Answer | TGIF-QA | SlowFast-LLaVA-34B | Accuracy | 80.6 | # 3
Zero-Shot Video Question Answer | TGIF-QA | SlowFast-LLaVA-34B | Confidence Score | 4.3 | # 2
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | SlowFast-LLaVA-34B | gpt-score | 3.48 | # 4
Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | SlowFast-LLaVA-34B | gpt-score | 2.77 | # 4
Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | SlowFast-LLaVA-34B | gpt-score | 3.84 | # 4
Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | SlowFast-LLaVA-34B | gpt-score | 2.96 | # 8
Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | SlowFast-LLaVA-34B | gpt-score | 3.57 | # 3
Video-based Generative Performance Benchmarking | VideoInstruct | SlowFast-LLaVA-34B | mean | 3.32 | # 4

Methods