Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

16 Jun 2022 · Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
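
Step (iii) of the abstract, answering a question by filling in a masked token, can be illustrated with a small sketch. This is not the authors' implementation: the prompt template, the function names `build_prompt` and `answer_from_mask_logits`, and the toy logits are all assumptions made for illustration, and the video input is left out entirely. The sketch only shows the decoding idea: take the scores a masked language model assigns at the mask position, restrict them to a vocabulary of candidate answers, and return the most probable one.

```python
import math
from typing import Dict, List, Tuple

MASK = "[MASK]"  # assumed mask token; the real token depends on the language model


def build_prompt(question: str) -> str:
    """Format a question so that the answer is a single masked slot (hypothetical template)."""
    return f"Question: {question} Answer: {MASK}."


def answer_from_mask_logits(
    mask_logits: Dict[str, float], answer_vocab: List[str]
) -> Tuple[str, float]:
    """Softmax the mask-position logits over the answer vocabulary only.

    Returns the top candidate and its probability within that vocabulary.
    """
    candidates = [a for a in answer_vocab if a in mask_logits]
    if not candidates:
        raise ValueError("no candidate answer has a score at the mask position")
    scores = [mask_logits[a] for a in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs[best]


# Toy scores standing in for a model's output at the mask position (made-up numbers).
toy_logits = {"guitar": 4.0, "piano": 2.5, "the": 6.0, "drums": 1.0}
prompt = build_prompt("What instrument is the man playing?")
answer, prob = answer_from_mask_logits(toy_logits, ["guitar", "piano", "drums"])
print(prompt)
print(answer, round(prob, 3))
```

In the toy scores, "the" has the highest raw logit but is not a candidate answer, so restricting the softmax to the answer vocabulary changes which token is returned.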

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | FrozenBiLM (0-shot) | Accuracy | 25.9 | #29 |
| Zero-Shot Video Question Answer | ActivityNet-QA | FrozenBiLM | Confidence Score | - | #16 |
| Zero-Shot Video Question Answer | ActivityNet-QA | FrozenBiLM | Accuracy | 24.7 | #15 |
| Video Question Answering | ActivityNet-QA | FrozenBiLM | Accuracy | 43.2 | #18 |
| Zero-Shot Video Question Answer | EgoSchema (fullset) | FrozenBiLM | Accuracy | 26.9 | #8 |
| Video Question Answering | How2QA | FrozenBiLM | Accuracy | 86.7 | #2 |
| Video Question Answering | How2QA | FrozenBiLM (0-shot) | Accuracy | 58.4 | #7 |
| Zero-Shot Learning | iVQA | FrozenBiLM | Accuracy | 0.268 | #1 |
| Video Question Answering | iVQA | FrozenBiLM | Accuracy | 39.6 | #2 |
| Video Question Answering | iVQA | FrozenBiLM (0-shot) | Accuracy | 26.8 | #6 |
| Zero-Shot Learning | LSMDC | FrozenBiLM | Accuracy | 51.5 | #1 |
| Fill Mask | LSMDC | FrozenBiLM | Accuracy | 63.5 | #1 |
| Video Question Answering | MSRVTT-QA | FrozenBiLM (0-shot) | Accuracy | 16.7 | #11 |
| Zeroshot Video Question Answer | MSRVTT-QA | FrozenBiLM | Accuracy | 16.8 | #1 |
| Zeroshot Video Question Answer | MSRVTT-QA | FrozenBiLM | Confidence Score | - | #1 |
| Video Question Answering | MSRVTT-QA | FrozenBiLM | Accuracy | 47.0 | #5 |
| Visual Question Answering | MSRVTT-QA | FrozenBiLM | Accuracy | 0.470 | #1 |
| Visual Question Answering | MSVD-QA | FrozenBiLM | Accuracy | 0.548 | #1 |
| Zeroshot Video Question Answer | MSVD-QA | FrozenBiLM | Accuracy | 32.2 | #1 |
| Zeroshot Video Question Answer | MSVD-QA | FrozenBiLM | Confidence Score | - | #1 |
| Zero-Shot Video Question Answer | MSVD-QA | FrozenBiLM | Accuracy | 33.8 | #14 |
| Zero-Shot Video Question Answer | TGIF-QA | FrozenBiLM | Accuracy | 41.9 | #5 |
| Zeroshot Video Question Answer | TGIF-QA | FrozenBiLM | Accuracy | 41.0 | #1 |
| Zeroshot Video Question Answer | TGIF-QA | FrozenBiLM | Confidence Score | - | #1 |
| TGIF-Frame | TGIF-QA | FrozenBiLM | Accuracy | 68.6 | #11 |
| Zero-Shot Video Question Answer | TVQA | FrozenBiLM (with speech) | Accuracy | 59.7 | #1 |
| Video Question Answering | TVQA | FrozenBiLM | Accuracy | 82 | #2 |
| Zero-Shot Video Question Answer | TVQA | FrozenBiLM (no speech) | Accuracy | 29.7 | #5 |

Methods