TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	ActivityNet-QA	IG-VLM	Confidence Score	3.5	# 3
Zero-Shot Video Question Answer	ActivityNet-QA	IG-VLM	Accuracy	58.4	# 2
Zero-Shot Video Question Answer	IntentQA	IG-VLM	Accuracy	65.3	# 1
Zero-Shot Video Question Answer	MSRVTT-QA	IG-VLM	Accuracy	63.8	# 3
Zero-Shot Video Question Answer	MSRVTT-QA	IG-VLM	Confidence Score	3.5	# 2
Zero-Shot Video Question Answer	MSVD-QA	IG-VLM	Accuracy	79.6	# 2
Zero-Shot Video Question Answer	MSVD-QA	IG-VLM	Confidence Score	4.1	# 2
Zero-Shot Video Question Answer	NExT-QA	IG-VLM	Accuracy	70.9	# 2
Zero-Shot Video Question Answer	STAR Benchmark	IG-VLM	Accuracy	53.0	# 2
Zero-Shot Video Question Answer	TGIF-QA	IG-VLM	Accuracy	79.1	# 2
Zero-Shot Video Question Answer	TGIF-QA	IG-VLM	Confidence Score	4.2	# 2
Zero-Shot Video Question Answer	TVQA	IG-VLM (no speech)	Accuracy	57.8	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	IG-VLM	Correctness of Information	3.40	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	IG-VLM	Detail Orientation	2.80	# 10
Video-based Generative Performance Benchmarking	VideoInstruct	IG-VLM	Contextual Understanding	3.61	# 3
Video-based Generative Performance Benchmarking	VideoInstruct	IG-VLM	Temporal Understanding	2.89	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	IG-VLM	Consistency	3.13	# 3
Video-based Generative Performance Benchmarking	VideoInstruct	IG-VLM	mean	3.17	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zero-shot-video-question-answer-on-intentqa)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-intentqa?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zero-shot-video-question-answer-on-next-qa)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-next-qa?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zero-shot-video-question-answer-on-star-1)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-star-1?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zeroshot-video-question-answer-on-tgif-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-tgif-qa?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zero-shot-video-question-answer-on-tvqa)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-tvqa?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=an-image-grid-can-be-worth-a-video-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-grid-can-be-worth-a-video-zero-shot/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=an-image-grid-can-be-worth-a-video-zero-shot)`

An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

27 Mar 2024 · Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in a grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot video question answering benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out of ten benchmarks.
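The grid construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how frames are sampled or how large the grid is, so the midpoint sampling rule, the 2 x 3 layout in the usage note, and the names `sample_frame_indices` and `make_image_grid` are all assumptions made for illustration. Frames are modelled as plain nested lists of pixel values so that the example needs no external libraries.

```python
from typing import List, Sequence

# Illustrative stand-in for a decoded frame: rows of grayscale pixel values.
Frame = List[List[int]]


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick indices spread uniformly over the video (assumed sampling rule).

    The video is split into num_samples equal segments and the frame nearest
    the midpoint of each segment is chosen.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def make_image_grid(frames: Sequence[Frame], rows: int, cols: int) -> Frame:
    """Tile rows * cols equally sized frames into one composite image.

    Frames are placed in row-major (reading) order, so temporal order maps
    to left-to-right, top-to-bottom position inside the grid.
    """
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    height = len(frames[0])
    grid: Frame = []
    for r in range(rows):
        for y in range(height):
            line: List[int] = []
            for c in range(cols):
                # Append pixel row y of the frame at grid cell (r, c).
                line.extend(frames[r * cols + c][y])
            grid.append(line)
    return grid
```

As a usage example under the same assumptions, `sample_frame_indices(60, 6)` selects six indices from a 60-frame clip, and passing the corresponding six frames to `make_image_grid(frames, rows=2, cols=3)` yields one image twice as tall and three times as wide as a single frame. That composite image, together with the question, is what a single VLM would receive.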

Code

imagegridworth/IG-VLM official

Tasks

Language Modelling

Multiple-choice

Question Answering

Video-based Generative Performance Benchmarking

Video Question Answering

Zero-Shot Video Question Answer

Datasets

TVQA

ActivityNet-QA

TGIF-QA

NExT-QA

MSRVTT-QA

MSVD-QA

EgoSchema

VideoInstruct

STAR Benchmark

IntentQA

Results from the Paper

Ranked #1 on Zero-Shot Video Question Answer on IntentQA
