TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE (%)	GLOBAL RANK
Video Question Answering	ActivityNet-QA	VIOLET+	Accuracy	39.7	# 25
Video Question Answering	ActivityNet-QA	All-in-one+	Accuracy	40.0	# 24
Video Question Answering	ActivityNet-QA	FrozenBiLM+	Accuracy	44.8	# 18
Visual Question Answering (VQA)	MSRVTT-QA	FrozenBiLM+	Accuracy	47.0	# 8
Visual Question Answering (VQA)	MSRVTT-QA	JustAsk+	Accuracy	41.8	# 22
Visual Question Answering (VQA)	MSRVTT-QA	All-in-one+	Accuracy	39.5	# 23
Visual Question Answering (VQA)	MSVD-QA	All-in-one+	Accuracy	43.8	# 28
Visual Question Answering (VQA)	MSVD-QA	FrozenBiLM+	Accuracy	55.8	# 10
Visual Question Answering (VQA)	MSVD-QA	VIOLET+	Accuracy	49.5	# 22
Visual Question Answering (VQA)	MSVD-QA	JustAsk+	Accuracy	47.7	# 26
TGIF-Frame	TGIF-QA	FrozenBiLM+	Accuracy	69.0	# 10
TGIF-Frame	TGIF-QA	VIOLET+	Accuracy	65.3	# 14
TGIF-Frame	TGIF-QA	JustAsk+	Accuracy	57.4	# 17
TGIF-Frame	TGIF-QA	All-in-one+	Accuracy	66.0	# 13

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/open-vocabulary-video-question-answering-a/visual-question-answering-on-msrvtt-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1?p=open-vocabulary-video-question-answering-a)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/open-vocabulary-video-question-answering-a/visual-question-answering-on-msvd-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msvd-qa-1?p=open-vocabulary-video-question-answering-a)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/open-vocabulary-video-question-answering-a/tgif-frame-on-tgif-qa)](https://paperswithcode.com/sota/tgif-frame-on-tgif-qa?p=open-vocabulary-video-question-answering-a)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/open-vocabulary-video-question-answering-a/video-question-answering-on-activitynet-qa)](https://paperswithcode.com/sota/video-question-answering-on-activitynet-qa?p=open-vocabulary-video-question-answering-a)`

Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models

ICCV 2023 · Dohwan Ko, Ji Soo Lee, Miso Choi, Jaewon Chu, Jihwan Park, Hyunwoo J. Kim ·

Video Question Answering (VideoQA) is a challenging task that entails complex multi-modal reasoning. In contrast to multiple-choice VideoQA, which aims to predict the answer from several given options, open-ended VideoQA aims to answer questions without restricting the candidate answers. However, most previous VideoQA models formulate open-ended VideoQA as a classification task that assigns each video-question pair to a fixed answer set, i.e., a closed vocabulary, which contains only frequent answers (e.g., the top-1000 answers). This biases the model toward frequent answers and causes it to fail to generalize to out-of-vocabulary answers. We hence propose a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models by considering rare and unseen answers. In addition, to improve the model's generalization power, we introduce a novel GNN-based soft verbalizer that enhances predictions on rare and unseen answers by aggregating information from similar words. For evaluation, we introduce new baselines by modifying existing (closed-vocabulary) open-ended VideoQA models and improve their performance by further taking rare and unseen answers into account. Our ablation studies and qualitative analyses demonstrate that the GNN-based soft verbalizer further improves model performance, especially on rare and unseen answers. We hope that the OVQA benchmark can serve as a guide for evaluating the generalizability of VideoQA models and inspire future research. Code is available at https://github.com/mlvlab/OVQA.
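The idea of aggregating information from similar words can be illustrated with a small sketch. This is not the authors' implementation: the paper's soft verbalizer is a learned GNN, whereas the code below uses a single round of fixed neighbor averaging over a cosine-similarity graph, with no trainable weights. The function names (`cosine`, `smooth_answer_embeddings`), the parameters `k` and `alpha`, and the toy two-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def smooth_answer_embeddings(emb, k=1, alpha=0.5):
    """Blend each answer embedding with the mean of its k most similar answers.

    emb   -- dict mapping answer word -> list of floats (hypothetical toy data)
    k     -- number of nearest neighbors in the similarity graph (assumed value)
    alpha -- weight given to the neighbor mean (assumed value)
    """
    smoothed = {}
    for word, vec in emb.items():
        # Rank every other answer by similarity to this one and keep the top k.
        neighbors = sorted(
            (w for w in emb if w != word),
            key=lambda w: cosine(vec, emb[w]),
            reverse=True,
        )[:k]
        if not neighbors:
            smoothed[word] = list(vec)
            continue
        # Mean of the neighbor embeddings, one coordinate at a time.
        mean = [
            sum(emb[n][i] for n in neighbors) / len(neighbors)
            for i in range(len(vec))
        ]
        smoothed[word] = [(1 - alpha) * x + alpha * m for x, m in zip(vec, mean)]
    return smoothed


# Toy answer embeddings: "puppy" stands in for a rare answer whose embedding
# is poorly estimated, "dog" for a frequent, well-estimated one.
answers = {
    "dog": [1.0, 0.0],
    "puppy": [0.6, 0.2],
    "car": [0.0, 1.0],
}
smoothed = smooth_answer_embeddings(answers, k=1, alpha=0.5)
```

In this toy setup the rare answer "puppy" is pulled toward its nearest neighbor "dog", which is the intuition behind letting rare and unseen answers borrow signal from similar, better-represented words. A learned GNN would replace the fixed averaging with trained message-passing layers.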

Code

mlvlab/ovqa official

Tasks

Multiple-choice

Question Answering

TGIF-Frame

Video Question Answering

Visual Question Answering (VQA)

Datasets

GQA

ActivityNet-QA

TGIF-QA

MSRVTT-QA

MSVD-QA

Results from the Paper

Ranked #8 on Visual Question Answering (VQA) on MSRVTT-QA

