TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Question Answering	DramaQA	LLaMA-VQA	Accuracy	84.1	# 1
Video Question Answering	NExT-QA	LLaMA-VQA (33B)	Accuracy	75.5	# 2
Video Question Answering	STAR Benchmark	LLaMA-VQA	Average Accuracy	65.4	# 2
Video Question Answering	TVQA	LLaMA-VQA	Accuracy	82.2	# 1
Video Question Answering	VLEP	LLaMA-VQA	Accuracy	71.0	# 1

Large Language Models are Temporal and Causal Reasoners for Video Question Answering

24 Oct 2023 · Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim

Large Language Models (LLMs) have shown remarkable performance on a wide range of natural language understanding and generation tasks. We observe that LLMs provide effective priors in exploiting linguistic shortcuts for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, i.e., linguistic bias, while ignoring visual content. This is also known as 'ungrounded guesses' or 'hallucinations'. To address this problem while leveraging the LLMs' prior on VideoQA, we propose a novel framework, Flipped-VQA, which encourages the model to predict all combinations of the $\langle V, Q, A \rangle$ triplet by flipping the source pair and the target label to understand their complex relationships, i.e., to predict A, Q, and V given VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLM-based and non-LLM-based models on five challenging VideoQA benchmarks. Furthermore, Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performance. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates the linguistic bias that causes incorrect answers through over-reliance on the question. Code is available at https://github.com/mlvlab/Flipped-VQA.
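
The flipping idea in the abstract can be illustrated with a small sketch. It only enumerates the three source-pair/target combinations that the abstract names (VQ to A, VA to Q, QA to V) for a single triplet; it is not the authors' implementation, and it says nothing about how each task is trained or how the losses are combined. The names `FlippedTask` and `flipped_tasks` are hypothetical and chosen for illustration.

```python
from typing import Any, Dict, List, NamedTuple


class FlippedTask(NamedTuple):
    """One prediction task derived from a <V, Q, A> triplet (hypothetical type)."""
    name: str                # e.g. "VQ->A"
    source: Dict[str, Any]   # the pair given to the model as input
    target_key: str          # which element must be predicted: "A", "Q", or "V"
    target: Any              # the held-out element itself


def flipped_tasks(video: Any, question: Any, answer: Any) -> List[FlippedTask]:
    """Enumerate the three tasks obtained by flipping source pair and target.

    Assumption: each element of the triplet is treated as an opaque object;
    a real system would pass visual features and token sequences instead.
    """
    triplet = {"V": video, "Q": question, "A": answer}
    tasks = []
    # Hold out each element in turn and use the remaining two as the source.
    for target_key in ("A", "Q", "V"):
        source = {k: v for k, v in triplet.items() if k != target_key}
        name = "".join(source) + "->" + target_key
        tasks.append(FlippedTask(name, source, target_key, triplet[target_key]))
    return tasks


if __name__ == "__main__":
    for task in flipped_tasks("<video frames>", "What happens next?", "She leaves"):
        print(task.name, "| given:", list(task.source), "| predict:", task.target_key)
```

The first task, VQ to A, is the ordinary VideoQA setting; the other two are the flipped variants that the abstract credits with reducing over-reliance on the question.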

Code

mlvlab/Flipped-VQA official

Tasks

Natural Language Understanding

Question Answering

Video Question Answering

Visual Question Answering (VQA)

Datasets

TVQA

NExT-QA

STAR Benchmark

DramaQA

VLEP

Results from the Paper

Ranked #1 on Video Question Answering on TVQA

Methods

LLaMA
