iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

16 Nov 2020 · Aman Chadha, Gurneet Arora, Navpreet Kaloty

Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which, in some cases, fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention. Part of what defines us as human and fundamentally different from machines is our instinct to seek causality behind any association, say, an event Y that happened as a direct result of an event X. To this end, we propose iPerceive, a framework capable of understanding the "why" between events in a video by building a common-sense knowledge base using contextual cues to infer causal relationships between objects in the video. We demonstrate the effectiveness of our technique using the dense video captioning (DVC) and video question answering (VideoQA) tasks. Furthermore, while most prior work in DVC and VideoQA relies solely on visual information, other modalities such as audio and speech are vital for a human observer's perception of an environment. We formulate the DVC and VideoQA tasks as machine translation problems that utilize multiple modalities. By evaluating the performance of iPerceive DVC and iPerceive VideoQA on the ActivityNet Captions and TVQA datasets respectively, we show that our approach advances the state of the art. Code and samples are available at: iperceive.amanchadha.com.
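To make the multi-modal "machine translation" formulation concrete, the sketch below shows one way per-modality features (visual, audio, speech) can be projected into a shared space, encoded, and decoded into caption tokens with a standard Transformer. The feature dimensions, module names, and vocabulary size are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Minimal sketch of a multi-modal, translation-style captioner (PyTorch).
# All dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    def __init__(self, d_visual=1024, d_audio=128, d_speech=768,
                 d_model=512, vocab_size=10000, n_heads=8, n_layers=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.proj = nn.ModuleDict({
            "visual": nn.Linear(d_visual, d_model),
            "audio": nn.Linear(d_audio, d_model),
            "speech": nn.Linear(d_speech, d_model),
        })
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual, audio, speech, caption_tokens):
        # Treat the concatenated multi-modal feature sequence as the "source" language.
        src = torch.cat([self.proj["visual"](visual),
                         self.proj["audio"](audio),
                         self.proj["speech"](speech)], dim=1)
        memory = self.encoder(src)
        # Decode caption tokens as the "target" language, attending to all modalities.
        tgt = self.tok_emb(caption_tokens)
        seq_len = tgt.size(1)
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)


# Toy usage: a batch of 2 clips with 30 visual, 50 audio, and 20 speech feature steps.
model = MultiModalCaptioner()
logits = model(torch.randn(2, 30, 1024), torch.randn(2, 50, 128),
               torch.randn(2, 20, 768), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```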


Results from the Paper


| Task                     | Dataset              | Model                            | Metric   | Value | Global Rank |
|--------------------------|----------------------|----------------------------------|----------|-------|-------------|
| Dense Video Captioning   | ActivityNet Captions | iPerceive (Chadha et al., 2020)  | METEOR   | 7.87  | #10         |
| Dense Video Captioning   | ActivityNet Captions | iPerceive (Chadha et al., 2020)  | BLEU-3   | 2.93  | #3          |
| Dense Video Captioning   | ActivityNet Captions | iPerceive (Chadha et al., 2020)  | BLEU-4   | 1.29  | #5          |
| Video Question Answering | TVQA                 | iPerceive (Chadha et al., 2020)  | Accuracy | 76.96 | #4          |
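For context on the captioning metrics reported above, the following is a toy illustration of BLEU-3/BLEU-4 scoring using NLTK's reference implementation. The example sentences are invented, and the official ActivityNet Captions evaluation aggregates scores over ground-truth proposals and temporal IoU thresholds rather than single sentence pairs.

```python
# Toy BLEU-3 / BLEU-4 computation with NLTK; sentences are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a man is playing a guitar on stage".split()]
hypothesis = "a man plays guitar on a stage".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu3 = sentence_bleu(reference, hypothesis, weights=(1/3, 1/3, 1/3),
                      smoothing_function=smooth)
bleu4 = sentence_bleu(reference, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-3: {100 * bleu3:.2f}")
print(f"BLEU-4: {100 * bleu4:.2f}")
```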

Methods


No methods listed for this paper.