TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Dense Video Captioning	ActivityNet Captions	CM²	METEOR	8.55	# 8
Dense Video Captioning	ActivityNet Captions	CM²	CIDEr	33.01	# 2
Dense Video Captioning	ActivityNet Captions	CM²	SODA	6.18	# 2
Dense Video Captioning	ActivityNet Captions	CM²	BLEU4	2.38	# 1
Dense Video Captioning	ActivityNet Captions	CM²	F1	55.21	# 1
Dense Video Captioning	ActivityNet Captions	CM²	Recall	53.71	# 1
Dense Video Captioning	ActivityNet Captions	CM²	Precision	56.81	# 1
Dense Video Captioning	YouCook2	CM²	METEOR	6.08	# 3
Dense Video Captioning	YouCook2	CM²	CIDEr	31.66	# 3
Dense Video Captioning	YouCook2	CM²	BLEU4	1.63	# 1
Dense Video Captioning	YouCook2	CM²	SODA	5.34	# 3
Dense Video Captioning	YouCook2	CM²	F1	28.43	# 1
Dense Video Captioning	YouCook2	CM²	Recall	24.76	# 1
Dense Video Captioning	YouCook2	CM²	Precision	33.38	# 1
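
As a quick consistency check on the localization metrics above, F1 is conventionally the harmonic mean of precision and recall. The sketch below assumes that convention applies to the reported numbers; the page does not state how F1 was computed, and the function name is purely illustrative. Under that assumption, the recomputed values agree with the reported F1 to within about 0.01 on both datasets. The small residual may come from rounding or from averaging over IoU thresholds.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same 0-100 scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table above.
reported = {
    "ActivityNet Captions": (56.81, 53.71, 55.21),
    "YouCook2": (33.38, 24.76, 28.43),
}

for dataset, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from_precision_recall(precision, recall)
    print(f"{dataset}: recomputed F1 = {f1:.2f}, reported F1 = {f1_reported:.2f}")
```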

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/do-you-remember-dense-video-captioning-with/dense-video-captioning-on-youcook2)](https://paperswithcode.com/sota/dense-video-captioning-on-youcook2?p=do-you-remember-dense-video-captioning-with)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/do-you-remember-dense-video-captioning-with/dense-video-captioning-on-activitynet)](https://paperswithcode.com/sota/dense-video-captioning-on-activitynet?p=do-you-remember-dense-video-captioning-with)`

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

11 Apr 2024 · Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has attracted significant research attention. Several studies formulate it as a multitask problem of event localization and event captioning in order to exploit inter-task relations. However, addressing both tasks using only visual input is challenging because visual features alone lack semantic content. In this study, we address this limitation with a novel framework inspired by human cognitive information processing. Our model uses an external memory to incorporate prior knowledge, and retrieves from it through cross-modal video-to-text matching. To incorporate the retrieved text features effectively, we design a versatile encoder and a decoder with visual and textual cross-attention modules. Comparative experiments on the ActivityNet Captions and YouCook2 datasets show the effectiveness of the proposed method, and the results show promising performance without extensive pretraining on a large video dataset.
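
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that video-segment features and sentence features already live in a shared embedding space (for example, from a pretrained vision-language encoder), and it uses cosine-similarity top-k lookup followed by mean pooling as a stand-in for video-to-text matching. The function names, dimensions, and random data are all hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def retrieve_text_memory(video_feats: np.ndarray,
                         memory_feats: np.ndarray,
                         top_k: int = 3):
    """For each video segment, fetch the top_k most similar text memory entries.

    video_feats:  (num_segments, dim) segment embeddings (assumed shared space)
    memory_feats: (memory_size, dim) sentence embeddings forming the memory
    Returns the indices of the retrieved entries and their mean-pooled features.
    """
    sim = l2_normalize(video_feats) @ l2_normalize(memory_feats).T
    top_idx = np.argsort(-sim, axis=1)[:, :top_k]   # (num_segments, top_k)
    retrieved = memory_feats[top_idx]               # (num_segments, top_k, dim)
    return top_idx, retrieved.mean(axis=1)          # pooled: (num_segments, dim)


# Toy data: the two "video" features are noisy copies of memory entries 5 and 42.
rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 16))
video = memory[[5, 42]] + 0.01 * rng.normal(size=(2, 16))

idx, pooled = retrieve_text_memory(video, memory, top_k=3)
print("retrieved indices per segment:", idx.tolist())
print("pooled text feature shape:", pooled.shape)
```

In a full model, the pooled text features would be passed alongside the visual features to the encoder and to the decoder's textual cross-attention. Mean pooling is only one simple way to aggregate the retrieved entries.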

Code

ailab-kyunghee/cm2_dvc official

faceonlive/ai-research

Tasks

Dense Video Captioning

Retrieval

Text Matching

Video Captioning

Datasets

ActivityNet Captions

YouCook2

Results from the Paper

Ranked #3 on Dense Video Captioning on YouCook2
