RTQ: Rethinking Video-language Understanding Based on Image-text Model

1 Dec 2023  ·  Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, Liqiang Nie

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods. Code is available at https://github.com/SCZwangxiao/RTQ-MM2023.
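Below is a minimal, hypothetical sketch of the three stages named in the abstract (refine redundant tokens within frames, model temporal relations across frames, query task-specific information). Module names, dimensions, and the token-selection heuristic are illustrative assumptions rather than the authors' implementation; refer to the linked repository for the official code.

```python
# Hypothetical sketch of the RTQ pipeline (refine -> temporal model -> query).
# All module names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn


class RTQSketch(nn.Module):
    def __init__(self, dim=512, keep_tokens=64, num_queries=32):
        super().__init__()
        # Refine: score per-frame patch tokens and keep only the most salient ones.
        self.token_scorer = nn.Linear(dim, 1)
        self.keep_tokens = keep_tokens
        # Temporal model: self-attention over the retained tokens of all frames.
        self.temporal_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Query: learnable task queries cross-attend to the video tokens.
        self.task_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, patches, dim) from an image-text encoder.
        b, t, p, d = frame_tokens.shape

        # 1) Refine: drop redundant patch tokens within each frame.
        scores = self.token_scorer(frame_tokens).squeeze(-1)          # (b, t, p)
        top_idx = scores.topk(self.keep_tokens, dim=-1).indices       # (b, t, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, d)
        refined = frame_tokens.gather(2, gather_idx)                  # (b, t, k, d)

        # 2) Temporal model: relate the retained tokens across frames.
        video_tokens = self.temporal_encoder(refined.flatten(1, 2))   # (b, t*k, d)

        # 3) Query: extract task-specific information with learnable queries.
        queries = self.task_queries.unsqueeze(0).expand(b, -1, -1)    # (b, q, d)
        task_features, _ = self.cross_attn(queries, video_tokens, video_tokens)
        return task_features                                          # (b, q, d)


if __name__ == "__main__":
    model = RTQSketch()
    dummy = torch.randn(2, 8, 196, 512)  # 2 videos, 8 frames, 196 patches each
    print(model(dummy).shape)            # torch.Size([2, 32, 512])
```

The task-specific head (retrieval, captioning, or question answering) would consume the query outputs; that part is omitted here.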


Results from the Paper


Task                       Dataset      Model  Metric              Value  Global Rank
Video Retrieval            ActivityNet  RTQ    text-to-video R@1   53.5   #12
Video Retrieval            ActivityNet  RTQ    text-to-video R@5   81.4   #7
Video Retrieval            ActivityNet  RTQ    text-to-video R@10  91.9   #7
Video Retrieval            DiDeMo       RTQ    text-to-video R@1   57.6   #10
Video Retrieval            DiDeMo       RTQ    text-to-video R@5   84.1   #7
Video Retrieval            DiDeMo       RTQ    text-to-video R@10  89.9   #7
Video Captioning           MSR-VTT      RTQ    CIDEr               69.3   #9
Video Captioning           MSR-VTT      RTQ    ROUGE-L             66.1   #6
Video Captioning           MSR-VTT      RTQ    BLEU-4              49.6   #8
Video Retrieval            MSR-VTT-1kA  RTQ    text-to-video R@1   53.4   #9
Video Retrieval            MSR-VTT-1kA  RTQ    text-to-video R@5   76.1   #13
Video Retrieval            MSR-VTT-1kA  RTQ    text-to-video R@10  84.4   #15
Video Captioning           MSVD         RTQ    CIDEr               123.4  #9
Video Captioning           MSVD         RTQ    BLEU-4              66.9   #6
Video Captioning           MSVD         RTQ    ROUGE-L             82.2   #4
Video Question Answering   NExT-QA      RTQ    Accuracy            63.2   #10
