TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	DiDeMo	STAN	text-to-video R@1	54.6	# 15
Video Retrieval	DiDeMo	STAN	text-to-video R@5	78.4	# 17
Video Retrieval	DiDeMo	STAN	text-to-video R@10	85.1	# 20
Video Retrieval	DiDeMo	STAN	text-to-video Median Rank	1	# 1
Video Retrieval	LSMDC	STAN	text-to-video R@1	29.2	# 11
Video Retrieval	LSMDC	STAN	text-to-video R@5	49.5	# 8
Video Retrieval	LSMDC	STAN	text-to-video R@10	58.8	# 8
Video Retrieval	LSMDC	STAN	text-to-video Median Rank	6	# 3
Video Retrieval	MSR-VTT-1kA	STAN	text-to-video R@1	54.1	# 7
Video Retrieval	MSR-VTT-1kA	STAN	text-to-video R@5	79.5	# 5
Video Retrieval	MSR-VTT-1kA	STAN	text-to-video R@10	87.8	# 4
Video Retrieval	MSR-VTT-1kA	STAN	text-to-video Median Rank	1	# 1
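
The R@K and Median Rank columns above are standard text-to-video retrieval metrics. As a rough illustration of how such numbers are typically derived from a text-video similarity matrix, here is a minimal sketch. The function name `retrieval_metrics`, the convention that query `i` matches video `i`, and the tie handling (only strictly higher scores push the rank down) are assumptions made for this example, not details taken from the STAN paper or its evaluation code.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Median Rank for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the ground-truth video: 1 + number of videos scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    return metrics


# Toy example with 4 queries and 4 videos (made-up scores).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],  # ground truth ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # ground truth ranked 2nd
    [0.1, 0.2, 0.7, 0.3],  # ground truth ranked 1st
    [0.6, 0.5, 0.4, 0.3],  # ground truth ranked 4th
]
toy_metrics = retrieval_metrics(toy_sim, ks=(1, 2))
```

Under these assumptions, higher R@K is better and lower Median Rank is better, so a Median Rank of 1 means the correct video is ranked first for at least half of the queries.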

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/revisiting-temporal-modeling-for-clip-based/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=revisiting-temporal-modeling-for-clip-based)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/revisiting-temporal-modeling-for-clip-based/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=revisiting-temporal-modeling-for-clip-based)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/revisiting-temporal-modeling-for-clip-based/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=revisiting-temporal-modeling-for-clip-based)`

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

CVPR 2023 · Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on both cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both the high-level and low-level knowledge in the CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary Network (STAN) -- a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transferring, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at https://github.com/farewellthree/STAN
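
To make the phrase "decomposed spatial-temporal modules" more concrete, the following is a generic sketch of factorized attention in PyTorch: tokens first attend within their own frame (spatial), then each patch position attends across frames (temporal). This is not the authors' STAN implementation; the class name `DecomposedSpatialTemporalBlock`, the pre-norm residual layout, and the tensor shape `(batch, frames, patches, dim)` are all assumptions chosen for illustration. For the actual architecture, see the official repository linked in the abstract.

```python
import torch
from torch import nn


class DecomposedSpatialTemporalBlock(nn.Module):
    """Hypothetical block: spatial self-attention per frame, then temporal
    self-attention per patch position. Illustrative only, not STAN itself."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim), e.g. per-frame features from an image encoder.
        b, t, n, d = x.shape

        # Spatial step: fold frames into the batch so each frame attends over its own patches.
        s = x.reshape(b * t, n, d)
        h = self.spatial_norm(s)
        s = s + self.spatial_attn(h, h, h, need_weights=False)[0]

        # Temporal step: fold patches into the batch so each patch position attends over frames.
        tm = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.temporal_norm(tm)
        tm = tm + self.temporal_attn(h, h, h, need_weights=False)[0]

        # Restore the original (batch, frames, patches, dim) layout.
        return tm.reshape(b, n, t, d).permute(0, 2, 1, 3)


# Usage with made-up sizes: 2 clips, 4 frames, 5 patch tokens, 16-dim features.
block = DecomposedSpatialTemporalBlock(dim=16, heads=4)
features = torch.randn(2, 4, 5, 16)
contextualized = block(features)
```

Splitting attention this way keeps each attention call small (sequence length is either the number of patches or the number of frames) compared with joint attention over all frame-patch tokens at once.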

Code

farewellthree/stan official

Tasks

Representation Learning

Retrieval

Text Retrieval

Video Recognition

Video Retrieval

Video-Text Retrieval

Datasets

Kinetics

Kinetics 400

MSR-VTT

Something-Something V2

DiDeMo

LSMDC

Results from the Paper

Ranked #7 on Video Retrieval on MSR-VTT-1kA (using extra training data)

Methods

CLIP • fail
