TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	ActivityNet	TESTA (ViT-B/16)	text-to-video R@1	54.8	# 11
Video Retrieval	ActivityNet	TESTA (ViT-B/16)	text-to-video R@5	80.8	# 9
Video Retrieval	ActivityNet	TESTA (ViT-B/16)	text-to-video R@10	89.6	# 9
Video Question Answering	ActivityNet-QA	TESTA (ViT-B/16)	Accuracy	45	# 17
Video Retrieval	Condensed Movies	TESTA (ViT-B/16)	text-to-video R@1	24.9	# 1
Video Retrieval	Condensed Movies	TESTA (ViT-B/16)	text-to-video R@5	46.5	# 1
Video Retrieval	Condensed Movies	TESTA (ViT-B/16)	text-to-video R@10	55.1	# 1
Video Retrieval	DiDeMo	TESTA (ViT-B/16)	text-to-video R@1	61.2	# 7
Video Retrieval	DiDeMo	TESTA (ViT-B/16)	text-to-video R@5	87.2	# 4
Video Retrieval	DiDeMo	TESTA (ViT-B/16)	text-to-video R@10	91.5	# 3
Video Retrieval	QuerYD	TESTA (ViT-B/16)	text-to-video R@1	83.4	# 1
Video Retrieval	QuerYD	TESTA (ViT-B/16)	text-to-video R@5	93.8	# 1
Video Retrieval	QuerYD	TESTA (ViT-B/16)	text-to-video R@10	95.3	# 1

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

29 Oct 2023 · Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain a massive number of visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movies.
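
To make the idea of aggregating similar frames and then similar patches concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the even/odd bipartite matching, the cosine similarity, the averaging rule, and every name (`aggregate_tokens`, `aggregate_video`, `r_frames`, `r_patches`) are assumptions chosen for illustration. In this toy clip, halving both the frame axis and the patch axis happens to give a 75% token reduction; the abstract does not say how the paper arrives at its 75% figure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def aggregate_tokens(tokens, r):
    """Merge the r most similar (A, B) pairs by averaging.

    Assumption for this sketch: A is the even-indexed tokens, B the
    odd-indexed ones, and each A token may merge only into its best B match.
    Returns len(tokens) - r tokens, keeping the original order.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    r = max(0, min(r, len(a_idx)))
    if r == 0 or not b_idx:
        return [list(t) for t in tokens]
    # For every A token, record its most similar B token and the score.
    edges = []
    for i in a_idx:
        j = max(b_idx, key=lambda b: cosine(tokens[i], tokens[b]))
        edges.append((cosine(tokens[i], tokens[j]), i, j))
    edges.sort(key=lambda e: e[0], reverse=True)
    merged_into = {i: j for _, i, j in edges[:r]}
    # Accumulate merged A tokens into their B targets, then average.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for i, j in merged_into.items():
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1
    out = []
    for k in range(len(tokens)):
        if k in merged_into:
            continue  # this token was absorbed into a B token
        if k in sums:
            out.append([s / counts[k] for s in sums[k]])
        else:
            out.append(list(tokens[k]))
    return out


def aggregate_video(video, r_frames, r_patches):
    """Temporal step (merge similar frames), then spatial step (merge patches)."""
    dim = len(video[0][0])
    # Treat each frame as one long vector so whole frames can be compared.
    flat = [[x for patch in frame for x in patch] for frame in video]
    flat = aggregate_tokens(flat, r_frames)
    frames = [[f[k:k + dim] for k in range(0, len(f), dim)] for f in flat]
    return [aggregate_tokens(frame, r_patches) for frame in frames]


# Toy clip: 4 frames x 4 patches x 2 feature dims, with redundant frames.
video = [
    [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]],
    [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]],  # same as frame 0
    [[0.0, 1.0], [0.1, 1.0], [1.0, 0.0], [1.0, 0.1]],
    [[0.0, 1.0], [0.1, 1.0], [1.0, 0.0], [1.0, 0.1]],  # same as frame 2
]
out = aggregate_video(video, r_frames=2, r_patches=2)
n_in = sum(len(f) for f in video)
n_out = sum(len(f) for f in out)
print(n_in, n_out)  # 16 4
```

Because attention cost grows with the number of tokens, shrinking 16 tokens to 4 in every encoder block is where a speed-up of the kind the abstract reports would come from; the actual schedule of how many frames and patches to merge per block is a design choice this sketch does not attempt to reproduce.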

Code

renshuhuai-andy/testa official

Tasks

Language Modelling

Retrieval

Video Question Answering

Video Retrieval

Video-Text Retrieval

Datasets

ActivityNet

ActivityNet Captions

DiDeMo

ActivityNet-QA

Condensed Movies

QuerYD

Results from the Paper

Ranked #1 on Video Retrieval on Condensed Movies (using extra training data)
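
The retrieval results for this paper are reported as text-to-video R@1, R@5, and R@10, i.e. the percentage of text queries whose correct video appears among the top K retrieved videos. The sketch below shows one common way to compute that metric from a query-by-video similarity matrix. It assumes one ground-truth video per query (query `i` matches video `i`) and counts ties in the query's favor; the function name `recall_at_k` and the toy scores are hypothetical, and the paper's exact evaluation protocol is not described on this page.

```python
def recall_at_k(similarity, k):
    """Text-to-video Recall@K in percent.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for 3 queries and 3 videos (made up for illustration).
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.7, 0.2],  # correct video ranked 2nd
    [0.4, 0.6, 0.5],  # correct video ranked 2nd
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(round(r1, 1), round(r2, 1))  # 33.3 100.0
```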
