TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	ActivityNet	TACo	text-to-video R@1	30.4	# 25
Video Retrieval	ActivityNet	TACo	text-to-video R@5	61.2	# 22
Video Retrieval	ActivityNet	TACo	text-to-video R@50	93.4	# 5
Video Retrieval	ActivityNet	TACo	text-to-video Median Rank	3.0	# 10
Action Segmentation	COIN	TACo	Frame accuracy	68.4	# 5
Temporal Action Localization	CrossTask	TACo	Recall	42.5	# 3
Video Retrieval	MSR-VTT	TACo	text-to-video R@1	24.8	# 29
Video Retrieval	MSR-VTT	TACo	text-to-video R@5	52.1	# 26
Video Retrieval	MSR-VTT	TACo	text-to-video R@10	64.0	# 24
Video Retrieval	MSR-VTT	TACo	text-to-video Median Rank	5	# 9
Zero-Shot Video Retrieval	MSR-VTT	TACo	text-to-video R@1	9.8	# 33
Zero-Shot Video Retrieval	MSR-VTT	TACo	text-to-video R@5	25.0	# 29
Zero-Shot Video Retrieval	MSR-VTT	TACo	text-to-video R@10	33.4	# 29
Video Retrieval	MSR-VTT-1kA	TACo	text-to-video R@1	28.4	# 48
Video Retrieval	MSR-VTT-1kA	TACo	text-to-video R@5	57.8	# 44
Video Retrieval	MSR-VTT-1kA	TACo	text-to-video R@10	71.2	# 46
Video Retrieval	MSR-VTT-1kA	TACo	text-to-video Median Rank	4	# 28
Video Retrieval	YouCook2	TACo	text-to-video Median Rank	4	# 3
Video Retrieval	YouCook2	TACo	text-to-video R@1	29.6	# 5
Video Retrieval	YouCook2	TACo	text-to-video R@10	72.7	# 5
Video Retrieval	YouCook2	TACo	text-to-video R@5	59.7	# 5
Zero-Shot Video Retrieval	YouCook2	TACo	text-to-video R@1	19.9	# 4
Zero-Shot Video Retrieval	YouCook2	TACo	text-to-video R@5	43.2	# 3
Zero-Shot Video Retrieval	YouCook2	TACo	text-to-video R@10	55.7	# 3
Zero-Shot Video Retrieval	YouCook2	TACo	text-to-video Mean Rank	8	# 1
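
The retrieval rows above report R@K (the percentage of text queries whose correct video appears in the top K results) and Median/Mean Rank (lower is better). As a rough illustration of how such numbers are typically derived, here is a minimal sketch. It assumes a square similarity matrix in which query i matches video i; the function name `retrieval_metrics` is hypothetical and is not taken from the paper or any evaluation toolkit.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-video retrieval metrics from a similarity matrix.

    sim[i, j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the correct video for each query, as a column vector.
    correct = np.diag(sim)[:, None]
    # Rank of the correct video = 1 + number of videos scored strictly higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    metrics["Mean Rank"] = float(np.mean(ranks))
    return metrics

if __name__ == "__main__":
    # Toy scores for 3 queries and 3 videos (made-up values, not paper data).
    toy = [[0.9, 0.1, 0.2],
           [0.8, 0.3, 0.5],
           [0.1, 0.2, 0.7]]
    print(retrieval_metrics(toy))
```

Ties are resolved optimistically here (only strictly higher scores push the correct video down); a published evaluation script may break ties differently.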


TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

ICCV 2021 · Jianwei Yang, Yonatan Bisk, Jianfeng Gao

Contrastive learning has been widely used to train transformer-based vision-language models for video-text alignment and multi-modal representation learning. This paper presents a new algorithm called Token-Aware Cascade contrastive learning (TACo) that improves contrastive learning using two novel techniques. The first is a token-aware contrastive loss, which is computed by taking into account the syntactic classes of words. This is motivated by the observation that, for a video-text pair, the content words in the text, such as nouns and verbs, are more likely to be aligned with the visual contents of the video than the function words. The second is a cascade sampling method, which generates a small set of hard negative examples for efficient loss estimation in the multi-modal fusion layers. To validate the effectiveness of TACo, in our experiments we finetune pretrained models on a set of downstream tasks including text-video retrieval (YouCook2, MSR-VTT and ActivityNet), video action step localization (CrossTask), and video action segmentation (COIN). The results show that our models attain consistent improvements over previous methods across different experimental settings, setting a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT and ActivityNet.
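
The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the loss formula or the sampling procedure, so the function names (`info_nce`, `content_token_mask`, `cascade_hard_negatives`), the POS tag set, and the temperature are all assumptions chosen for illustration. The sketch shows (1) a standard contrastive (InfoNCE-style) loss over a text-video similarity matrix, (2) selecting content words by part-of-speech tag, and (3) using a cheap similarity matrix to pick a few hard negatives per text, which is the kind of small candidate set a more expensive fusion stage could then score.

```python
import numpy as np

def info_nce(sim, temperature=0.07):
    """Text-to-video contrastive loss; matched pairs lie on the diagonal."""
    logits = np.asarray(sim, dtype=float) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def content_token_mask(pos_tags, content_tags=("NOUN", "VERB")):
    """Boolean mask marking content words (assumed tag set: nouns and verbs)."""
    return np.array([tag in content_tags for tag in pos_tags])

def cascade_hard_negatives(sim, k):
    """For each text i, indices of the k highest-scoring non-matching videos.

    `sim` is assumed to come from a cheap early stage; the returned
    indices are the hard negatives a later, costlier stage would see.
    """
    masked = np.array(sim, dtype=float)  # copy so the input is untouched
    np.fill_diagonal(masked, -np.inf)    # exclude the positive pair
    return np.argsort(-masked, axis=1)[:, :k]

if __name__ == "__main__":
    tags = ["DET", "NOUN", "VERB", "ADP"]  # hypothetical tagger output
    print(content_token_mask(tags))
    toy_sim = np.array([[0.9, 0.2, 0.8],
                        [0.1, 0.7, 0.3],
                        [0.6, 0.4, 0.5]])  # made-up scores
    print(cascade_hard_negatives(toy_sim, k=1))
    print(info_nce(toy_sim))
```

Restricting the expensive stage to k negatives per text keeps its cost linear in the batch size rather than quadratic, which is the efficiency argument the abstract makes for sampling hard negatives.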


Code


No code implementations yet.

Tasks


Action Segmentation

Contrastive Learning

Representation Learning

Retrieval

Temporal Action Localization

Video Retrieval

Zero-Shot Video Retrieval

Datasets

ActivityNet

MSR-VTT

HowTo100M

YouCook2

COIN

CrossTask

Results from the Paper


Ranked #3 on Temporal Action Localization on CrossTask (using extra training data)



Methods


Contrastive Learning
