COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
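As a rough illustration of the cross-modal cycle-consistency idea, the PyTorch-style sketch below cycles each sentence embedding to its soft nearest clip and back to the sentence axis, then penalizes how far the cycle lands from the starting index. This is a simplified sketch, not the released implementation: the function name, the squared-distance softmax, and the MSE penalty are assumptions made here for illustration; see the linked repository for the actual loss.

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(clip_emb: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
    """Cycle each sentence through the clip space and back (illustrative sketch).

    clip_emb: (N, D) clip embeddings of one video.
    sent_emb: (N, D) sentence embeddings of the paired paragraph,
              aligned so that sentence i describes clip i.
    """
    n = sent_emb.size(0)

    # Sentence -> clip: soft nearest neighbour over all clips.
    alpha = F.softmax(-torch.cdist(sent_emb, clip_emb) ** 2, dim=1)   # (N, N)
    soft_clip = alpha @ clip_emb                                      # (N, D)

    # Soft clip -> sentence: distribution over sentence positions.
    beta = F.softmax(-torch.cdist(soft_clip, sent_emb) ** 2, dim=1)   # (N, N)

    # The expected landing position of sentence i should be i itself.
    positions = torch.arange(n, dtype=sent_emb.dtype, device=sent_emb.device)
    soft_pos = beta @ positions                                       # (N,)
    return F.mse_loss(soft_pos, positions)
```

In the paper, this consistency term complements alignment losses between the clip/sentence and video/paragraph levels; the sketch above only shows the sentence-to-clip-to-sentence direction.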

Published at NeurIPS 2020.
Results

Video Captioning, ActivityNet Captions, COOT (ae-test split, appearance features only):
  BLEU-3    17.43   (rank #1)
  ROUGE-L   31.45   (rank #4)
  METEOR    15.99   (rank #3)
  BLEU-4    10.85   (rank #4)
  CIDEr     28.19   (rank #4)

Video Captioning, YouCook2, COOT:
  BLEU-3    17.97   (rank #3)
  BLEU-4    11.30   (rank #8)
  METEOR    19.85   (rank #3)
  ROUGE-L   37.94   (rank #6)
  CIDEr      0.57   (rank #11)

Video Retrieval, YouCook2, COOT (text-to-video):
  Median Rank    9     (rank #6)
  R@1           16.7   (rank #10)
  R@10          52.3   (rank #12)
