COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose the Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and to model the interactions between different levels of granularity and between different modalities.
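The hierarchy the abstract describes (frames aggregate into clips, clips into a video; words into sentences, sentences into a paragraph) can be illustrated with a minimal sketch. Note this is not COOT itself: COOT uses transformer-based, attention-aware feature aggregation, whereas the mean pooling and the shared 64-dimensional embedding space below are simplifying assumptions made only to show the granularity levels and the top-level cross-modal matching.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Unit-normalize embeddings so the dot product is cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def pool(sequences):
    """Hypothetical aggregator: mean-pool each lower-level sequence into one
    higher-level embedding (COOT replaces this with attention-aware pooling)."""
    return np.stack([seq.mean(axis=0) for seq in sequences])

rng = np.random.default_rng(0)

# Visual hierarchy: 3 clips, each with 8 frame embeddings of dimension 64.
frames_per_clip = [rng.standard_normal((8, 64)) for _ in range(3)]
clip_embs = pool(frames_per_clip)      # clip level, shape (3, 64)
video_emb = clip_embs.mean(axis=0)     # video level, shape (64,)

# Cross-modal matching at the top level: cosine similarity between the video
# embedding and a paragraph embedding, assumed to live in the same space.
paragraph_emb = rng.standard_normal(64)
sim = float(l2_normalize(video_emb) @ l2_normalize(paragraph_emb))
```

In a retrieval setting such as the YouCook2 benchmark below, similarities like `sim` would be computed against every candidate and ranked to produce metrics such as R@1 and Median Rank.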

NeurIPS 2020
Task             | Dataset              | Model                                          | Metric                      | Value | Global Rank
Video Captioning | ActivityNet Captions | COOT (ae-test split, appearance features only) | BLEU-3                      | 17.43 | #1
Video Captioning | ActivityNet Captions | COOT (ae-test split, appearance features only) | BLEU-4                      | 10.85 | #1
Video Captioning | ActivityNet Captions | COOT (ae-test split, appearance features only) | ROUGE-L                     | 31.45 | #1
Video Captioning | ActivityNet Captions | COOT (ae-test split, appearance features only) | METEOR                      | 15.99 | #1
Video Captioning | ActivityNet Captions | COOT (ae-test split, appearance features only) | CIDEr                       | 28.19 | #1
Video Retrieval  | YouCook2             | COOT                                           | Median Rank (text-to-video) | 9     | #2
Video Retrieval  | YouCook2             | COOT                                           | R@1 (text-to-video)         | 16.7  | #2
Video Retrieval  | YouCook2             | COOT                                           | R@10 (text-to-video)        | 52.3  | #2
Video Captioning | YouCook2             | COOT                                           | BLEU-3                      | 17.97 | #2
Video Captioning | YouCook2             | COOT                                           | BLEU-4                      | 11.30 | #3
Video Captioning | YouCook2             | COOT                                           | METEOR                      | 19.85 | #2
Video Captioning | YouCook2             | COOT                                           | ROUGE-L                     | 37.94 | #3
Video Captioning | YouCook2             | COOT                                           | CIDEr                       | 0.57  | #3

Methods used in the Paper


Method                       | Type
Adam                         | Stochastic Optimization
Residual Connection          | Skip Connections
Dropout                      | Regularization
Multi-Head Attention         | Attention Modules
BPE                          | Subword Segmentation
Softmax                      | Output Functions
Dense Connections            | Feedforward Networks
Label Smoothing              | Regularization
Layer Normalization          | Normalization
Scaled Dot-Product Attention | Attention Mechanisms
Transformer                  | Transformers
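Scaled dot-product attention, listed above, is the core operation inside the multi-head attention blocks of any transformer. The sketch below is the generic formulation (softmax(QK^T / sqrt(d)) V), not COOT's specific implementation; the toy shapes are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Generic scaled dot-product attention.

    q: queries, shape (n_q, d)
    k: keys,    shape (n_k, d)
    v: values,  shape (n_k, d_v)
    Returns one weighted average of the value rows per query, shape (n_q, d_v).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ v

# With identical keys, every value gets equal weight, so the output for each
# query is simply the mean of the value rows.
q = np.ones((2, 4))
k = np.ones((3, 4))
v = np.arange(6.0).reshape(3, 2)
out = scaled_dot_product_attention(q, k, v)          # each row equals v.mean(axis=0)
```

Multi-head attention runs several such attentions in parallel on learned linear projections of q, k, and v, then concatenates the results.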