COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchical information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer, which learns the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss, which connects video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at
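To make the idea of a cross-modal cycle-consistency loss more concrete, the following is a minimal sketch in plain Python. It assumes one common soft-nearest-neighbor formulation: each sentence embedding is mapped to a weighted average of clip embeddings, cycled back to the sentences, and penalized by how far the resulting soft index lands from where it started. The function names (`soft_nearest_neighbor`, `cycle_consistency_loss`) and the toy embeddings are hypothetical and chosen for illustration; this is not taken from the authors' code and may differ from the exact loss used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates; closer candidates get larger weights."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(sentences, clips):
    """Assumed formulation: sentence -> soft clip neighbor -> soft sentence index.

    The loss is the mean squared gap between each sentence's true index
    and the soft index obtained after cycling through the clip space.
    """
    total = 0.0
    for i, s in enumerate(sentences):
        clip_nn = soft_nearest_neighbor(s, clips)
        back = softmax([-sq_dist(clip_nn, t) for t in sentences])
        soft_index = sum(w * k for k, w in enumerate(back))
        total += (i - soft_index) ** 2
    return total / len(sentences)


# Toy example: three well-separated sentence embeddings.
sentences = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
aligned_clips = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
collapsed_clips = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

loss_aligned = cycle_consistency_loss(sentences, aligned_clips)
loss_collapsed = cycle_consistency_loss(sentences, collapsed_clips)
```

In this toy setup the aligned clips cycle back to their own sentences, so the loss is close to zero, while collapsed clip embeddings send every sentence back to index 0 and are penalized. Note that the loss only needs the ordering of sentences and clips within a video, not explicit clip-sentence labels.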

NeurIPS 2020
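| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Captioning | ActivityNet Captions | COOT (ae-test split) - Only Appearance features | BLEU-3 | 17.43 | # 1 |
| Video Captioning | ActivityNet Captions | COOT (ae-test split) - Only Appearance features | ROUGE-L | 31.45 | # 4 |
| Video Captioning | ActivityNet Captions | COOT (ae-test split) - Only Appearance features | METEOR | 15.99 | # 3 |
| Video Captioning | ActivityNet Captions | COOT (ae-test split) - Only Appearance features | BLEU-4 | 10.85 | # 4 |
| Video Captioning | ActivityNet Captions | COOT (ae-test split) - Only Appearance features | CIDEr | 28.19 | # 4 |
| Video Captioning | YouCook2 | COOT | BLEU-3 | 17.97 | # 3 |
| Video Captioning | YouCook2 | COOT | BLEU-4 | 11.30 | # 8 |
| Video Captioning | YouCook2 | COOT | METEOR | 19.85 | # 3 |
| Video Captioning | YouCook2 | COOT | ROUGE-L | 37.94 | # 6 |
| Video Captioning | YouCook2 | COOT | CIDEr | 0.57 | # 9 |
| Video Retrieval | YouCook2 | COOT | text-to-video Median Rank | 9 | # 6 |
| Video Retrieval | YouCook2 | COOT | text-to-video R@1 | 16.7 | # 10 |
| Video Retrieval | YouCook2 | COOT | text-to-video R@10 | 52.3 | # 12 |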
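The retrieval metrics reported above (R@1, R@10, Median Rank) can be illustrated with a short sketch. It assumes the usual convention that text query i has exactly one correct video, at index i, and that ranks are computed from a text-by-video similarity matrix. The helper names (`retrieval_ranks`, `recall_at_k`, `median_rank`) and the toy similarity values are hypothetical; the benchmark's own evaluation code may handle ties or multiple ground-truth items differently.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Rank of the correct video for each text query (1 = best).

    similarity[i][j] is the score between text i and video j;
    the ground-truth match for text i is assumed to be video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Count videos that score strictly higher than the correct one.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video; lower is better."""
    return median(ranks)


# Toy similarity matrix for three text queries and three videos.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.5],
]

ranks = retrieval_ranks(similarity)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
med_r = median_rank(ranks)
```

Higher recall values are better, while a lower median rank is better, which is why the two kinds of numbers in the table move in opposite directions as a model improves.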