COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having fewer parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
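To illustrate the third component, below is a minimal PyTorch sketch of a cross-modal cycle-consistency loss of the kind described above: each clip embedding is mapped to a soft nearest neighbor among the sentence embeddings and then cycled back to the clip sequence, penalizing clips that do not land on their own position. The tensor shapes, squared-Euclidean distance, and squared-index penalty are assumptions for illustration and may differ from the exact formulation in the paper and repository.

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(clip_emb: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical cross-modal cycle-consistency loss.

    clip_emb: (N, D) clip embeddings of one video.
    sent_emb: (M, D) sentence embeddings of the paired paragraph.
    """
    # Soft nearest neighbor of each clip among the sentence embeddings.
    dist_cs = torch.cdist(clip_emb, sent_emb) ** 2            # (N, M) squared distances
    alpha = F.softmax(-dist_cs, dim=1)                        # attention over sentences
    soft_sent = alpha @ sent_emb                               # (N, D) soft sentence neighbors

    # Cycle back from the soft sentence neighbors to the clip sequence.
    dist_sc = torch.cdist(soft_sent, clip_emb) ** 2            # (N, N)
    beta = F.softmax(-dist_sc, dim=1)                          # attention over clips
    positions = torch.arange(clip_emb.size(0), dtype=clip_emb.dtype, device=clip_emb.device)
    soft_pos = beta @ positions                                # expected landing index per clip

    # Penalize clips whose cycle does not return to their own index.
    return ((soft_pos - positions) ** 2).mean()
```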

Benchmark Results

Video Captioning on ActivityNet Captions
Model: COOT (ae-test split), appearance features only
  Metric    Value   Global Rank
  BLEU-3    17.43   #1
  ROUGE-L   31.45   #4
  METEOR    15.99   #3
  BLEU-4    10.85   #4
  CIDEr     28.19   #4

Video Captioning on YouCook2
Model: COOT
  Metric    Value   Global Rank
  BLEU-3    17.97   #3
  BLEU-4    11.30   #8
  METEOR    19.85   #3
  ROUGE-L   37.94   #6
  CIDEr     0.57    #12

Video Retrieval on YouCook2
Model: COOT
  Metric                      Value   Global Rank
  text-to-video Median Rank   9       #6
  text-to-video R@1           16.7    #10
  text-to-video R@10          52.3    #12
