Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

7 Apr 2022 · Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively cluster correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
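The hierarchical contrastive learning described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the max-then-mean token aggregation, the temperature of 0.05, and the equal weighting of the three levels are all assumptions made for illustration, and adaptive label denoising and marginal sample enhancement are not modeled.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Unit-normalize the last dimension so dot products are cosines.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fine_grained_similarity(text_tokens, video_tokens):
    """text_tokens: (B, Nt, D), video_tokens: (B, Nv, D) -> (B, B) matrix.

    Assumed aggregation (not taken from the paper): each text token picks
    its best-matching video token, and the picks are averaged.
    """
    t = l2_normalize(text_tokens)
    v = l2_normalize(video_tokens)
    # pairwise[a, b, i, j]: cosine of token i in text a and token j in video b
    pairwise = np.einsum("aid,bjd->abij", t, v)
    return pairwise.max(axis=3).mean(axis=2)

def symmetric_info_nce(sim, temperature=0.05):
    """sim[i, j]: similarity of text i and video j; matches lie on the diagonal."""
    logits = sim / temperature
    text_to_video = -np.diag(log_softmax(logits, axis=1)).mean()
    video_to_text = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (text_to_video + video_to_text)

def hierarchical_contrastive_loss(levels, temperature=0.05):
    """levels: list of (text_tokens, video_tokens) pairs, one per granularity,
    e.g. (words, frames), (phrases, clips), (sentence, video).
    Levels are weighted equally here, which is an assumption."""
    losses = [
        symmetric_info_nce(fine_grained_similarity(t, v), temperature)
        for t, v in levels
    ]
    return sum(losses) / len(losses)
```

At the video-sentence level each sample has a single token, so the inputs have shape (B, 1, D) and the similarity reduces to a plain cosine between the two global embeddings.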


Results from the Paper


 Ranked #1 on Video Retrieval on MSR-VTT-1kA (using extra training data)

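Task             Dataset      Model               Metric                     Value  Global Rank
Video Retrieval  ActivityNet  HunYuan_tvr         text-to-video R@1          57.3   #9
Video Retrieval  ActivityNet  HunYuan_tvr         text-to-video R@5          84.8   #6
Video Retrieval  ActivityNet  HunYuan_tvr         text-to-video R@10         93.1   #5
Video Retrieval  ActivityNet  HunYuan_tvr         text-to-video Median Rank  1      #1
Video Retrieval  ActivityNet  HunYuan_tvr         text-to-video Mean Rank    4.0    #3
Video Retrieval  ActivityNet  HunYuan_tvr         video-to-text R@1          57.7   #5
Video Retrieval  ActivityNet  HunYuan_tvr         video-to-text R@5          85.7   #3
Video Retrieval  ActivityNet  HunYuan_tvr         video-to-text R@10         93.9   #3
Video Retrieval  ActivityNet  HunYuan_tvr         video-to-text Median Rank  1      #1
Video Retrieval  ActivityNet  HunYuan_tvr         video-to-text Mean Rank    3.4    #3
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  text-to-video R@1          52.7   #18
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  text-to-video R@5          77.8   #20
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  text-to-video R@10         85.2   #19
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  text-to-video Median Rank  1.0    #1
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  text-to-video Mean Rank    13.7   #8
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  video-to-text R@1          54.1   #6
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  video-to-text R@5          78.3   #5
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  video-to-text R@10         86.8   #5
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  video-to-text Median Rank  1.0    #1
Video Retrieval  DiDeMo       HunYuan_tvr (huge)  video-to-text Mean Rank    9.1    #5
Video Retrieval  DiDeMo       HunYuan_tvr         text-to-video R@1          52.1   #21
Video Retrieval  DiDeMo       HunYuan_tvr         text-to-video R@5          78.2   #19
Video Retrieval  DiDeMo       HunYuan_tvr         text-to-video R@10         85.7   #15
Video Retrieval  DiDeMo       HunYuan_tvr         text-to-video Median Rank  1      #1
Video Retrieval  DiDeMo       HunYuan_tvr         text-to-video Mean Rank    11.1   #3
Video Retrieval  DiDeMo       HunYuan_tvr         video-to-text R@1          54.8   #5
Video Retrieval  DiDeMo       HunYuan_tvr         video-to-text R@5          79.9   #3
Video Retrieval  DiDeMo       HunYuan_tvr         video-to-text R@10         87.2   #4
Video Retrieval  DiDeMo       HunYuan_tvr         video-to-text Median Rank  1      #1
Video Retrieval  DiDeMo       HunYuan_tvr         video-to-text Mean Rank    7.1    #1
Video Retrieval  LSMDC        HunYuan_tvr         text-to-video R@1          29.7   #10
Video Retrieval  LSMDC        HunYuan_tvr         text-to-video R@5          46.4   #10
Video Retrieval  LSMDC        HunYuan_tvr         text-to-video R@10         55.4   #11
Video Retrieval  LSMDC        HunYuan_tvr         text-to-video Median Rank  7      #5
Video Retrieval  LSMDC        HunYuan_tvr         text-to-video Mean Rank    56.4   #8
Video Retrieval  LSMDC        HunYuan_tvr         video-to-text R@1          30.1   #6
Video Retrieval  LSMDC        HunYuan_tvr         video-to-text R@5          47.5   #4
Video Retrieval  LSMDC        HunYuan_tvr         video-to-text R@10         55.7   #5
Video Retrieval  LSMDC        HunYuan_tvr         video-to-text Median Rank  7      #2
Video Retrieval  LSMDC        HunYuan_tvr         video-to-text Mean Rank    48.9   #7
Video Retrieval  LSMDC        HunYuan_tvr (huge)  text-to-video R@1          40.4   #4
Video Retrieval  LSMDC        HunYuan_tvr (huge)  text-to-video R@5          80.1   #1
Video Retrieval  LSMDC        HunYuan_tvr (huge)  text-to-video R@10         92.8   #1
Video Retrieval  LSMDC        HunYuan_tvr (huge)  text-to-video Median Rank  2.0    #1
Video Retrieval  LSMDC        HunYuan_tvr (huge)  text-to-video Mean Rank    3.9    #1
Video Retrieval  LSMDC        HunYuan_tvr (huge)  video-to-text R@1          34.6   #5
Video Retrieval  LSMDC        HunYuan_tvr (huge)  video-to-text R@5          71.8   #1
Video Retrieval  LSMDC        HunYuan_tvr (huge)  video-to-text R@10         91.8   #1
Video Retrieval  LSMDC        HunYuan_tvr (huge)  video-to-text Median Rank  2.0    #1
Video Retrieval  LSMDC        HunYuan_tvr (huge)  video-to-text Mean Rank    4.3    #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr         text-to-video R@1          55.0   #5
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr         video-to-text R@1          55.5   #4
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr         video-to-text R@5          78.4   #5
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr         video-to-text R@10         85.8   #6
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr         video-to-text Median Rank  1.0    #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr         video-to-text Mean Rank    7.7    #7
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  text-to-video R@1          62.9   #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  text-to-video R@5          84.5   #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  text-to-video R@10         90.8   #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  text-to-video Median Rank  1.0    #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  text-to-video Mean Rank    9.3    #3
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  video-to-text R@1          64.8   #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  video-to-text R@5          84.9   #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  video-to-text R@10         91.1   #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  video-to-text Median Rank  1.0    #1
Video Retrieval  MSR-VTT-1kA  HunYuan_tvr (huge)  video-to-text Mean Rank    5.5    #3
Video Retrieval  MSVD         HunYuan_tvr         text-to-video R@1          58.2   #4
Video Retrieval  MSVD         HunYuan_tvr         text-to-video R@5          83.5   #5
Video Retrieval  MSVD         HunYuan_tvr         text-to-video R@10         90.1   #2
Video Retrieval  MSVD         HunYuan_tvr         text-to-video Median Rank  1      #1
Video Retrieval  MSVD         HunYuan_tvr         text-to-video Mean Rank    7.8    #2
Video Retrieval  MSVD         HunYuan_tvr         video-to-text R@1          69.1   #7
Video Retrieval  MSVD         HunYuan_tvr         video-to-text R@5          91.5   #5
Video Retrieval  MSVD         HunYuan_tvr         video-to-text R@10         95.0   #5
Video Retrieval  MSVD         HunYuan_tvr         video-to-text Median Rank  1.0    #1
Video Retrieval  MSVD         HunYuan_tvr         video-to-text Mean Rank    3.8    #6
Video Retrieval  MSVD         HunYuan_tvr (huge)  text-to-video R@1          59.0   #2
Video Retrieval  MSVD         HunYuan_tvr (huge)  text-to-video R@5          84.0   #2
Video Retrieval  MSVD         HunYuan_tvr (huge)  text-to-video R@10         90.3   #1
Video Retrieval  MSVD         HunYuan_tvr (huge)  text-to-video Median Rank  1.0    #1
Video Retrieval  MSVD         HunYuan_tvr (huge)  text-to-video Mean Rank    7.6    #1
Video Retrieval  MSVD         HunYuan_tvr (huge)  video-to-text R@1          73.0   #4
Video Retrieval  MSVD         HunYuan_tvr (huge)  video-to-text R@5          94.5   #1
Video Retrieval  MSVD         HunYuan_tvr (huge)  video-to-text R@10         96.6   #2
Video Retrieval  MSVD         HunYuan_tvr (huge)  video-to-text Median Rank  1.0    #1
Video Retrieval  MSVD         HunYuan_tvr (huge)  video-to-text Mean Rank    7.6    #10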
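
The metrics reported above (R@1, R@5, R@10, Median Rank, Mean Rank) are commonly computed from a query-by-candidate similarity matrix. The sketch below shows one way to do so. The function name `retrieval_metrics` is hypothetical, and the code assumes exactly one ground-truth candidate per query, placed on the diagonal. Benchmarks with several captions per video need a different ground-truth mapping, so this is not necessarily the protocol behind the numbers in the table.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j]: score of query i against candidate j.

    Assumes the correct candidate for query i is candidate i.
    """
    sim = np.asarray(sim, dtype=float)
    ground_truth = np.diag(sim)[:, None]
    # Rank of the correct candidate: 1 + number of candidates scored
    # strictly higher than it for the same query.
    ranks = (sim > ground_truth).sum(axis=1) + 1
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    metrics["Mean Rank"] = float(np.mean(ranks))
    return metrics
```

With texts as rows and videos as columns, `retrieval_metrics(sim)` gives the text-to-video numbers and `retrieval_metrics(sim.T)` gives the video-to-text numbers. Higher is better for R@K and lower is better for the two rank metrics.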

Methods