Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering the intrinsic semantic relations among frames, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
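To make the idea of hierarchical contrastive learning concrete, the following is a minimal NumPy sketch of a three-level contrastive objective (frame-word, clip-phrase, video-sentence). It is not the authors' implementation: the function names, the max-then-mean token aggregation, the temperature of 0.05, and the equal weighting of the three levels are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np

def l2norm(x):
    # Normalise the last axis so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE on a (B, B) similarity matrix.
    Matched text-video pairs are assumed to lie on the diagonal."""
    logits = sim / temperature

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def fine_grained_sim(text_tokens, video_tokens):
    """text_tokens: (B, Nt, D), video_tokens: (B, Nv, D) -> (B, B).
    Assumed aggregation: each text token takes its best-matching
    video token, and the scores are averaged over text tokens."""
    t, v = l2norm(text_tokens), l2norm(video_tokens)
    pairwise = np.einsum("ind,jmd->ijnm", t, v)  # (B, B, Nt, Nv)
    return pairwise.max(axis=3).mean(axis=2)

def hierarchical_loss(words, phrases, sentence, frames, clips, video):
    # Hypothetical equal-weight sum of the three levels.
    frame_word = info_nce(fine_grained_sim(words, frames))
    clip_phrase = info_nce(fine_grained_sim(phrases, clips))
    video_sentence = info_nce(l2norm(sentence) @ l2norm(video).T)
    return frame_word + clip_phrase + video_sentence
```

In this sketch, a batch whose text and video features are correctly paired yields a lower loss than the same batch with the videos shifted by one position, which is the behaviour a contrastive objective is expected to show.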
Results from the Paper
Ranked #1 on Video Retrieval on MSR-VTT-1kA (using extra training data)
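Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Video Retrieval | ActivityNet | HunYuan_tvr | text-to-video R@1 | 57.3 | # 9
Video Retrieval | ActivityNet | HunYuan_tvr | text-to-video R@5 | 84.8 | # 6
Video Retrieval | ActivityNet | HunYuan_tvr | text-to-video R@10 | 93.1 | # 5
Video Retrieval | ActivityNet | HunYuan_tvr | text-to-video Median Rank | 1 | # 1
Video Retrieval | ActivityNet | HunYuan_tvr | text-to-video Mean Rank | 4.0 | # 3
Video Retrieval | ActivityNet | HunYuan_tvr | video-to-text R@1 | 57.7 | # 5
Video Retrieval | ActivityNet | HunYuan_tvr | video-to-text R@5 | 85.7 | # 3
Video Retrieval | ActivityNet | HunYuan_tvr | video-to-text R@10 | 93.9 | # 3
Video Retrieval | ActivityNet | HunYuan_tvr | video-to-text Median Rank | 1 | # 1
Video Retrieval | ActivityNet | HunYuan_tvr | video-to-text Mean Rank | 3.4 | # 3
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | text-to-video R@1 | 52.7 | # 18
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | text-to-video R@5 | 77.8 | # 20
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | text-to-video R@10 | 85.2 | # 19
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | text-to-video Median Rank | 1.0 | # 1
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | text-to-video Mean Rank | 13.7 | # 8
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | video-to-text R@1 | 54.1 | # 6
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | video-to-text R@10 | 86.8 | # 5
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | video-to-text Median Rank | 1.0 | # 1
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | video-to-text Mean Rank | 9.1 | # 5
Video Retrieval | DiDeMo | HunYuan_tvr (huge) | video-to-text R@5 | 78.3 | # 5
Video Retrieval | DiDeMo | HunYuan_tvr | text-to-video R@1 | 52.1 | # 21
Video Retrieval | DiDeMo | HunYuan_tvr | text-to-video R@5 | 78.2 | # 19
Video Retrieval | DiDeMo | HunYuan_tvr | text-to-video R@10 | 85.7 | # 15
Video Retrieval | DiDeMo | HunYuan_tvr | text-to-video Median Rank | 1 | # 1
Video Retrieval | DiDeMo | HunYuan_tvr | text-to-video Mean Rank | 11.1 | # 3
Video Retrieval | DiDeMo | HunYuan_tvr | video-to-text R@1 | 54.8 | # 5
Video Retrieval | DiDeMo | HunYuan_tvr | video-to-text R@10 | 87.2 | # 4
Video Retrieval | DiDeMo | HunYuan_tvr | video-to-text Median Rank | 1 | # 1
Video Retrieval | DiDeMo | HunYuan_tvr | video-to-text Mean Rank | 7.1 | # 1
Video Retrieval | DiDeMo | HunYuan_tvr | video-to-text R@5 | 79.9 | # 3
Video Retrieval | LSMDC | HunYuan_tvr | text-to-video R@1 | 29.7 | # 10
Video Retrieval | LSMDC | HunYuan_tvr | text-to-video R@5 | 46.4 | # 10
Video Retrieval | LSMDC | HunYuan_tvr | text-to-video R@10 | 55.4 | # 11
Video Retrieval | LSMDC | HunYuan_tvr | text-to-video Median Rank | 7 | # 5
Video Retrieval | LSMDC | HunYuan_tvr | video-to-text R@1 | 30.1 | # 6
Video Retrieval | LSMDC | HunYuan_tvr | video-to-text R@5 | 47.5 | # 4
Video Retrieval | LSMDC | HunYuan_tvr | video-to-text R@10 | 55.7 | # 5
Video Retrieval | LSMDC | HunYuan_tvr | video-to-text Median Rank | 7 | # 2
Video Retrieval | LSMDC | HunYuan_tvr | text-to-video Mean Rank | 56.4 | # 8
Video Retrieval | LSMDC | HunYuan_tvr | video-to-text Mean Rank | 48.9 | # 7
Video Retrieval | LSMDC | HunYuan_tvr (huge) | text-to-video R@1 | 40.4 | # 4
Video Retrieval | LSMDC | HunYuan_tvr (huge) | text-to-video R@5 | 80.1 | # 1
Video Retrieval | LSMDC | HunYuan_tvr (huge) | text-to-video R@10 | 92.8 | # 1
Video Retrieval | LSMDC | HunYuan_tvr (huge) | text-to-video Median Rank | 2.0 | # 1
Video Retrieval | LSMDC | HunYuan_tvr (huge) | video-to-text R@1 | 34.6 | # 5
Video Retrieval | LSMDC | HunYuan_tvr (huge) | video-to-text R@5 | 71.8 | # 1
Video Retrieval | LSMDC | HunYuan_tvr (huge) | video-to-text R@10 | 91.8 | # 1
Video Retrieval | LSMDC | HunYuan_tvr (huge) | video-to-text Median Rank | 2.0 | # 1
Video Retrieval | LSMDC | HunYuan_tvr (huge) | text-to-video Mean Rank | 3.9 | # 1
Video Retrieval | LSMDC | HunYuan_tvr (huge) | video-to-text Mean Rank | 4.3 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr | text-to-video R@1 | 55.0 | # 5
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr | video-to-text R@1 | 55.5 | # 4
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr | video-to-text R@5 | 78.4 | # 5
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr | video-to-text R@10 | 85.8 | # 6
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr | video-to-text Median Rank | 1.0 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr | video-to-text Mean Rank | 7.7 | # 7
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | text-to-video Mean Rank | 9.3 | # 3
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | text-to-video R@1 | 62.9 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | text-to-video R@5 | 84.5 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | text-to-video R@10 | 90.8 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | text-to-video Median Rank | 1.0 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | video-to-text R@1 | 64.8 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | video-to-text R@5 | 84.9 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | video-to-text R@10 | 91.1 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | video-to-text Median Rank | 1.0 | # 1
Video Retrieval | MSR-VTT-1kA | HunYuan_tvr (huge) | video-to-text Mean Rank | 5.5 | # 3
Video Retrieval | MSVD | HunYuan_tvr | text-to-video R@1 | 58.2 | # 4
Video Retrieval | MSVD | HunYuan_tvr | text-to-video R@5 | 83.5 | # 5
Video Retrieval | MSVD | HunYuan_tvr | text-to-video R@10 | 90.1 | # 2
Video Retrieval | MSVD | HunYuan_tvr | text-to-video Median Rank | 1 | # 1
Video Retrieval | MSVD | HunYuan_tvr | text-to-video Mean Rank | 7.8 | # 2
Video Retrieval | MSVD | HunYuan_tvr | video-to-text R@1 | 69.1 | # 7
Video Retrieval | MSVD | HunYuan_tvr | video-to-text R@5 | 91.5 | # 5
Video Retrieval | MSVD | HunYuan_tvr | video-to-text R@10 | 95.0 | # 5
Video Retrieval | MSVD | HunYuan_tvr | video-to-text Median Rank | 1.0 | # 1
Video Retrieval | MSVD | HunYuan_tvr | video-to-text Mean Rank | 3.8 | # 6
Video Retrieval | MSVD | HunYuan_tvr (huge) | text-to-video R@1 | 59.0 | # 2
Video Retrieval | MSVD | HunYuan_tvr (huge) | text-to-video R@5 | 84.0 | # 2
Video Retrieval | MSVD | HunYuan_tvr (huge) | text-to-video R@10 | 90.3 | # 1
Video Retrieval | MSVD | HunYuan_tvr (huge) | text-to-video Median Rank | 1.0 | # 1
Video Retrieval | MSVD | HunYuan_tvr (huge) | text-to-video Mean Rank | 7.6 | # 1
Video Retrieval | MSVD | HunYuan_tvr (huge) | video-to-text R@1 | 73.0 | # 4
Video Retrieval | MSVD | HunYuan_tvr (huge) | video-to-text R@5 | 94.5 | # 1
Video Retrieval | MSVD | HunYuan_tvr (huge) | video-to-text R@10 | 96.6 | # 2
Video Retrieval | MSVD | HunYuan_tvr (huge) | video-to-text Median Rank | 1.0 | # 1
Video Retrieval | MSVD | HunYuan_tvr (huge) | video-to-text Mean Rank | 7.6 | # 10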
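
For readers unfamiliar with the metric names used in the results, the sketch below shows how R@K, Median Rank, and Mean Rank are commonly derived from a query-by-candidate similarity matrix. It assumes one ground-truth candidate per query, located on the diagonal; the function name `retrieval_metrics` is hypothetical, and the exact evaluation protocol behind the reported numbers may differ per dataset.

```python
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j] is the similarity of query i to candidate j.
    Assumption: the correct candidate for query i is candidate i."""
    sim = np.asarray(sim, dtype=float)
    ground_truth = np.diag(sim)[:, None]
    # 1-based rank of the correct candidate: one plus the number of
    # candidates that score strictly higher than it.
    ranks = 1 + (sim > ground_truth).sum(axis=1)
    return {
        "R@1": 100.0 * float(np.mean(ranks <= 1)),
        "R@5": 100.0 * float(np.mean(ranks <= 5)),
        "R@10": 100.0 * float(np.mean(ranks <= 10)),
        "Median Rank": float(np.median(ranks)),
        "Mean Rank": float(np.mean(ranks)),
    }

# Rows are text queries, columns are videos (toy values).
toy_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.7],
])
text_to_video = retrieval_metrics(toy_sim)
video_to_text = retrieval_metrics(toy_sim.T)  # transpose swaps the roles
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank, which is why a Median Rank of 1 corresponds to the top global rank in the table.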