Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

3 Dec 2021 · Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li

In this paper, we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research, which considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) support LAFF as a new baseline for text-to-video retrieval.
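To make the "convex combination" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and this page does not describe LAFF's actual architecture. It assumes the features have already been projected to a shared dimension, and it scores each feature with a single hypothetical learned vector (`attn_vector`). A softmax turns the scores into non-negative weights that sum to 1, so the fused vector is a convex combination of the inputs. All names and numbers below are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; outputs are non-negative and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_fuse(
    features: Sequence[Sequence[float]], attn_vector: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Fuse k same-dimensional features as a convex combination.

    Each feature gets one scalar score (its dot product with attn_vector,
    a hypothetical learned parameter); softmax maps scores to weights.
    """
    scores = [sum(x * w for x, w in zip(f, attn_vector)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
    return fused, weights


# Three toy features in a shared 4-d space (made-up numbers).
features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
attn_vector = [2.0, 1.0, 0.0, 0.0]  # hypothetical; would be learned in practice
fused, weights = convex_fuse(features, attn_vector)
print(weights)  # one weight per input feature
```

In a scheme like this there is one weight per input feature, so the weights can be read directly as a rough indication of how much each feature contributes. This is one way the interpretability mentioned in the abstract could support feature selection.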

Results from the Paper

 Ranked #1 on Ad-hoc video search on TRECVID-AVS20 (V3C1) (using extra training data)

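| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Retrieval | MSR-VTT | LAFF | text-to-video R@1 | 29.1 | #13 |
| Video Retrieval | MSR-VTT | LAFF | text-to-video R@5 | 54.9 | #13 |
| Video Retrieval | MSR-VTT | LAFF | text-to-video R@10 | 65.8 | #11 |
| Video Retrieval | MSR-VTT-1kA | LAFF | text-to-video R@1 | 45.8 | #21 |
| Video Retrieval | MSR-VTT-1kA | LAFF | text-to-video R@5 | 71.5 | #19 |
| Video Retrieval | MSR-VTT-1kA | LAFF | text-to-video R@10 | 82 | #17 |
| Video Retrieval | MSVD | LAFF | text-to-video R@1 | 45.4 | #12 |
| Video Retrieval | MSVD | LAFF | text-to-video R@5 | 76.0 | #10 |
| Video Retrieval | MSVD | LAFF | text-to-video R@10 | 84.6 | #8 |
| Video Retrieval | TGIF | LAFF | text-to-video R@1 | 24.5 | #2 |
| Video Retrieval | TGIF | LAFF | text-to-video R@5 | 45.0 | #2 |
| Video Retrieval | TGIF | LAFF | text-to-video R@10 | 54.5 | #2 |
| Ad-hoc video search | TRECVID-AVS16 (IACC.3) | LAFF | infAP | 0.222 | #1 |
| Ad-hoc video search | TRECVID-AVS17 (IACC.3) | LAFF | infAP | 0.290 | #1 |
| Ad-hoc video search | TRECVID-AVS18 (IACC.3) | LAFF | infAP | 0.147 | #1 |
| Ad-hoc video search | TRECVID-AVS19 (V3C1) | LAFF | infAP | 0.192 | #1 |
| Ad-hoc video search | TRECVID-AVS20 (V3C1) | LAFF | infAP | 0.265 | #1 |
| Video Retrieval | VATEX | LAFF | text-to-video R@1 | 59.1 | #3 |
| Video Retrieval | VATEX | LAFF | text-to-video R@50 | 96.3 | #1 |
| Video Retrieval | VATEX | LAFF | text-to-video R@10 | 91.7 | #4 |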
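
For readers unfamiliar with the R@k columns, the sketch below shows how recall at rank k is commonly computed for text-to-video retrieval. It assumes each text query has exactly one ground-truth video and that the 1-based rank of that video in the retrieved list is already known. The ranks are made-up toy values, not results from the paper, and the helper name `recall_at_k` is illustrative. The infAP metric used in the TRECVID rows is computed differently and is not covered here.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground-truth video is ranked in the top k.

    ranks holds the 1-based rank of the correct video for each text query.
    """
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


# Toy ranks for eight queries (made-up values).
ranks = [1, 3, 12, 1, 7, 2, 40, 5]
r1 = recall_at_k(ranks, 1)
r5 = recall_at_k(ranks, 5)
r10 = recall_at_k(ranks, 10)
print(r1, r5, r10)  # 25.0 62.5 75.0
```

Because the top-k set grows with k, R@k can never decrease as k increases for a fixed set of rankings.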


