Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

21 Nov 2022 · Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David A. Clifton, Jie Chen

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets show that EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into any existing method.
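The decomposition the abstract describes (EM-estimated bases, with features re-expressed as linear combinations of those bases) can be sketched in a few lines of PyTorch. This is a minimal illustration of that idea, not the authors' implementation: the number of bases, the temperature `lam`, and the iteration count are assumptions made for the example.

```python
# Sketch of EM-style feature decomposition for compact representations.
# Assumed hyper-parameters (num_bases, num_iters, lam) are illustrative only.
import torch
import torch.nn.functional as F

def em_decompose(features: torch.Tensor, num_bases: int = 32,
                 num_iters: int = 5, lam: float = 1.0) -> torch.Tensor:
    """Re-express `features` (N x d) as linear combinations of K bases
    (K << N) estimated by EM, so the output lies in a rank-K subspace."""
    n, d = features.shape
    # Initialize K bases randomly on the unit sphere.
    bases = F.normalize(torch.randn(num_bases, d, device=features.device), dim=-1)
    resp = None
    for _ in range(num_iters):
        # E-step: soft-assign each feature to the bases (responsibilities).
        resp = F.softmax(lam * features @ bases.t(), dim=-1)           # (N, K)
        # M-step: update each basis as the responsibility-weighted mean.
        bases = resp.t() @ features                                    # (K, d)
        bases = bases / resp.sum(dim=0, keepdim=True).t().clamp(min=1e-6)
        bases = F.normalize(bases, dim=-1)
    # Reconstruct features as linear combinations of the learned bases.
    return resp @ bases

# Usage: pass pooled video and text embeddings through the same EM module
# before computing contrastive similarities (dummy tensors shown here).
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
compact = em_decompose(torch.cat([video_emb, text_emb], dim=0))  # rank <= 32
```

Because the module only re-expresses features and carries no trained weights of its own, it can be run either inside training or, as the abstract notes, as an out-of-the-box inference-time step.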


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Video Retrieval | ActivityNet | EMCL-Net++ | text-to-video R@1 | 50.6 | #14 |
| Video Retrieval | ActivityNet | EMCL-Net++ | text-to-video R@5 | 78.7 | #10 |
| Video Retrieval | ActivityNet | EMCL-Net++ | text-to-video R@50 | 98.1 | #2 |
| Video Retrieval | ActivityNet | EMCL-Net++ | text-to-video Mean Rank | 1 | #1 |
| Video Retrieval | ActivityNet | EMCL-Net++ | video-to-text R@1 | 50.6 | #6 |
| Video Retrieval | ActivityNet | EMCL-Net++ | video-to-text R@5 | 78.9 | #4 |
| Video Retrieval | ActivityNet | EMCL-Net++ | video-to-text R@50 | 98.4 | #1 |
| Video Retrieval | ActivityNet | EMCL-Net++ | video-to-text Mean Rank | 1 | #1 |
| Video Retrieval | ActivityNet | EMCL-Net | text-to-video R@1 | 41.2 | #22 |
| Video Retrieval | ActivityNet | EMCL-Net | text-to-video R@5 | 72.7 | #19 |
| Video Retrieval | ActivityNet | EMCL-Net | text-to-video Mean Rank | 2 | #2 |
| Video Retrieval | ActivityNet | EMCL-Net | video-to-text R@1 | 42.7 | #11 |
| Video Retrieval | ActivityNet | EMCL-Net | video-to-text R@5 | 74.0 | #9 |
| Video Retrieval | ActivityNet | EMCL-Net | video-to-text R@50 | 98.3 | #2 |
| Video Retrieval | ActivityNet | EMCL-Net | video-to-text Mean Rank | 2 | #2 |
| Video Retrieval | LSMDC | EMCL-Net | text-to-video R@1 | 23.9 | #22 |
| Video Retrieval | LSMDC | EMCL-Net | text-to-video R@5 | 42.4 | #18 |
| Video Retrieval | LSMDC | EMCL-Net | text-to-video R@10 | 50.9 | #18 |
| Video Retrieval | LSMDC | EMCL-Net | video-to-text R@1 | 22.2 | #12 |
| Video Retrieval | LSMDC | EMCL-Net | video-to-text R@5 | 40.6 | #9 |
| Video Retrieval | LSMDC | EMCL-Net | video-to-text R@10 | 49.2 | #9 |
| Video Retrieval | LSMDC | EMCL-Net | video-to-text Mean Rank | 12 | #3 |
| Video Retrieval | LSMDC | EMCL-Net++ | text-to-video R@1 | 25.9 | #15 |
| Video Retrieval | LSMDC | EMCL-Net++ | text-to-video R@5 | 46.4 | #10 |
| Video Retrieval | LSMDC | EMCL-Net++ | text-to-video R@10 | 53.7 | #15 |
| Video Retrieval | LSMDC | EMCL-Net++ | text-to-video Mean Rank | 8 | #2 |
| Video Retrieval | LSMDC | EMCL-Net++ | video-to-text R@1 | 26.7 | #8 |
| Video Retrieval | LSMDC | EMCL-Net++ | video-to-text R@5 | 44.7 | #6 |
| Video Retrieval | LSMDC | EMCL-Net++ | video-to-text R@10 | 54.4 | #6 |
| Video Retrieval | LSMDC | EMCL-Net++ | video-to-text Mean Rank | 8 | #2 |
| Video Captioning | MSR-VTT | EMCL-Net | CIDEr | 54.6 | #20 |
| Video Captioning | MSR-VTT | EMCL-Net | METEOR | 30.2 | #14 |
| Video Captioning | MSR-VTT | EMCL-Net | ROUGE-L | 63.2 | #13 |
| Video Captioning | MSR-VTT | EMCL-Net | BLEU-4 | 45.3 | #14 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | text-to-video R@1 | 46.8 | #28 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | text-to-video R@5 | 73.1 | #24 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | text-to-video R@10 | 83.1 | #22 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | text-to-video Mean Rank | 2 | #2 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | video-to-text R@1 | 46.5 | #18 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | video-to-text R@5 | 73.5 | #16 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | video-to-text R@10 | 83.5 | #16 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net | video-to-text Mean Rank | 2 | #2 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | text-to-video R@1 | 51.6 | #13 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | text-to-video R@5 | 78.1 | #9 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | text-to-video R@10 | 85.3 | #11 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | text-to-video Mean Rank | 1 | #1 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | video-to-text R@1 | 51.8 | #7 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | video-to-text R@5 | 80.2 | #2 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | video-to-text R@10 | 88.0 | #2 |
| Video Retrieval | MSR-VTT-1kA | EMCL-Net++ | video-to-text Mean Rank | 1 | #1 |
| Video Question Answering | MSRVTT-QA | EMCL-Net | Accuracy | 45.8 | #9 |
| Visual Question Answering (VQA) | MSRVTT-QA | EMCL-Net | Accuracy | 0.458 | #13 |
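For reference, the R@K and Mean Rank numbers in the retrieval rows above are conventionally computed from a query-by-candidate similarity matrix as in the sketch below. This is a generic illustration, not the authors' evaluation code; the assumption that the matching video for text i sits on the diagonal is ours.

```python
# Generic retrieval metrics (R@K, Mean/Median Rank) from a similarity matrix.
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """`sim[i, j]` is the similarity of query i to candidate j; the
    ground-truth match for query i is assumed to be candidate i."""
    # Rank (1 = best) of the ground-truth candidate for each query.
    order = np.argsort(-sim, axis=1)
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MeanR": float(ranks.mean()),
        "MdR": float(np.median(ranks)),
    }

sim = np.random.randn(1000, 1000)   # e.g. a 1k-video test split
print(retrieval_metrics(sim))       # video-to-text uses sim.T
```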
