DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives, with the generator being optimized by generation loss and the feature extractor trained with contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, with superior performances, justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet even performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code is available at https://github.com/jpthu17/DiffusionRet.
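The abstract states that the feature extractor is trained with a contrastive loss but does not give its form. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a batch of paired text and video embeddings. The function name `symmetric_contrastive_loss`, the `temperature` value, and the cosine-similarity choice are all assumptions made for this example, not details confirmed by the paper, and the diffusion-based generation loss is not covered here.

```python
import numpy as np

def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Hypothetical symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    # L2-normalise so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # shape (batch, batch)

    def cross_entropy_on_diagonal(z):
        # Numerically stable log-softmax over each row; the target is the diagonal entry.
        z = z - z.max(axis=1, keepdims=True)
        log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_softmax))

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))

# Toy check: matched pairs should give a lower loss than mismatched pairs.
emb = np.eye(3)
aligned = symmetric_contrastive_loss(emb, emb)
mismatched = symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

In this toy setup the loss is close to zero when each text embedding coincides with its paired video embedding, and it grows when the pairs are shuffled. That is the behaviour a discriminative objective of this kind is meant to reward.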

ICCV 2023

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Retrieval | ActivityNet | DiffusionRet+QB-Norm | text-to-video R@1 | 48.1 | #13 |
| | | | text-to-video R@10 | 85.7 | #12 |
| | | | text-to-video Median Rank | 2.0 | #5 |
| | | | text-to-video Mean Rank | 6.8 | #9 |
| | | | video-to-text R@1 | 47.4 | #6 |
| | | | video-to-text R@5 | 76.3 | #5 |
| | | | video-to-text R@10 | 86.7 | #4 |
| | | | video-to-text Median Rank | 2.0 | #2 |
| | | | video-to-text Mean Rank | 6.7 | #8 |
| Video Retrieval | ActivityNet | DiffusionRet | text-to-video R@1 | 45.8 | #17 |
| | | | text-to-video R@5 | 75.6 | #12 |
| | | | text-to-video R@10 | 86.3 | #11 |
| | | | text-to-video Median Rank | 2.0 | #5 |
| | | | text-to-video Mean Rank | 6.5 | #7 |
| | | | video-to-text R@1 | 43.8 | #9 |
| | | | video-to-text R@5 | 75.3 | #7 |
| | | | video-to-text R@10 | 86.7 | #4 |
| | | | video-to-text Median Rank | 2.0 | #2 |
| | | | video-to-text Mean Rank | 6.3 | #5 |
| Video Retrieval | DiDeMo | DiffusionRet | text-to-video R@1 | 46.7 | #26 |
| | | | text-to-video R@5 | 74.7 | #25 |
| | | | text-to-video R@10 | 82.7 | #23 |
| | | | text-to-video Median Rank | 2.0 | #9 |
| | | | text-to-video Mean Rank | 14.3 | #10 |
| | | | video-to-text R@1 | 46.2 | #11 |
| | | | video-to-text R@5 | 74.3 | #6 |
| | | | video-to-text R@10 | 82.2 | #9 |
| | | | video-to-text Median Rank | 2.0 | #5 |
| | | | video-to-text Mean Rank | 10.7 | #10 |
| Video Retrieval | DiDeMo | DiffusionRet+QB-Norm | text-to-video R@1 | 48.9 | #21 |
| | | | text-to-video R@5 | 75.5 | #23 |
| | | | text-to-video R@10 | 83.3 | #22 |
| | | | text-to-video Median Rank | 2.0 | #9 |
| | | | text-to-video Mean Rank | 14.1 | #9 |
| | | | video-to-text R@1 | 50.3 | #7 |
| | | | video-to-text R@5 | 75.1 | #5 |
| | | | video-to-text R@10 | 82.9 | #7 |
| | | | video-to-text Median Rank | 1.0 | #1 |
| | | | video-to-text Mean Rank | 10.3 | #8 |
| Video Retrieval | LSMDC | DiffusionRet | text-to-video R@1 | 24.4 | #17 |
| | | | text-to-video R@5 | 43.1 | #16 |
| | | | text-to-video R@10 | 54.3 | #12 |
| | | | text-to-video Median Rank | 8.0 | #6 |
| | | | text-to-video Mean Rank | 40.7 | #3 |
| | | | video-to-text R@1 | 23.0 | #9 |
| | | | video-to-text R@5 | 43.5 | #6 |
| | | | video-to-text R@10 | 51.5 | #6 |
| | | | video-to-text Median Rank | 9.0 | #4 |
| | | | video-to-text Mean Rank | 40.2 | #4 |
| Video Retrieval | MSR-VTT-1kA | DiffusionRet+QB-Norm | text-to-video R@1 | 48.9 | #20 |
| | | | text-to-video R@5 | 75.2 | #17 |
| | | | text-to-video R@10 | 83.1 | #22 |
| | | | text-to-video Median Rank | 2.0 | #10 |
| | | | text-to-video Mean Rank | 12.1 | #8 |
| | | | video-to-text R@1 | 49.3 | #9 |
| | | | video-to-text R@5 | 74.3 | #12 |
| | | | video-to-text R@10 | 83.8 | #14 |
| | | | video-to-text Median Rank | 2.0 | #7 |
| | | | video-to-text Mean Rank | 8.5 | #10 |
| Video Retrieval | MSR-VTT-1kA | DiffusionRet | text-to-video R@1 | 49.0 | #19 |
| | | | text-to-video R@5 | 75.2 | #17 |
| | | | text-to-video R@10 | 82.7 | #25 |
| | | | text-to-video Median Rank | 2.0 | #10 |
| | | | text-to-video Mean Rank | 12.1 | #8 |
| | | | video-to-text R@1 | 47.7 | #13 |
| | | | video-to-text R@5 | 73.8 | #15 |
| | | | video-to-text R@10 | 84.5 | #9 |
| | | | video-to-text Median Rank | 2.0 | #7 |
| | | | video-to-text Mean Rank | 8.8 | #11 |
| Video Retrieval | MSVD | DiffusionRet | text-to-video R@1 | 46.6 | #17 |
| | | | text-to-video R@5 | 75.9 | #17 |
| | | | text-to-video R@10 | 84.1 | #16 |
| | | | text-to-video Median Rank | 2.0 | #8 |
| | | | text-to-video Mean Rank | 15.7 | #14 |
| | | | video-to-text R@1 | 61.9 | #12 |
| | | | video-to-text R@5 | 88.3 | #8 |
| | | | video-to-text R@10 | 92.9 | #8 |
| | | | video-to-text Median Rank | 1.0 | #1 |
| | | | video-to-text Mean Rank | 4.5 | #8 |
| Video Retrieval | MSVD | DiffusionRet+QB-Norm | text-to-video R@1 | 47.9 | #14 |
| | | | text-to-video R@5 | 77.2 | #14 |
| | | | text-to-video R@10 | 84.8 | #13 |
| | | | text-to-video Mean Rank | 15.6 | #13 |
| | | | video-to-text R@1 | 60.3 | #13 |
| | | | video-to-text R@5 | 86.4 | #10 |
| | | | video-to-text R@10 | 92.0 | #10 |
| | | | video-to-text Median Rank | 1.0 | #1 |
| | | | video-to-text Mean Rank | 4.5 | #8 |
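For readers unfamiliar with the metric names in the results, the sketch below shows how R@K, Median Rank, and Mean Rank are commonly computed from a query-by-candidate similarity matrix. It assumes one ground-truth candidate per query, located on the diagonal, which is a simplification (some benchmarks pair several captions with each video). The function name `retrieval_metrics` and the toy matrix are hypothetical and are not taken from the DiffusionRet code.

```python
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j] is the similarity of query i to candidate j; candidate i is assumed correct for query i."""
    sim = np.asarray(sim, dtype=float)
    correct = np.diag(sim)[:, None]
    # Rank of the correct candidate: 1 plus the number of candidates scored strictly higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@5": 100.0 * np.mean(ranks <= 5),
        "R@10": 100.0 * np.mean(ranks <= 10),
        "Median Rank": float(np.median(ranks)),
        "Mean Rank": float(np.mean(ranks)),
    }

# Toy text-to-video example: queries 0 and 2 rank their match first, query 1 ranks it second.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.7],
])
text_to_video = retrieval_metrics(sim)
video_to_text = retrieval_metrics(sim.T)  # transpose to swap the query and candidate roles
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank, which is why a Median Rank of 1.0 corresponds to the top global rank in the results.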
