Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

Cross-modal retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space. However, their predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of diverse learnable prototypes for each modality to represent the entire semantic subspace. Then Dempster-Shafer Theory and Subjective Logic Theory are utilized to build an evidential theoretical framework by associating evidence with Dirichlet distribution parameters. The PAU model yields accurate uncertainty estimates and reliable predictions for cross-modal retrieval. Extensive experiments are performed on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
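To make the link between evidence and Dirichlet parameters concrete, here is a minimal sketch of the standard Subjective Logic mapping: evidence e_k for each of K classes gives Dirichlet parameters alpha_k = e_k + 1, belief masses b_k = e_k / S, and an uncertainty mass u = K / S, where S is the sum of the alpha_k. The function name `opinion_from_evidence` is hypothetical, and treating each class as one prototype is an assumption for illustration. The abstract does not say how PAU derives evidence from prototype similarities, so this should not be read as the authors' implementation.

```python
def opinion_from_evidence(evidence):
    """Map non-negative evidence for K classes to a Subjective Logic opinion.

    Returns (belief, uncertainty), where sum(belief) + uncertainty == 1.
    Assumption: each entry is the evidence collected for one class
    (for example, one prototype); this is illustrative only.
    """
    if not evidence:
        raise ValueError("evidence must contain at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")

    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]        # Dirichlet parameters
    strength = sum(alpha)                      # Dirichlet strength S
    belief = [e / strength for e in evidence]  # belief mass per class
    uncertainty = num_classes / strength       # leftover uncertainty mass
    return belief, uncertainty


# Strong evidence for one class gives low uncertainty.
confident_belief, confident_u = opinion_from_evidence([9.0, 0.0, 0.0])

# No evidence at all gives maximal uncertainty (u = 1).
vacuous_belief, vacuous_u = opinion_from_evidence([0.0, 0.0, 0.0])
```

Under this mapping, a sample that gathers little total evidence keeps a large uncertainty mass, which matches the abstract's intuition that ambiguous, low-quality data should be flagged as unreliable.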

NeurIPS 2023

Results from the Paper


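| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Retrieval | DiDeMo | PAU | text-to-video R@1 | 48.6 | # 26 |
| Video Retrieval | DiDeMo | PAU | text-to-video R@5 | 76.0 | # 25 |
| Video Retrieval | DiDeMo | PAU | text-to-video R@10 | 84.5 | # 22 |
| Video Retrieval | DiDeMo | PAU | text-to-video Median Rank | 2.0 | # 9 |
| Video Retrieval | DiDeMo | PAU | text-to-video Mean Rank | 12.9 | # 7 |
| Video Retrieval | DiDeMo | PAU | video-to-text R@1 | 48.1 | # 10 |
| Video Retrieval | DiDeMo | PAU | video-to-text R@5 | 74.2 | # 8 |
| Video Retrieval | DiDeMo | PAU | video-to-text R@10 | 85.7 | # 6 |
| Video Retrieval | DiDeMo | PAU | video-to-text Median Rank | 2.0 | # 5 |
| Video Retrieval | DiDeMo | PAU | video-to-text Mean Rank | 9.8 | # 6 |
| Video Retrieval | MSR-VTT-1kA | PAU | text-to-video R@1 | 48.5 | # 24 |
| Video Retrieval | MSR-VTT-1kA | PAU | text-to-video R@5 | 72.7 | # 28 |
| Video Retrieval | MSR-VTT-1kA | PAU | text-to-video R@10 | 82.5 | # 27 |
| Video Retrieval | MSR-VTT-1kA | PAU | text-to-video Median Rank | 2.0 | # 10 |
| Video Retrieval | MSR-VTT-1kA | PAU | text-to-video Mean Rank | 14.0 | # 16 |
| Video Retrieval | MSR-VTT-1kA | PAU | video-to-text R@1 | 48.3 | # 12 |
| Video Retrieval | MSR-VTT-1kA | PAU | video-to-text R@5 | 73.0 | # 18 |
| Video Retrieval | MSR-VTT-1kA | PAU | video-to-text R@10 | 83.2 | # 18 |
| Video Retrieval | MSR-VTT-1kA | PAU | video-to-text Median Rank | 2.0 | # 7 |
| Video Retrieval | MSR-VTT-1kA | PAU | video-to-text Mean Rank | 9.7 | # 15 |
| Video Retrieval | MSVD | PAU | text-to-video R@1 | 47.3 | # 16 |
| Video Retrieval | MSVD | PAU | text-to-video R@5 | 77.4 | # 13 |
| Video Retrieval | MSVD | PAU | text-to-video R@10 | 85.5 | # 13 |
| Video Retrieval | MSVD | PAU | text-to-video Median Rank | 2.0 | # 8 |
| Video Retrieval | MSVD | PAU | text-to-video Mean Rank | 9.6 | # 10 |
| Video Retrieval | MSVD | PAU | video-to-text R@1 | 68.9 | # 8 |
| Video Retrieval | MSVD | PAU | video-to-text R@5 | 93.1 | # 4 |
| Video Retrieval | MSVD | PAU | video-to-text R@10 | 97.1 | # 1 |
| Video Retrieval | MSVD | PAU | video-to-text Median Rank | 1.0 | # 1 |
| Video Retrieval | MSVD | PAU | video-to-text Mean Rank | 2.4 | # 1 |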
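For readers unfamiliar with the metrics above, the following sketch shows how R@K (percentage of queries whose correct match appears in the top K), Median Rank, and Mean Rank are commonly computed from a query-by-candidate similarity matrix. It assumes a one-to-one pairing in which query i matches candidate i, and it resolves ties in favor of the correct match; both are simplifying assumptions. Benchmarks with several captions per video may use a different protocol, and the function name `retrieval_metrics` is hypothetical rather than taken from the PAU code.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, Median Rank, and Mean Rank from a similarity matrix.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank is 1 plus the number of candidates scoring strictly
        # higher than the correct match (ties favor the correct match).
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)

    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy example: 3 text queries against 3 videos.
toy_sim = [
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.8, 0.3, 0.1],  # correct video ranked 2nd
    [0.2, 0.1, 0.7],  # correct video ranked 1st
]
toy_metrics = retrieval_metrics(toy_sim)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank. For video-to-text retrieval, the same function can be applied to the transposed matrix under the same one-to-one assumption.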

Methods