Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes

In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in such systems can better guide users to provide feedback and thus improve retrieval. However, developing explainable vision-language retrieval systems is challenging due to limited labeled multimodal data. In the retrieval of complex scenes, the limited-data issue is even more severe: with multiple objects in a complex scene, a single user query may not exhaustively describe all objects in the desired image, so more labeled queries are needed. Limited labeled data can cause data selection biases and lead models to learn spurious correlations. When they learn spurious correlations, existing explainable models may fail to accurately extract regions from images and keywords from user queries. In this paper, we find that deconfounded learning is an important step toward better vision-language explanations, and we therefore propose a deconfounded explainable vision-language retrieval system. By introducing deconfounded learning into the pretraining of our vision-language model, we reduce spurious correlations through interventions on potential confounders. This yields more accurate representations and, in turn, better explainability. Based on the explainable retrieval results, we propose novel interactive mechanisms in which users can better understand why the system returns particular results and give feedback that effectively improves them. This additional feedback is sample-efficient and thus alleviates the data-limitation problem. Through extensive experiments, our system achieves about 60% improvement over the state-of-the-art.
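To make the deconfounding idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of backdoor-adjustment-style intervention for a vision-language retrieval scorer. The confounder dictionary (e.g., object-class prototypes), module names, and all dimensions are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: backdoor-adjustment-style deconfounding for vision-language
# retrieval. The confounder dictionary and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeconfoundedRetrievalScorer(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512, num_confounders=100):
        super().__init__()
        # Confounder dictionary z (e.g., precomputed object-class prototypes).
        self.confounders = nn.Parameter(torch.randn(num_confounders, img_dim))
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)
        self.attn = nn.Linear(img_dim, num_confounders)

    def intervene(self, img_feats):
        # Approximate E_z[P(y | x, z)] by attending over the confounder dictionary
        # and mixing the retrieved confounder context into each region feature.
        weights = F.softmax(self.attn(img_feats), dim=-1)   # (B, R, K)
        context = weights @ self.confounders                 # (B, R, D_img)
        return img_feats + context                            # intervened region features

    def forward(self, img_feats, txt_feat):
        # img_feats: (B, R, D_img) region features; txt_feat: (B, D_txt) query embedding.
        v = self.img_proj(self.intervene(img_feats)).mean(dim=1)  # (B, joint_dim)
        t = self.txt_proj(txt_feat)                                # (B, joint_dim)
        return F.cosine_similarity(v, t, dim=-1)                   # retrieval score

# Usage: score a batch of 4 images (36 regions each) against their paired queries.
scorer = DeconfoundedRetrievalScorer()
scores = scorer(torch.randn(4, 36, 2048), torch.randn(4, 768))
print(scores.shape)  # torch.Size([4])
```

The per-region attention over the confounder dictionary stands in for averaging over confounder values in the backdoor adjustment; how the paper actually constructs and applies its confounders may differ.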
