28 Sep 2022 • Xiaohan Zou, Changqiao Wu, Lele Cheng, Zhongyuan Wang
Most existing methods in vision-language retrieval match the two modalities in one of three ways: by comparing their global feature vectors, which discards fine-grained information and lacks interpretability; by detecting objects in images or videos and aligning the text with these fine-grained features, which relies on complicated model designs; or by modeling fine-grained interaction via cross-attention over visual and textual tokens, which suffers from poor efficiency.
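The efficiency gap between the first and third paradigms can be sketched numerically: global-vector matching precomputes one embedding per item and scores all pairs with a single matrix product, while token-level cross-attention requires a separate joint pass for every (image, text) pair. The shapes and the scoring function below are illustrative assumptions, not the model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64            # embedding dimension (assumed)
n_img, n_txt = 20, 20
n_tok = 32        # tokens per image / per text (assumed)

# Paradigm 1: global-vector matching.
# One vector per item; all n_img * n_txt scores come from a single matmul.
img_global = rng.standard_normal((n_img, d))
txt_global = rng.standard_normal((n_txt, d))
global_scores = img_global @ txt_global.T            # (n_img, n_txt)

# Paradigm 3: token-level cross-attention.
# Every (image, text) pair needs its own attention computation.
img_tokens = rng.standard_normal((n_img, n_tok, d))
txt_tokens = rng.standard_normal((n_txt, n_tok, d))

def cross_attention_score(v_tok, t_tok):
    # Each text token attends over all visual tokens (scaled dot-product
    # softmax); the pair score is the mean agreement between text tokens
    # and their attended visual context. Illustrative scoring rule only.
    attn = t_tok @ v_tok.T / np.sqrt(d)              # (n_tok, n_tok)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    context = attn @ v_tok                           # (n_tok, d)
    return float((t_tok * context).sum(axis=1).mean())

# Pairwise loop: O(n_img * n_txt) attention passes, versus one matmul above.
pair_scores = np.array([[cross_attention_score(img_tokens[i], txt_tokens[j])
                         for j in range(n_txt)]
                        for i in range(n_img)])

print(global_scores.shape, pair_scores.shape)        # both (20, 20)
```

In a retrieval setting the global embeddings can additionally be indexed offline, whereas the cross-attention scores cannot be precomputed for unseen queries, which is the efficiency drawback the abstract refers to.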