no code implementations • 31 Aug 2022 • Zilun Zhang, Cuifeng Shen, Yuan Shen, Huixin Xiong, Xinyu Zhou
Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, due to the limitation of the CILP-like model's image input size (e. g., 224), subtle details are lost in the feature representation if we input high-resolution images (e. g., 2240).