16 papers with code • 8 benchmarks • 6 datasets
It includes two tasks: (1) Image as Query and Text as Targets; (2) Text as Query and Image as Targets.
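A minimal sketch of how both retrieval directions are typically evaluated, assuming hypothetical `image_encoder` and `text_encoder` functions that map inputs to L2-normalized embeddings of the same dimension (these names are illustrative, not from any specific paper):

```python
# Sketch of bidirectional image-text retrieval via embedding similarity.
import numpy as np

def rank_targets(query_emb: np.ndarray, target_embs: np.ndarray) -> np.ndarray:
    """Return target indices sorted by cosine similarity to the query
    (embeddings are assumed to be L2-normalized)."""
    scores = target_embs @ query_emb   # (num_targets,) dot products
    return np.argsort(-scores)         # highest similarity first

# Task (1): image as query, texts as targets
# ranked_texts = rank_targets(image_encoder(query_image), text_encoder(candidate_texts))

# Task (2): text as query, images as targets
# ranked_images = rank_targets(text_encoder(query_text), image_encoder(candidate_images))
```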
Our approach leverages datasets of images and their sentence descriptions to learn the inter-modal correspondences between language and visual data.
Large-scale pre-training methods that learn cross-modal representations from image-text pairs are becoming increasingly popular for vision-language tasks.
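As an illustrative sketch only (not any specific paper's method), such pre-training commonly uses a symmetric contrastive objective over a batch of paired image and text embeddings, pulling matched pairs together and pushing mismatched pairs apart:

```python
# Generic contrastive image-text pre-training objective (illustrative).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings of paired images and captions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarities
    labels = torch.arange(logits.size(0))          # i-th image matches i-th text
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```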
First, WIT is the largest multimodal dataset by number of image-text examples, roughly 3x larger than the next largest (at the time of writing).
To learn an effective image-text composition for data in the fashion domain, our model introduces two key components, as follows.
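As a rough, generic illustration of what image-text composition means in this setting (the excerpt does not spell out the model's two components, and the gated fusion below is a common design choice, not necessarily this model's), a composition module fuses a reference-image embedding with a modification-text embedding to form a query embedding that is matched against target-image embeddings:

```python
# Generic sketch of an image-text composition module for retrieval (assumption,
# not the specific two components of the model described above).
import torch
import torch.nn as nn

class ComposeImageText(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        """Fuse a reference-image embedding with a modification-text embedding."""
        joint = torch.cat([img_emb, txt_emb], dim=-1)
        # The gate decides how much of the original image feature to keep,
        # while the residual branch injects the text-conditioned change.
        return self.gate(joint) * img_emb + self.residual(joint)
```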