ICM is curated for the image-text matching task. Each image is paired with a caption that describes it in detail. We first use CTR to select the most relevant pairs. Human annotators then perform a second round of manual correction, yielding 400,000 image-text pairs: 200,000 positive and 200,000 negative. We keep the ratio of positive to negative pairs consistent across the train/val/test sets.
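As an illustration of how a consistent positive-to-negative ratio can be maintained across splits, the following is a minimal sketch of a stratified split. The function name `stratified_split`, the `(image_id, caption, label)` tuple layout, and the 80/10/10 split fractions are assumptions made for this example; the actual split sizes and tooling used for ICM are not specified here.

```python
import random
from collections import defaultdict


def stratified_split(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (image_id, caption, label) tuples into train/val/test.

    Each label group is shuffled and cut separately, so every split
    keeps (up to rounding) the same positive/negative ratio as the
    full set. The default fractions are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Group the pairs by their label (1 = positive, 0 = negative).
    by_label = defaultdict(list)
    for pair in pairs:
        by_label[pair[2]].append(pair)

    splits = {"train": [], "val": [], "test": []}
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        splits["train"].extend(group[:n_train])
        splits["val"].extend(group[n_train:n_train + n_val])
        # The remainder goes to test so that no pair is dropped.
        splits["test"].extend(group[n_train + n_val:])
    return splits
```

Calling `stratified_split(pairs)` on a balanced list of labeled pairs returns a dictionary with `train`, `val`, and `test` lists, each of which is balanced as well. Splitting per label rather than shuffling the whole set once avoids the small ratio drift that a purely random split would introduce.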