Learning From Noisy Correspondence With Tri-Partition for Cross-Modal Matching

Due to the high cost of labeling, visual-text datasets inevitably contain a certain proportion of noisy correspondence, which degrades model robustness for cross-modal matching. Although recent methods divide the dataset into clean and noisy pair subsets and achieve promising results, they still suffer from deep neural networks over-fitting to noisy correspondence. In particular, without careful selection, positive pairs with partially relevant semantic correspondence are easily mis-partitioned into the noisy subset, which harms robust learning. Meanwhile, negative pairs with partially relevant semantic correspondence lead to ambiguous distance relations when learning the common space, which also destabilizes performance. To address this coarse-grained dataset division, we propose the Correspondence Tri-Partition Rectifier (CTPR), which partitions the training set into clean, hard, and noisy pair subsets based on the memorization effect of neural networks and prediction inconsistency. We then refine the correspondence labels of each subset to reflect the real semantic correspondence between visual-text pairs. The differences between the rectified labels of anchors and hard negatives are recast as adaptive margins in an improved triplet loss for robust training in a co-teaching manner. To verify the effectiveness and robustness of our method, we conduct experiments on image-text and video-text matching as two showcases. Extensive experiments on the Flickr30K, MS-COCO, MSR-VTT, and LSMDC datasets verify that our method successfully partitions visual-text pairs according to their semantic correspondence and improves performance when training on noisy data.
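
As a rough illustration of how rectified correspondence labels can be recast as an adaptive margin, the sketch below gives one plausible reading of the triplet loss described in the abstract. It is written in PyTorch with hypothetical names (`adaptive_margin_triplet_loss`, `y_hat`, `base_margin`) and is not the authors' released implementation; it assumes a batch-wise similarity matrix and rectified labels in [0, 1].

```python
import torch


def adaptive_margin_triplet_loss(sim, y_hat, base_margin=0.2):
    """Triplet loss whose margin is scaled by rectified label differences.

    sim:   (B, B) similarity matrix, sim[i, j] = s(image_i, text_j).
    y_hat: (B, B) rectified correspondence labels in [0, 1]; the diagonal
           holds anchor labels, off-diagonal entries measure how relevant
           a "negative" caption actually is to the image (assumed layout).
    """
    B = sim.size(0)
    pos = sim.diag().unsqueeze(1)  # s(i, i), shape (B, 1)
    # Margin per (anchor, negative): large when the negative is truly
    # irrelevant, small when it is partially relevant to the anchor.
    margin = base_margin * (y_hat.diag().unsqueeze(1) - y_hat).clamp(min=0)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    # Hinge costs for both retrieval directions (image-to-text, text-to-image).
    cost_i2t = (margin + sim - pos).clamp(min=0)[off_diag]
    cost_t2i = (margin.t() + sim.t() - pos).clamp(min=0)[off_diag]
    return cost_i2t.mean() + cost_t2i.mean()
```

Under this reading, clean pairs keep a label near 1 and noisy pairs near 0, so partially relevant hard negatives receive a reduced margin and are neither pushed away as aggressively as irrelevant negatives nor discarded outright.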

Task: Cross-modal retrieval with noisy correspondence    Model: CTPR-SGR

Dataset: COCO-Noisy
  Metric              Value   Global Rank
  Image-to-text R@1   79.8    #3
  Image-to-text R@5   96.6    #5
  Image-to-text R@10  98.9    #2
  Text-to-image R@1   63.8    #6
  Text-to-image R@5   91.2    #1
  Text-to-image R@10  96.7    #1
  R-Sum               527     #1

Dataset: Flickr30K-Noisy
  Metric              Value   Global Rank
  Image-to-text R@1   76.2    #13
  Image-to-text R@5   95.8    #1
  Image-to-text R@10  98.3    #1
  Text-to-image R@1   60.5    #4
  Text-to-image R@5   85.2    #1
  Text-to-image R@10  92.7    #1
  R-Sum               508.7   #1
