M3TR: Multi-modal Multi-label Recognition with Transformer

ACM MM 2021  ·  Jiawei Zhao, Yifan Zhao, Jia Li

Multi-label image recognition aims to recognize multiple objects simultaneously in one image. Recent approaches to this problem have focused on learning dependencies among label co-occurrences to enhance high-level semantic representations. However, these methods usually neglect the important relations of intrinsic visual structures and have difficulty understanding contextual relationships. To build a global scope of visual context as well as interactions between the visual and linguistic modalities, we propose the Multi-Modal Multi-label recognition TRansformers (M3TR) with ternary relationship learning for inter- and intra-modalities. For the intra-modal relationship, we make an insightful conjunction of CNNs and Transformers, which embeds visual structures into high-level features by learning semantic cross-attention. To construct interactions between the visual and linguistic modalities, we propose a linguistic cross-attention that embeds class-wise linguistic information into visual structure learning, and finally present a linguistic-guided enhancement module to enhance the representation of high-level semantics. Experimental evidence shows that, with the collaborative learning of this ternary relationship, our proposed M3TR achieves new state-of-the-art results on two public multi-label recognition benchmarks.
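To make the cross-modal idea concrete, below is a minimal, illustrative PyTorch sketch of cross-attention from class-wise label embeddings (queries) to CNN feature-map tokens (keys/values), in the spirit of the linguistic cross-attention described above. It is not the authors' released implementation; the module name, feature dimensions, the learned `nn.Embedding` standing in for pretrained label-word vectors, and the per-class classifier head are all assumptions for clarity.

```python
# Illustrative sketch only: generic linguistic cross-attention, not the
# authors' M3TR code. Dimensions and component choices are assumptions.
import torch
import torch.nn as nn

class LinguisticCrossAttention(nn.Module):
    """Attends from class-wise label embeddings (queries) to flattened CNN
    feature-map tokens (keys/values), producing one enhanced feature per
    class for multi-label prediction."""
    def __init__(self, feat_dim=2048, embed_dim=512, num_heads=8, num_classes=80):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)  # project CNN features
        self.label_embed = nn.Embedding(num_classes, embed_dim)    # stand-in for pretrained label vectors
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, 1)                  # one logit per class token

    def forward(self, feats):  # feats: (B, feat_dim, H, W) from a CNN backbone
        x = self.proj(feats).flatten(2).transpose(1, 2)            # (B, H*W, embed_dim) visual tokens
        q = self.label_embed.weight.unsqueeze(0).expand(feats.size(0), -1, -1)  # (B, C, embed_dim)
        out, _ = self.attn(query=q, key=x, value=x)                # class-wise attended features
        return self.classifier(out).squeeze(-1)                    # (B, C) multi-label logits

# Example: a ResNet-style feature map for a batch of 2 images (448x448 -> 14x14 grid)
logits = LinguisticCrossAttention()(torch.randn(2, 2048, 14, 14))
print(logits.shape)  # torch.Size([2, 80])
```

The design choice illustrated here is that using label embeddings as attention queries yields one attended feature per class, so each class logit can be computed from visual evidence gathered specifically for that class rather than from a single pooled image vector.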

| Task                       | Dataset         | Model                                             | Metric | Value | Global Rank |
|----------------------------|-----------------|---------------------------------------------------|--------|-------|-------------|
| Multi-Label Classification | MS-COCO         | M3TR (ImageNet-21K-P pretraining, resolution 448) | mAP    | 87.5  | #17         |
| Multi-Label Classification | PASCAL VOC 2007 | M3TR (448×448)                                    | mAP    | 96.5  | #4          |
