M3TR: Multi-modal Multi-label Recognition with Transformer

ACM MM 2021  ·  Jiawei Zhao, Yifan Zhao, Jia Li

Multi-label image recognition aims to recognize multiple objects simultaneously in one image. Recent approaches to this problem have focused on learning dependencies among label co-occurrences to enhance high-level semantic representations. However, these methods usually neglect the important relations of intrinsic visual structures and have difficulty understanding contextual relationships. To build a global scope of visual context as well as interactions between the visual and linguistic modalities, we propose the Multi-Modal Multi-label recognition TRansformers (M3TR) with ternary relationship learning for inter- and intra-modalities. For the intra-modal relationship, we make an insightful conjunction of CNNs and Transformers, which embeds visual structures into high-level features by learning semantic cross-attention. To construct interactions between the visual and linguistic modalities, we propose a linguistic cross-attention that embeds class-wise linguistic information into visual structure learning, and finally present a linguistic-guided enhancement module to enhance the representation of high-level semantics. Experimental evidence shows that, with the collaborative learning of this ternary relationship, our proposed M3TR achieves new state-of-the-art results on two public multi-label recognition benchmarks.
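To make the cross-modal idea concrete, below is a minimal, illustrative PyTorch sketch of cross-attention from class-wise label embeddings (queries) to CNN feature-map tokens (keys/values), in the spirit of the linguistic cross-attention described above. It is not the authors' released implementation; the module name, feature dimensions, the learned `nn.Embedding` standing in for pretrained label-word vectors, and the per-class classifier head are all assumptions for clarity.

```python
# Illustrative sketch only: generic linguistic cross-attention, not the
# authors' M3TR code. Dimensions and component choices are assumptions.
import torch
import torch.nn as nn

class LinguisticCrossAttention(nn.Module):
    """Attends from class-wise label embeddings (queries) to flattened CNN
    feature-map tokens (keys/values), producing one enhanced feature per
    class for multi-label prediction."""
    def __init__(self, feat_dim=2048, embed_dim=512, num_heads=8, num_classes=80):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)  # project CNN features
        self.label_embed = nn.Embedding(num_classes, embed_dim)    # stand-in for pretrained label vectors
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, 1)                  # one logit per class token

    def forward(self, feats):  # feats: (B, feat_dim, H, W) from a CNN backbone
        x = self.proj(feats).flatten(2).transpose(1, 2)            # (B, H*W, embed_dim) visual tokens
        q = self.label_embed.weight.unsqueeze(0).expand(feats.size(0), -1, -1)  # (B, C, embed_dim)
        out, _ = self.attn(query=q, key=x, value=x)                # class-wise attended features
        return self.classifier(out).squeeze(-1)                    # (B, C) multi-label logits

# Example: a ResNet-style feature map for a batch of 2 images (448x448 -> 14x14 grid)
logits = LinguisticCrossAttention()(torch.randn(2, 2048, 14, 14))
print(logits.shape)  # torch.Size([2, 80])
```

The design choice illustrated here is that using label embeddings as attention queries yields one attended feature per class, so each class logit can be computed from visual evidence gathered specifically for that class rather than from a single pooled image vector.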

| Task                       | Dataset         | Model                                             | Metric | Value | Global Rank |
|----------------------------|-----------------|---------------------------------------------------|--------|-------|-------------|
| Multi-Label Classification | MS-COCO         | M3TR (ImageNet-21K-P pretraining, resolution 448) | mAP    | 87.5  | #17         |
| Multi-Label Classification | PASCAL VOC 2007 | M3TR (448×448)                                    | mAP    | 96.5  | #4          |
