ML-Decoder: Scalable and Versatile Classification Head

25 Nov 2021  ·  Tal Ridnik, Gilad Sharir, Avi Ben-Cohen, Emanuel Ben-Baruch, Asaf Noy

In this paper, we introduce ML-Decoder, a new attention-based classification head. ML-Decoder predicts the existence of class labels via queries, and enables better utilization of spatial data compared to global average pooling. By redesigning the decoder architecture and using a novel group-decoding scheme, ML-Decoder is highly efficient and scales well to thousands of classes. Compared to using a larger backbone, ML-Decoder consistently provides a better speed-accuracy trade-off. ML-Decoder is also versatile: it can be used as a drop-in replacement for various classification heads, and it can generalize to unseen classes when operated with word queries. Novel query augmentations further improve its generalization ability. Using ML-Decoder, we achieve state-of-the-art results on several classification tasks: on MS-COCO multi-label, we reach 91.4% mAP; on NUS-WIDE zero-shot, we reach 31.1% ZSL mAP; and on ImageNet single-label, we reach a new top score of 80.7% with a vanilla ResNet50 backbone, without extra data or distillation. Public code is available at:
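
To make the idea of query-based prediction with group decoding more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the single-head attention, the per-group linear projection, the function names (`cross_attend`, `group_decode_head`), and all sizes are assumptions chosen for illustration. Each group query attends over the backbone's spatial tokens, and the pooled vector is then projected to one logit per class in that group, so the number of queries can stay much smaller than the number of classes.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Single-head attention: one query pools over all spatial tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    return [sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(query))]


def global_average_pool(tokens):
    """Baseline pooling: every spatial token gets the same weight."""
    return [sum(t[d] for t in tokens) / len(tokens)
            for d in range(len(tokens[0]))]


def group_decode_head(tokens, group_queries, group_weights):
    """Return one logit per class (assumed layout: groups are contiguous).

    tokens:        spatial feature vectors from the backbone
    group_queries: one query vector per group
    group_weights: per group, one weight vector per class in that group
    """
    logits = []
    for query, class_weights in zip(group_queries, group_weights):
        pooled = cross_attend(query, tokens)
        logits.extend(dot(w, pooled) for w in class_weights)
    return logits


# Hypothetical sizes: a 7x7 feature map, 4 group queries, 5 classes per group.
rng = random.Random(0)
DIM, NUM_TOKENS, NUM_GROUPS, CLASSES_PER_GROUP = 8, 49, 4, 5


def random_vector():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


tokens = [random_vector() for _ in range(NUM_TOKENS)]
group_queries = [random_vector() for _ in range(NUM_GROUPS)]
group_weights = [[random_vector() for _ in range(CLASSES_PER_GROUP)]
                 for _ in range(NUM_GROUPS)]

logits = group_decode_head(tokens, group_queries, group_weights)
print(len(logits))  # 20 logits from only 4 queries
```

In this sketch, an all-zero query gives uniform attention weights, so `cross_attend` reduces to `global_average_pool`; a non-zero query can instead weight spatial locations differently for each group.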


Results from the Paper

 Ranked #1 on Fine-Grained Image Classification on Stanford Cars (using extra training data)

Task                               Dataset        Model                                   Metric Name         Metric Value  Global Rank
Image Classification               CIFAR-100      Swin-L + ML-Decoder                     Percentage correct  95.1          # 2
Multi-Label Classification         MS-COCO        ML-Decoder(TResNet-XL, resolution 640)  mAP                 91.4          # 1
Multi-Label Classification         MS-COCO        ML-Decoder(TResNet-L, resolution 640)   mAP                 91.1          # 3
Multi-label zero-shot learning     NUS-WIDE       ML-Decoder                              mAP                 31.1          # 1
Multi-Label Classification         OpenImages-v6  TResNet-M                               mAP                 86.8          # 2
Fine-Grained Image Classification  Stanford Cars  TResNet-L + ML-Decoder                  Accuracy            96.41%        # 1
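
Several of the results above are reported as mAP. As a rough illustration of what that metric measures, the sketch below computes a per-class average precision and averages it over classes. The exact protocol used by each benchmark (tie handling, interpolation, treatment of classes with no positives) is not stated here, so this non-interpolated variant and the toy scores are assumptions.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positives contributes 0.0
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """score_matrix[i][c] and label_matrix[i][c] refer to image i and class c."""
    num_classes = len(score_matrix[0])
    per_class = [
        average_precision([row[c] for row in score_matrix],
                          [row[c] for row in label_matrix])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


# Toy data: 4 images, 2 classes (made-up numbers, not from the paper).
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]

# Class 0 AP = (1/1 + 2/3) / 2, class 1 AP = 1.0, so mAP = 11/12.
print(round(mean_average_precision(toy_scores, toy_labels), 4))  # 0.9167
```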

