CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning

26 May 2023 · Zhaoheng Zheng, Haidong Zhu, Ram Nevatia

In this paper, we study Compositional Zero-Shot Learning (CZSL), the task of recognizing novel attribute-object combinations built from pre-existing concepts. Recent work applies large-scale Vision-Language Pre-trained (VLP) models such as CLIP, which offer strong generalization ability. However, these methods treat the pre-trained model as a black box and concentrate on pre- and post-CLIP operations, leaving the semantic concepts encoded between CLIP's internal layers unexploited. We instead dive into the architecture and insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer. We further equip the adapters with concept awareness so that concept-specific features for "object", "attribute", and "composition" can be extracted. We evaluate our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, and achieve state-of-the-art performance on all of them.
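
To make the adapter idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: a standard bottleneck adapter (down-project, nonlinearity, up-project, residual) placed inside an encoder layer, with one adapter per concept so that attribute-, object-, and composition-specific features can be extracted from the same hidden states. Module names, dimensions, and the bottleneck size are illustrative assumptions.

```python
# Hypothetical sketch of a concept-aware intra-layer adapter; not the paper's exact code.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class ConceptAwareAdapters(nn.Module):
    """One adapter per concept, so a single encoder layer can emit
    attribute-, object-, and composition-specific features."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.adapters = nn.ModuleDict({
            concept: Adapter(dim, bottleneck)
            for concept in ("attribute", "object", "composition")
        })

    def forward(self, x: torch.Tensor, concept: str) -> torch.Tensor:
        return self.adapters[concept](x)


if __name__ == "__main__":
    # Toy usage: hidden states from one CLIP encoder layer (batch, tokens, dim).
    hidden = torch.randn(2, 77, 512)
    adapters = ConceptAwareAdapters(dim=512)
    attr_feat = adapters(hidden, "attribute")    # attribute-specific features
    obj_feat = adapters(hidden, "object")        # object-specific features
    comp_feat = adapters(hidden, "composition")  # composition-specific features
    print(attr_feat.shape, obj_feat.shape, comp_feat.shape)
```

Only the adapter parameters would be trained in such a setup; the frozen pre-trained encoder weights are left untouched, which is what makes the approach parameter-efficient.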


Task: Compositional Zero-Shot Learning
Dataset: MIT-States, generalized split
Model: CAILA

Metric            Value   Global Rank
H-Mean            39.9    #1
Seen accuracy     51.0    #1
Test AUC top 1    23.4    #1
Test AUC top 2    -       #2
Test AUC top 3    -       #2
Unseen accuracy   53.9    #1
Val AUC top 1     -       #2
Val AUC top 2     -       #2
Val AUC top 3     -       #2
