MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training

29 Sep 2021  ·  Haoxuan You, Luowei Zhou, Bin Xiao, Noel C Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan

Large-scale multimodal contrastive pre-training has proven useful across a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, a separate encoder is used for each modality. However, recent work suggests that transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate how to build a modality-shared Contrastive Language-Image Pre-training framework (MS-CLIP). More specifically, we ask how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously study architectural design choices that place the proportion of shared parameters along a spectrum. We observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that lightweight modality-specific parallel adapter modules further improve performance. Experimental results show that the proposed MS-CLIP outperforms OpenAI CLIP by a relative 13% in zero-shot ImageNet classification (pre-trained on YFCC100M), while also reducing the number of parameters. In addition, our approach outperforms OpenAI CLIP by 1.6 points on a collection of 19 downstream vision tasks. Furthermore, we discover that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the learning of common semantic structures (e.g., attention patterns) across modalities.
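
For readers unfamiliar with contrastive language-image pre-training, the sketch below illustrates a symmetric contrastive (InfoNCE-style) objective of the kind commonly used in CLIP-like training. It is not taken from the paper: the abstract does not give the exact loss, so the function names and the temperature value of 0.07 are assumptions made for illustration. The loss depends only on the final embeddings, so it is the same whether they come from separate encoders or from a mostly shared one.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    image_embs[i] and text_embs[i] are assumed to be a matched pair;
    every other pairing in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each row should score its own text highest.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
    swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

Matched pairs that lie close together in the shared space give a loss near zero, while mismatched pairs are penalised heavily, which is what pulls the two modalities into a common embedding space.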
